Parser for Wikipedia

 ̄綄美尐妖づ submitted on 2019-12-20 12:14:07

Question


I downloaded a Wikipedia dump and I want to convert the wiki format into my object format. Is there a wiki parser available that converts the object into XML?


Answer 1:


See java-wikipedia-parser. I have never used it, but according to the docs:

The parser comes with an HTML generator. You can however control the output that is being generated by passing your own implementation of the be.devijver.wikipedia.Visitor interface.




Answer 2:


I don't know exactly what the XML format of a Wikipedia dump looks like. But if part of the text is in Wikipedia markup, I suggest looking into http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html. This is one of the classes in the Wikipedia package for Apache Lucene. I haven't used it, but Apache Lucene is a quite mature project, so this package is worth trying even though it is marked experimental.
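
For orientation, here is a minimal sketch of reading a dump before handing the markup to any parser. It assumes the usual MediaWiki export layout, where each `<page>` element holds a `<title>` and a `<revision>` whose `<text>` contains the raw wiki markup. The sample document, the namespace version in it, and the helper names `local_name` and `iter_pages` are made up for illustration; a real dump may use a different export version and has many more fields.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> element in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                # If a page has several revisions, the last one wins.
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the page's children so processed pages do not pile up


# Tiny hand-written sample standing in for a real dump file (assumption).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>'''Example''' is a [[sample]] page.</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

For a real dump you would pass an open file (or a `bz2.open(...)` handle) instead of the `BytesIO` sample. Streaming with `iterparse` avoids loading the whole dump into memory at once.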




Answer 3:


The JWPL parser analyzes the structure of a text with MediaWiki markup and represents it as a Java object. This allows for structured access to the contents of e.g. Wikipedia or Wiktionary. There is no standalone release of the parser, as it is part of the JWPL Wikipedia API release. However, it can be used perfectly without accessing Wikipedia with JWPL.

http://code.google.com/p/jwpl/wiki/JWPLParser




Answer 4:


This might help: a page with converters from MediaWiki to other formats, including DocBook. DocBook is a standard XML-based format that might fit your needs (an XML representation of MediaWiki content).




Answer 5:


You can use a wide range of tools to parse your content. All scripting languages have modules for this. For example, Perl has Text::Markup::Trac, which is the Trac wiki syntax parser for Text::Markup. It generates an HTML file.




Answer 6:


Wiki Parser converts Wikipedia dumps into parsed XML. Might be exactly what you need.
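
To give a rough idea of what "parsed XML" can mean, here is a toy sketch that turns a very small subset of wiki markup into an XML tree. It is not how Wiki Parser works and does not reproduce its output schema: the element names (`article`, `section`, `p`) and the function `wikitext_to_xml` are invented for illustration, and only level-2 headings (`== Title ==`) and plain paragraphs are handled.

```python
import re
import xml.etree.ElementTree as ET

# Matches level-2 headings only, e.g. "== History ==" (deeper levels are not handled).
HEADING = re.compile(r"^==\s*([^=].*?)\s*==$")


def wikitext_to_xml(title, wikitext):
    """Convert a tiny subset of wikitext into an XML tree (toy schema)."""
    article = ET.Element("article", {"title": title})
    current = article  # paragraphs before the first heading attach to the article
    for line in wikitext.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            current = ET.SubElement(article, "section", {"title": match.group(1)})
        else:
            ET.SubElement(current, "p").text = line
    return article


root = wikitext_to_xml("Example", "Intro text.\n\n== History ==\nFounded in 2001.")
xml_string = ET.tostring(root, encoding="unicode")
```

Real wiki markup (templates, tables, nested links, references) is far messier than this, which is why a dedicated parser is usually the better choice.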




Answer 7:


You could try Wikiprep, a Perl Wikipedia parser; check its page.

It outputs many files, some of which are:

1- Wikipedia parsed into XML
2- the cat-hier file, which contains the Wikipedia category hierarchy

I've tried it and it's very useful. Its only problem is that it needs a lot of memory for processing, most probably more than 4 GB of RAM. You can also download a pre-prepared XML version from here, which is also available on the page.



Source: https://stackoverflow.com/questions/3888063/parser-for-wikipedia
