Parser for Wikipedia

Question

I downloaded a Wikipedia dump and want to convert the wiki markup into my own object format. Is there a wiki parser available that converts the markup into XML?

Answer 1:

See java-wikipedia-parser. I have never used it, but according to the docs:

The parser comes with an HTML generator. You can however control the output that is being generated by passing your own implementation of the be.devijver.wikipedia.Visitor interface.
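The visitor idea in that quote can be illustrated without the library. The following is only a sketch: `MarkupVisitor`, `XmlVisitor`, and the toy line-based `parse` method are hypothetical names invented here, not the `be.devijver.wikipedia.Visitor` API, and the parser only understands `== Heading ==` lines and plain paragraphs. It shows how swapping the callback implementation changes the output from HTML to XML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VisitorSketch {

    /** Hypothetical callback interface; NOT the be.devijver.wikipedia.Visitor API. */
    interface MarkupVisitor {
        void heading(int level, String text);
        void paragraph(String text);
    }

    // Matches lines such as "== Plot ==" (2 to 6 equals signs on each side).
    private static final Pattern HEADING =
            Pattern.compile("^(={2,6})\\s*(.+?)\\s*\\1$");

    /** Toy parser: reports headings and plain paragraphs, nothing else. */
    static void parse(String wikiText, MarkupVisitor visitor) {
        for (String line : wikiText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            Matcher m = HEADING.matcher(trimmed);
            if (m.matches()) {
                visitor.heading(m.group(1).length(), m.group(2));
            } else {
                visitor.paragraph(trimmed);
            }
        }
    }

    /** A visitor that emits XML rather than HTML. */
    static class XmlVisitor implements MarkupVisitor {
        private final StringBuilder out = new StringBuilder("<page>");

        public void heading(int level, String text) {
            out.append("<heading level=\"").append(level).append("\">")
               .append(escape(text)).append("</heading>");
        }

        public void paragraph(String text) {
            out.append("<p>").append(escape(text)).append("</p>");
        }

        String result() {
            return out + "</page>";
        }

        // Escape the characters that are unsafe in XML text content.
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }

    static String toXml(String wikiText) {
        XmlVisitor visitor = new XmlVisitor();
        parse(wikiText, visitor);
        return visitor.result();
    }

    public static void main(String[] args) {
        System.out.println(toXml("== Plot ==\nTom & Jerry chase each other."));
    }
}
```

With the real library, the equivalent step would be implementing its `Visitor` interface; check its documentation for the actual method names.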

Answer 2:

I don't know exactly what the XML format of a Wikipedia dump looks like. But if part of the text is in Wikipedia markup, I suggest looking into http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html. It is one of the classes in the Wikipedia package for Apache Lucene. I haven't used it, but Apache Lucene is a quite mature project, so the package is worth trying, even though in this case it is marked experimental.
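As for the dump's XML wrapper, here is a minimal sketch using the JDK's StAX reader. It assumes the usual export layout, where each `<page>` holds a `<title>` and a `<revision>` whose `<text>` element carries the raw wiki markup. `DumpReaderSketch` and `WikiPage` are hypothetical names, and `WikiPage` stands in for whatever object format you need. Pages are handed to a callback one at a time, so the dump is never loaded whole.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DumpReaderSketch {

    /** Hypothetical holder for one article; replace with your own object format. */
    static final class WikiPage {
        final String title;
        final String markup;

        WikiPage(String title, String markup) {
            this.title = title;
            this.markup = markup;
        }
    }

    /** Streams the dump and passes each page to the handler as soon as it is read. */
    static void readPages(Reader dump, Consumer<WikiPage> handler) {
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(dump);
            String title = null;
            while (xml.hasNext()) {
                if (xml.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = xml.getLocalName();
                if ("title".equals(name)) {
                    // Remember the title until the matching <text> shows up.
                    title = xml.getElementText();
                } else if ("text".equals(name)) {
                    // <text> holds the wiki markup of the revision.
                    handler.accept(new WikiPage(title, xml.getElementText()));
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read dump", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<mediawiki><page><title>Foo</title><revision>"
                + "<text>'''Foo''' is a [[bar]].</text></revision></page></mediawiki>";
        readPages(new StringReader(sample),
                page -> System.out.println(page.title + ": " + page.markup));
    }
}
```

The markup string in each `WikiPage` is still raw wiki text; a markup parser or tokenizer such as the ones mentioned in these answers would take over from there.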

Answer 3:

The JWPL parser analyzes the structure of a text with MediaWiki markup and represents it as a Java object. This allows for structured access to the contents of e.g. Wikipedia or Wiktionary. There is no standalone release of the parser, as it is part of the JWPL Wikipedia API release. However, it works perfectly well on its own, without accessing Wikipedia through JWPL.

http://code.google.com/p/jwpl/wiki/JWPLParser

Answer 4:

This might help: a page with converters from MediaWiki to other formats, including DocBook. DocBook is a standard XML-based format that might fit your needs (an XML representation of MediaWiki content).

Answer 5:

You can use a wide range of tools to parse your content. All scripting languages have modules for this. For example, Perl has Text::Markup::Trac, which is the Trac wiki syntax parser for Text::Markup. It generates an HTML file.

Answer 6:

Wiki Parser converts Wikipedia dumps into parsed XML. It might be exactly what you need.

Answer 7:

You could try Wikiprep, a Perl-based Wikipedia parser; check its page.

It outputs many files, including:

1. Wikipedia parsed into XML
2. a cat-hier file, which contains the Wikipedia category hierarchy

I've tried it and it's very useful. Its only problem is that it needs a lot of memory for processing, most probably more than 4 GB of RAM. You can also download a pre-prepared XML version from here, which is also available on the page.

Source: https://stackoverflow.com/questions/3888063/parser-for-wikipedia

Tags

java

mediawiki

nlp

nsxmlparser

wikipedia