How to configure the Solr DataImportHandler to parse a Wikipedia XML document?

Submitted by 房东的猫 on 2019-12-13 04:32:21


So this is what I have done so far.

I have added a request handler in solrconfig.xml as follows:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">wiki-data-config.xml</str>
    </lst>
</requestHandler>

In the same configuration directory I have created a file, wiki-data-config.xml, with the following contents:

    <dataSource type="FileDataSource" encoding="UTF-8" />
        <entity name="page"
                flatten="true" >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
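
The fragment above shows only part of the file: the wrapper elements and most of the entity attributes are missing. Here is a minimal sketch of what a complete wiki-data-config.xml might look like. The processor, stream, forEach, url, and transformer attributes are assumptions (they are not in the fragment), and the url value is a guess based on the file path used in this question. The date pattern uses HH because MediaWiki timestamps use a 24-hour clock.

```xml
<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
        <!-- Assumed attributes: processor, stream, forEach, url and transformer
             are not shown in the original fragment. Point url at your own dump file. -->
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                flatten="true">
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <!-- DateFormatTransformer parses the timestamp with this pattern. -->
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
            <!-- RegexTransformer sets $skipDoc when the page text is a redirect. -->
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
        </entity>
    </document>
</dataConfig>
```

With this layout the handler reads the dump directly from the path in url, so the file does not need to be posted to Solr.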

My schema.xml contains the following:

<!-- Tanny edit starts -->

<field name="id"        type="int"  indexed="true" stored="true" required="true"/>
<field name="title"     type="string"  indexed="true" stored="false"/>
<field name="revision"  type="int"    indexed="true" stored="true"/>
<field name="user"      type="string"  indexed="true" stored="true"/>
<field name="userId"    type="int"     indexed="true" stored="true"/>
<field name="text"      type="text_en"    indexed="true" stored="false"/>
<field name="timestamp" type="date"    indexed="true" stored="true"/>
<field name="titleText" type="text_en"    indexed="true" stored="true"/>
<copyField source="title" dest="titleText"/>

<!-- Tanny edit ends -->

After restarting Solr, I tried to post the Wikimedia XML data using the ./bin/post script as follows:

tanny@localhost:~/binaries/solr-5.2.1$ ./bin/post -c core-base-wiki /home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml

It prints the following to the console:

/usr/lib/jvm/java-7-oracle-cloudera//bin/java -classpath /home/tanny/binaries/solr-5.2.1/dist/solr-core-5.2.1.jar -Dauto=yes -Dc=core-base-wiki -Ddata=files org.apache.solr.util.SimplePostTool /home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/core-base-wiki/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file enwiki-20150702-stub-articles8.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/core-base-wiki/update...
Time spent: 0:00:00.863

However, when I open the admin UI and check the core overview, it shows 0 documents indexed. I cannot tell which configuration I am missing. Any help or guidance will be highly appreciated.

P.S.: The dataset enwiki-20150702-stub-articles8.xml was downloaded from the Wikimedia dumps page. A few sample lines from the document follow:

<mediawiki xmlns="" xmlns:xsi="" xsi:schemaLocation="" version="0.10" xml:lang="en">
    <generator>MediaWiki 1.26wmf11</generator>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      <namespace key="2600" case="first-letter">Topic</namespace>
    <title>700 (number)</title>
      <comment>Disambiguated: [[Tintin]] → [[The Adventures of Tintin]]</comment>
      <text id="669059875" bytes="12464" />
    <title>Canadian federal election, 1957</title>
      <comment>/* Impact */ clarify</comment>
      <text id="671713242" bytes="77788" />
    <title>Professional Players Tournament (snooker)</title>
    <redirect title="World Open (snooker)" />
      <comment>Robot: Fixing double redirect to [[World Open (snooker)]]</comment>
      <text id="360810125" bytes="34" />


The data was indexed once I triggered the import with the command: "curl http://localhost:8983/solr/core-base-wiki/dataimport?command=full-import".

For some reason ./bin/post could not do the same. I have not looked into it further; if anyone has figured out why, please share your findings.
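
One plausible explanation, offered as an assumption rather than a confirmed diagnosis: the console output in the question shows ./bin/post sending the file to the /update handler. That handler expects Solr's own XML update format, not a MediaWiki dump, and it never consults wiki-data-config.xml. Here is a minimal sketch of a message /update would accept, using field names from the schema in the question and made-up values:

```xml
<!-- Hypothetical update message; the id and revision values are invented for illustration. -->
<add>
    <doc>
        <field name="id">1</field>
        <field name="title">700 (number)</field>
        <field name="revision">100</field>
    </doc>
</add>
```

The DataImportHandler, by contrast, reads the dump file itself when a request such as command=full-import reaches /dataimport, which would be consistent with the curl command working.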


You're missing the lib element in solrconfig.xml:

<lib dir="../../../dist" regex="solr-dataimporthandler-.*\.jar" />
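
For context, here is a sketch of how the lib directive and the request handler from the question might sit together in solrconfig.xml. All other settings are omitted, and the dir value is copied from the line above. Relative lib paths are resolved against the core's instance directory, so the number of "../" segments depends on where the core lives and is an assumption here.

```xml
<config>
    <!-- Loads the DataImportHandler jar; adjust dir to match your installation layout. -->
    <lib dir="../../../dist" regex="solr-dataimporthandler-.*\.jar" />

    <!-- Handler definition as given in the question. -->
    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
            <str name="config">wiki-data-config.xml</str>
        </lst>
    </requestHandler>
</config>
```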

