Question
I have configured a dih-import.xml as shown below. The FileListEntityProcessor
walks through some folders and then runs an XPathEntityProcessor entity and a database entity for each file.
When I ran a full import of ~30,000 files, it took almost 3 hours. The DIH debug console showed that 2 DB calls were made for the first file found, 4 for the second, then 6, 8, and so on.
Google didn't turn up anything on this subject, so I am hoping you can help :)
Thanks in advance.
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource
      name="cr-db"
      jndiName="xyz"
      type="JdbcDataSource" />
  <dataSource
      name="cr-xml"
      type="FileDataSource"
      encoding="utf-8" />
  <document name="doc">
    <entity
        dataSource="cr-xml"
        name="f"
        processor="FileListEntityProcessor"
        baseDir="/path/to/xml"
        filename="*.xml"
        recursive="true"
        rootEntity="true"
        onError="skip">
      <entity
          name="xml-data"
          dataSource="cr-xml"
          processor="XPathEntityProcessor"
          forEach="/root"
          url="${f.fileAbsolutePath}"
          transformer="DateFormatTransformer"
          onError="skip">
        <field column="id" xpath="/root/id" />
        <field column="A" xpath="/root/a" />
      </entity>
      <entity
          name="db-data"
          dataSource="cr-db"
          query="
            SELECT
              id, b
            FROM
              a_table
            WHERE
              id = '${f.file}'">
        <field column="B" name="b" />
      </entity>
    </entity>
  </document>
</dataConfig>
EDIT: I found the same problem described via Google, but there is no answer there either: http://osdir.com/ml/solr-user.lucene.apache.org/2010-04/msg00138.html
Another edit:
I updated Solr from 3.6 to 4.1 and ran the importer again. The problem still exists, except that the sub-entities are now called only n times for the n-th file instead of 2n (2, 4, 6, 8, ...).
Answer 1:
If the main issue is the number of hits on the database when you use JdbcDataSource, you could try switching to CachedSqlEntityProcessor.
You may also want to track SOLR-2943, which aims to address exactly your problem, hopefully in the upcoming Solr 4.2.
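As a rough sketch of the CachedSqlEntityProcessor suggestion, the db-data entity from the question might be rewritten as follows. This is untested and rests on several assumptions: that a_table is small enough to hold in memory, that its id values match ${f.file} exactly (including type), and that the cacheKey/cacheLookup attributes are available in your Solr version (older releases express the same lookup as where="id=f.file"). The per-file WHERE clause is removed so that the query runs once and later rows are served from the cache.

```xml
<!-- Sketch only: replaces the db-data entity inside the "f" entity. -->
<entity
    name="db-data"
    dataSource="cr-db"
    processor="CachedSqlEntityProcessor"
    query="SELECT id, b FROM a_table"
    cacheKey="id"
    cacheLookup="f.file">
  <!-- cacheKey: column of this query used as the cache key;
       cacheLookup: parent-entity value to look up for each file -->
  <field column="B" name="b" />
</entity>
```

This reduces the database hits, but it does not by itself explain why the sub-entities are invoked a growing number of times per file, so it may be worth checking the DIH debug output again after the change.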
Source: https://stackoverflow.com/questions/15164166/solr-filelistentityprocessor-is-executing-sub-entities-multiple-times