Solr: FileListEntityProcessor is executing sub entities multiple times

Posted by 烈酒焚心 on 2019-12-13 05:22:09

Question


I have configured a dih-import.xml as shown below. The FileListEntityProcessor walks through some folders and then runs an XPathEntityProcessor entity and a DB entity for each file.

When I ran a full import of ~30,000 files, it took almost 3 hours. The DIH debug console showed that 2 DB calls were made for the first file found, 4 for the second, then 6, 8, and so on.

Google didn't turn up anything on this subject, so I am hoping you can help :)

Thanks in advance

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
    <dataSource 
        name="cr-db"
        jndiName="xyz"
        type="JdbcDataSource" />
    <dataSource 
        name="cr-xml" 
        type="FileDataSource" 
        encoding="utf-8" />


    <document name="doc">
        <entity 
            dataSource="cr-xml" 
            name="f" 
            processor="FileListEntityProcessor" 
            baseDir="/path/to/xml" 
            filename="*.xml" 
            recursive="true" 
            rootEntity="true" 
            onError="skip">
            <entity
                name="xml-data" 
                dataSource="cr-xml" 
                processor="XPathEntityProcessor" 
                forEach="/root" 
                url="${f.fileAbsolutePath}" 
                transformer="DateFormatTransformer" 
                onError="skip">
                <field column="id" xpath="/root/id" /> 

                <field column="A" xpath="/root/a" />
            </entity>

            <entity 
                name="db-data" 
                dataSource="cr-db"
                query="
                    SELECT  
                        id, b
                    FROM 
                        a_table
                    WHERE 
                        id = '${f.file}'">
                <field column="B" name="b" />
            </entity>
        </entity>
    </document>
</dataConfig>

EDIT: I found the same problem reported elsewhere via Google, but there is no answer there either: http://osdir.com/ml/solr-user.lucene.apache.org/2010-04/msg00138.html


Another edit:

I updated Solr from 3.6 to 4.1 and ran the importer again. The problem still exists, except that the sub-entities are now called n times for the n-th file instead of 2n times (2, 4, 6, 8, ...).


Answer 1:


If the main issue is the number of hits on the database when you use JdbcDataSource, you could try switching to CachedSqlEntityProcessor.

You may also want to track SOLR-2943, which aims to address exactly your problem, hopefully in the upcoming Solr 4.2.
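
As a rough illustration of the CachedSqlEntityProcessor suggestion, the "db-data" sub-entity from the question might be rewritten along these lines. This is only a sketch under a few assumptions: that a_table is small enough to be held in memory, that its id values compare equal to the value of f.file, and that the cacheKey/cacheLookup attributes are available in your Solr version (older releases use the where="id=f.file" form instead).

```xml
<!-- Sketch only: replaces the "db-data" sub-entity inside entity "f". -->
<!-- The query no longer filters per file; the rows are fetched once, -->
<!-- cached in memory, and then looked up by key for each file. -->
<entity
    name="db-data"
    dataSource="cr-db"
    processor="CachedSqlEntityProcessor"
    query="SELECT id, b FROM a_table"
    cacheKey="id"
    cacheLookup="f.file">
    <!-- cacheKey: column of this entity used as the cache key. -->
    <!-- cacheLookup: parent value matched against it (assumed to equal id). -->
    <field column="B" name="b" />
</entity>
```

The trade-off is memory for round trips: the database is queried once instead of once (or more) per file. That only helps if the table fits in the heap and the key values match exactly, including their type.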



Source: https://stackoverflow.com/questions/15164166/solr-filelistentityprocessor-is-executing-sub-entities-multiple-times
