Removing duplicate values from a Solr multivalued field

情书的邮戳 2021-02-19 03:13

My Solr index contains a multivalued field with duplicate values. How can I remove the duplicates?

Is it possible to overwrite duplicate values in the multivalued field...

7 answers
  • 2021-02-19 03:37

    Really late to the party, but the top answer did not work for me in Solr 6.0 when attempting to add a duplicate entry to a multivalued field. It was missing a processor right before UniqFieldsUpdateProcessorFactory. Adding something like this to my solrconfig.xml worked:

    <updateRequestProcessorChain name="uniq-fields">
      <processor class="org.apache.solr.update.processor.DistributedUpdateProcessorFactory"/>
      <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
        <str name="fieldName">YourFieldA</str>
        <str name="fieldName">YourFieldB</str>
      </processor>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    Here YourFieldA and YourFieldB are fields defined in your schema.xml. Note that you must also reference the chain from the appropriate requestHandler, i.e.:

    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">uniq-fields</str>
      </lst>
    </requestHandler>

    This not only prevents duplicates from being added, but also removes existing duplicates from the specified fields of a document whenever that document is updated.

  • 2021-02-19 03:41

    In recent versions of Solr you can use add-distinct when doing atomic updates to multivalued fields.

    add-distinct: Adds the specified values to a multiValued field, only if not already present. May be specified as a single value, or as a list.

    (ref: https://lucene.apache.org/solr/guide/8_8/updating-parts-of-documents.html)
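
    As a rough illustration of the atomic update described above, the sketch below builds a JSON update body that uses add-distinct, using only the JDK. The document id doc1, the field name tags, the unique key field id, and the update URL are assumptions made up for this example and are not from the answer. The request is only sent if a URL is passed as the first argument.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.stream.Collectors;

class AddDistinctUpdate {
    // Minimal JSON string escaping (backslashes and quotes only); enough for this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds an atomic-update body of the form
    // [{"id":"<id>","<field>":{"add-distinct":["v1","v2"]}}]
    // Assumes the schema's unique key field is named "id".
    static String buildBody(String id, String field, List<String> values) {
        String array = values.stream()
                .map(AddDistinctUpdate::quote)
                .collect(Collectors.joining(",", "[", "]"));
        return "[{\"id\":" + quote(id) + "," + quote(field)
                + ":{\"add-distinct\":" + array + "}}]";
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document id, field name, and values.
        String body = buildBody("doc1", "tags", List.of("solr", "lucene"));
        System.out.println(body);

        // Only contact Solr when an update URL is supplied, for example
        // http://localhost:8983/solr/mycore/update?commit=true (assumed core name).
        if (args.length > 0) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(args[0]))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }
}
```

    Values already present in the stored field should be skipped by Solr, so sending the same body twice is not expected to grow the field.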

  • 2021-02-19 03:44

    This configuration works for Solr 5.3.1:

    <updateRequestProcessorChain name="distinct-values" default="true">
        <processor class="solr.DistributedUpdateProcessorFactory"/>
        <processor class="solr.UniqFieldsUpdateProcessorFactory">
            <str name="fieldName">field1</str>
            <str name="fieldName">field2</str>
        </processor>
        <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>  
    
  • 2021-02-19 03:51

    I am using SolrJ to bind documents, and to avoid duplicate values I declared my multivalued field as a HashSet:

    @Field("description")
    public Collection<String> description = new HashSet<>();
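
    The de-duplication here comes from the collection type rather than from anything SolrJ does. A minimal JDK-only sketch of that behavior follows, assuming a hypothetical Article bean. The SolrJ @Field annotation is left out so the snippet compiles without the SolrJ jar, and LinkedHashSet is swapped in for HashSet only to keep the printed order predictable.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical bean standing in for the answer's class. In real SolrJ code the
// field would carry @Field("description"); it is omitted so this compiles alone.
class Article {
    // A Set ignores values it already holds, so repeats never reach the document.
    final Collection<String> description = new LinkedHashSet<>();

    void addDescriptions(List<String> values) {
        description.addAll(values);
    }

    public static void main(String[] args) {
        Article article = new Article();
        article.addDescriptions(List.of("red", "blue", "red", "green", "blue"));
        System.out.println(article.description); // prints [red, blue, green]
    }
}
```

    This only de-duplicates the values held by the bean on the client side before the document is sent.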
    
  • 2021-02-19 03:51

    Or you could handle it in Solr, but in an UpdateRequestProcessor, so that it happens before indexing and you don't need to learn about the analysis chain.

    You can use Java or one of a number of scripting languages with the ScriptUpdateProcessor.

  • 2021-02-19 03:54

    I was struggling to accomplish the same thing. This worked for me: add the processor chain below to your solrconfig.xml.

    <updateRequestProcessorChain name="deduplicateMultiValued" default="true">
        <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
            <lst name="fields">
                <str>multivaluedFieldXYZ</str>
            </lst>
        </processor>
        <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
    