Solr: combining EdgeNGramFilterFactory and NGramFilterFactory

我在风中等你 2021-02-09 13:59

I have a situation where I need to use both EdgeNGramFilterFactory and NGramFilterFactory.

I am using NGramFilterFactory to perform a "contains" style search with min

2 Answers
  • 2021-02-09 14:34

    You don't necessarily have to do all of this in the same field. I would create separate fields, each with its own custom field type, so that you can apply each piece of analysis logic independently.

    In the following:

    • text contains the original tokens, minimally processed;
    • text_ngram uses the NGramFilter to produce the "contains" tokens of two or more characters;
    • text_first_letter uses the EdgeNGramFilter to produce the one-character initial-letter tokens.

    If you're processing all text fields in this way, then you might be able to get away with using a copyField to populate the fields. Otherwise, you can instruct your Solr client to send in the same field values for the three separate field types.

    When searching, include all of them in your searches with the qf parameter.

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    
    <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- NGramFilterFactory has no "side" option: it emits grams from every offset in the token -->
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
      </analyzer>
    </fieldType>
    
    <fieldType name="text_first_letter" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
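        <!-- Grams are taken from the front of the token by default; newer Solr versions reject an explicit side attribute -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1"/>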
      </analyzer>
    </fieldType>
    

    Setting up the field and dynamicField definitions is left up to you. Let me know if you have more questions and I can edit with clarifications.
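
    As a minimal sketch of those definitions, assuming a single source field called title (the field names here are hypothetical; only the three type names come from the fieldTypes above), the copyField route could look something like this:

    ```xml
    <!-- Hypothetical field names; each one uses one of the three fieldTypes defined above -->
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="title_ngram" type="text_ngram" indexed="true" stored="false"/>
    <field name="title_first_letter" type="text_first_letter" indexed="true" stored="false"/>

    <!-- copyField passes the raw source value along, so each destination runs its own analyzer -->
    <copyField source="title" dest="title_ngram"/>
    <copyField source="title" dest="title_first_letter"/>
    ```

    With fields like these, a request along the lines of defType=edismax&q=he&qf=title^3 title_first_letter^2 title_ngram would search all three at once. The boosts are arbitrary placeholders and would need tuning against your own data.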

  • 2021-02-09 14:35

    Start by applying the EdgeNGramFilter with min = 1 and max = 1000 (the large max ensures the entire original token is included). Example:

    hello => 'h', 'he', 'hel', 'hell', 'hello'

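    Second, apply the NGramFilter with min = 2 (I will use 2 as the max in the example for simplicity). Note that a plain NGramFilter drops tokens shorter than its minimum, so keeping the single-character 'h' relies on the preserveOriginal option, which as far as I know is only available in newer Solr versions.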

    'h', 'he', 'hel', 'hell', 'hello' => 'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo'

    You will now have several identical tokens, because the NGramFilter has been applied to every "partial" token produced by the EdgeNGramFilter. Simply apply the RemoveDuplicatesTokenFilterFactory to remove them.

    'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo' => 'h', 'he', 'el', 'll', 'lo'

    Your field will now support single-character "startsWith" queries and multi-character "contains" queries.
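
    The chain described above could be sketched as a single field type. This is an illustration under assumptions rather than a tested configuration: the type name text_prefix_contains is made up, preserveOriginal is assumed to be supported (it is documented for Solr 8.x), maxGramSize="15" is an arbitrary cap, and StandardFilterFactory is left out because it was removed in Solr 8. The separate query analyzer is my addition, so that the search term itself is not split into grams.

    ```xml
    <!-- Hypothetical type name; assumes a Solr version that supports preserveOriginal -->
    <fieldType name="text_prefix_contains" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- hello => h, he, hel, hell, hello -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1000"/>
        <!-- preserveOriginal keeps "h", which is shorter than minGramSize and would otherwise be dropped -->
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15" preserveOriginal="true"/>
        <!-- Removes identical tokens that sit at the same position -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <!-- No n-gramming at query time: the typed term is matched as-is against the indexed grams -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    ```

    Bear in mind that stacking both filters multiplies the number of indexed tokens per word, so it is worth checking the output on the Analysis screen of the Solr admin UI before reindexing.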
