Solr facet search truncates words

Question

I have Solr configured for French content. Search works fine, but when I activate faceting, the facet values come back truncated in a particular way.

Every trailing "e" disappears, e.g. automobil instead of automobile, montagn instead of montagne, styl instead of style, homm instead of homme, etc.

<lst name="keywords">
    <int name="automobil">1</int>
    <int name="citroen">1</int>
    <int name="minist">0</int>
    <int name="polit">0</int>
    <int name="pric">0</int>
    <int name="shinawatr">0</int>
    <int name="thailand">0</int>
</lst>

Here is the query: q=fulltextfield:champpions&facet=true&facet.field=keywords

The content of the keywords field:

<arr name="keywords">
    <str>Ski</str>
    <str>sport</str>
    <str>Free style</str>
    <str>automobile</str>
    <str>Rallye</str>
    <str>Citroen</str>
    <str>montagne</str>
</arr>

Here is the schema used:

<fieldtype name="text_fr" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" />
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldtype>

The field definition:

If somebody has an idea about this issue...

Thanks for your answers. Regards, Jerome longet

Answer 1:

Generally, if you want to use a field as a facet, it should be indexed as a string type (solr.StrField), which is not analyzed.

You're faceting on a tokenized and filtered field, so the facet values are the processed terms (lowercased and stemmed) from your keywords field, not the original words.
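
As a minimal sketch of that advice, assuming the keywords field currently uses the text_fr type (its definition is not shown in the question) and using a hypothetical keywords_facet field name, a copyField could feed an unanalyzed string copy that is used only for faceting:

```xml
<!-- Sketch only: "keywords_facet" is a hypothetical name, and the
     "keywords" definition is assumed because the question omits it. -->

<!-- Unanalyzed string type; most example schemas already define it -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Analyzed field, still usable for full-text search -->
<field name="keywords" type="text_fr" indexed="true" stored="true" multiValued="true"/>

<!-- Verbatim copy used for faceting; it does not need to be stored -->
<field name="keywords_facet" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- Copies the raw input value before any analysis is applied -->
<copyField source="keywords" dest="keywords_facet"/>
```

After reindexing, the facet request would target the copy, e.g. facet.field=keywords_facet, and the values should then come back whole (automobile, montagne, Free style) instead of as stemmed terms.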

Answer 2:

Everything said above is correct; I just want to add one thing about facets. The facet values are the indexed terms, not the stored ones. One recommendation for facets is to use a string type, and this is often a good choice. But sometimes you want to do some processing on your facet terms. In that case you can use a text type, but treat the input only lightly. In any case, avoid your choices above of stemming (SnowballPorter) and WordDelimiter.

A good starting point is KeywordTokenizerFactory. You can then use PatternReplace to clean up your terms and input, and apply a TrimFilter at the end. Don't lowercase if your users are going to see the terms.

An example: my input consists of alphabetic language codes. The first PatternReplace removes non-alphabetic characters; the second corrects an input mistake:
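
A lightly analyzed facet type along those lines might look like the sketch below. The type name text_facet and the whitespace-collapsing pattern are illustrative assumptions, not part of the original schema:

```xml
<!-- Sketch only: "text_facet" is a hypothetical type name -->
<fieldType name="text_facet" class="solr.TextField">
  <analyzer>
    <!-- Emit the whole field value as one token, so "Free style" stays intact -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Collapse runs of whitespace into a single space -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\s+"
            replacement=" "
            replace="all"/>
    <!-- Remove leading and trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

There is deliberately no LowerCaseFilterFactory here, so the facet values keep the casing that users will see.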

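  <analyzer>
     <!-- Keep the whole input value as a single token -->
     <tokenizer class="solr.KeywordTokenizerFactory" />
     <filter class="solr.LowerCaseFilterFactory" />
     <!-- Strip every character that is not a lowercase letter -->
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])"
             replacement=""
             replace="all" />
     <!-- Replace the known bad codes "fer" and "xxx" with "und" -->
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="fer|xxx"
             replacement="und"
             replace="all" />
     <!-- Keep only tokens that are exactly three characters long -->
     <filter class="solr.LengthFilterFactory" min="3" max="3" />
  </analyzer>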

Have fun with Solr

Oliver

Source: https://stackoverflow.com/questions/12697897/solr-facet-search-truncate-words

Tags

solr

truncate

faceted-search