Question
I have a Solr instance configured for French content. Search works fine, but when I enable faceting, the facet values are truncated in an odd way.
Every trailing e disappears: for example automobil instead of automobile, montagn instead of montagne, styl instead of style, homm instead of homme, etc.
<lst name="keywords">
<int name="automobil">1</int>
<int name="citroen">1</int>
<int name="minist">0</int>
<int name="polit">0</int>
<int name="pric">0</int>
<int name="shinawatr">0</int>
<int name="thailand">0</int>
</lst>
Here is the query: q=fulltextfield:champpions&facet=true&facet.field=keywords
The content of the keywords field:
<arr name="keywords">
<str>Ski</str>
<str>sport</str>
<str>Free style</str>
<str>automobile</str>
<str>Rallye</str>
<str>Citroen</str>
<str>montagne</str>
</arr>
Here is the schema used:
<fieldtype name="text_fr" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" />
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
</fieldtype>
the field def :
If anybody has an idea about this issue, I would appreciate it.
Thanks for your answers. Regards, Jerome longet
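Answer 1:
Generally, if you want to use a field as a facet, it should be indexed as a string type (solr.StrField) rather than as an analyzed text type.
You're faceting on a tokenized and filtered field, so the facet values are the processed terms from your keywords field. In your case the French Snowball stemmer is what strips the trailing e (automobile becomes automobil).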
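A minimal sketch of that setup, assuming the schema already has a `keywords` field of type `text_fr` (the field definition is not shown in the question). The field name `keywords_facet` is hypothetical, and the `string` type is declared here in case the schema does not already have one:

```xml
<!-- Unanalyzed string type: each value is indexed verbatim as a single term -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Hypothetical facet-only field; not stored because facets read indexed terms -->
<field name="keywords_facet" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- copyField copies the raw input value, before any analysis is applied -->
<copyField source="keywords" dest="keywords_facet"/>
```

After reindexing, the query would facet on the new field (facet.field=keywords_facet) while full-text search keeps using the stemmed field.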
Answer 2:
Everything said above is correct; I just want to add one thing on facets. The facet values are the indexed terms, not the stored ones. One recommendation for facets is to use a string type, and this is often a good choice. But sometimes you want to do some processing on your facet terms. In that case you can use a text type, but treat the input only lightly. In any case, avoid the stemming (SnowballPorter) and WordDelimiter filters you chose above.
A good starting point is KeywordTokenizerFactory; you could use PatternReplaceFilterFactory to clean up your terms and input, and put a TrimFilterFactory at the end. Don't lowercase if your users are going to see the terms.
An example: my input is alphabetic language codes. The first PatternReplace filter strips non-alphabetic characters; the second corrects an input mistake:
`
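<analyzer>
  <!-- Keep the whole input value as a single token -->
  <tokenizer class="solr.KeywordTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <!-- Remove every character that is not a lowercase ASCII letter -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="([^a-z])"
          replacement=""
          replace="all" />
  <!-- Map known bad codes to "und" -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="fer|xxx"
          replacement="und"
          replace="all" />
  <!-- Keep only terms that are exactly three characters long -->
  <filter class="solr.LengthFilterFactory" min="3" max="3" />
</analyzer>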
`
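For free-text keywords like the ones in the question, a lighter chain along the lines described above might be a reasonable starting point. This is only a sketch: the type name `facet_text` is hypothetical, and the whitespace-collapsing pattern is an assumption about what cleanup is wanted.

```xml
<!-- Hypothetical field type for facet values: each input value stays one term -->
<fieldType name="facet_text" class="solr.TextField">
  <analyzer>
    <!-- Emit the entire field value as a single token (no splitting on spaces) -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Collapse runs of whitespace into one space -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <!-- Strip leading and trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

With no lowercasing and no stemming, a value such as "Free style" should come back as one facet entry with its original casing.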
Have fun with solr
Oliver
Source: https://stackoverflow.com/questions/12697897/solr-facet-search-truncate-words