Question
Our current production index is 1.5 TB across 3 shards. We currently use the following field type:
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>
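To make the matching behaviour of this field type concrete, here is a minimal Python sketch of its index and query chains. `CustomNGramFilterFactory` is a custom class whose source is not shown, so the sketch assumes it behaves like a plain n-gram filter that also keeps the original token; the helper names `ngrams`, `index_terms` and `matches` are hypothetical.

```python
def ngrams(token, min_size=3, max_size=30, preserve_original=True):
    """Return every substring of token whose length is in [min_size, max_size].

    Assumption: this mirrors what CustomNGramFilterFactory emits.
    """
    terms = set()
    for start in range(len(token)):
        for size in range(min_size, max_size + 1):
            if start + size > len(token):
                break
            terms.add(token[start:start + size])
    if preserve_original:
        terms.add(token)  # preserveOriginal="true"
    return terms


def index_terms(value):
    # KeywordTokenizer keeps the whole field value as a single token,
    # LowerCaseFilter lowercases it, then the n-gram filter expands it.
    return ngrams(value.lower())


def matches(query, value):
    # Query chain: one lowercased token, no n-gramming.
    return query.lower() in index_terms(value)


print(matches("Sol", "Apache Solr"))  # True: "sol" is an indexed 3-gram
print(matches("so", "Apache Solr"))   # False: shorter than minGramSize
```

Under these assumptions the field behaves like a case-insensitive substring search for queries of 3 to 30 characters.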
This field type works well for our US and other English-language clients. We now have some new Chinese and Japanese clients, so I searched for the best approach to multilingual indexing and found these:
http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/
https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search
There seem to be pros and cons to every approach. I then experimented with a single-field approach, and here is my new field type:
<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>
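One way to probe this field type is a rough Python simulation of its two chains. The sketch rests on several assumptions that should be checked in the Solr analysis screen: `CustomNGramFilterFactory` is treated as a plain n-gram filter; `CJKWidthFilterFactory` is approximated with NFKC normalization, which folds more than the real filter does; and `CJKBigramFilterFactory` is modelled as a pass-through, because as far as I know it only forms bigrams from tokens that StandardTokenizer or ICUTokenizer has typed as CJK, which KeywordTokenizer does not do. The function names are hypothetical.

```python
import unicodedata


def normalize(text):
    # Approximation of CJKWidthFilter followed by LowerCaseFilter.
    # CJKBigramFilter is assumed to leave the keyword token unchanged.
    return unicodedata.normalize("NFKC", text).lower()


def index_terms(value, min_size=3, max_size=30):
    token = normalize(value)
    terms = {token}  # preserveOriginal="true"
    for size in range(min_size, min(max_size, len(token)) + 1):
        for start in range(len(token) - size + 1):
            terms.add(token[start:start + size])
    return terms


def matches(query, value):
    # Query chain: the whole normalized query is a single term.
    return normalize(query) in index_terms(value)


print(matches("東京都", "東京都渋谷区"))  # True: a 3-character substring
print(matches("東京", "東京都渋谷区"))    # False: shorter than minGramSize
print(matches("東京", "東京"))            # True: matches the preserved original
```

If these assumptions hold, two-character queries, which are common for Chinese and Japanese words, only match when they equal the whole field value. That is one scenario worth testing against real data.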
I have kept the same tokenizer and only changed the filters. It works well for all existing searches and use cases on English documents, as well as for the new use cases on Chinese and Japanese documents.
Now I have the following questions for the Solr experts:
- Is this a correct approach, or am I missing something?
- Can you give me an example where this new field type will cause a problem? A use case or scenario with an example would be very helpful.
- Could there be problems in the future as different clients come on board?
Please provide some guidance or a best strategy.
Answer 1:
I had the field type below:
<fieldType name="text_reference" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="front"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="back"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
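Because the second edge n-gram filter runs on the output of the first, the index chain above ends up emitting every substring of length 2 or more (within the first 50 characters). The Python sketch below illustrates that chaining; it is a simplified model, not Solr code, it assumes a Solr version that still accepts `side="back"`, and `edge_ngrams` and `index_terms` are hypothetical helper names.

```python
def edge_ngrams(token, min_size=2, max_size=50, side="front"):
    """Grams anchored at the start (front) or end (back) of token."""
    sizes = range(min_size, min(max_size, len(token)) + 1)
    if side == "front":
        return [token[:n] for n in sizes]
    return [token[-n:] for n in sizes]


def index_terms(value):
    # KeywordTokenizer + LowerCaseFilter: one lowercased token.
    token = value.lower()
    terms = set()
    # The first filter emits prefixes; the second takes the suffixes of each.
    for prefix in edge_ngrams(token, side="front"):
        terms.update(edge_ngrams(prefix, side="back"))
    return terms


print(sorted(index_terms("Solr")))
# ['lr', 'ol', 'olr', 'so', 'sol', 'solr']
```

With `minGramSize="2"`, a two-character query such as 東京 is among the indexed terms for 東京都渋谷区, which may explain why this field type copes with short CJK queries.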
I did not find any issue with it for any language. I have verified it with French, German, Chinese, Japanese, Arabic, Polish, Finnish, and others.
I think the one you are currently using should not have issues with any language either (though I have not run your fieldType through the Solr analysis tool).
If you have found any issue with your current fieldType, "text_ngram", please share it so that I can analyze it further.
Otherwise, I suggest you stay with the current one.
One more thing: if you change the field type, you will have to re-index the existing index, because the schema changes.
Source: https://stackoverflow.com/questions/30108688/solr-multilingual-indexing-with-one-field