I have been trying to get my Solr schema (using Solr 1.3.0) to create terms that are tokenized by whitespace and punctuation. Here are some examples of what I would like to see happen:
terms given -> terms tokenized
foo-bar -> foo,bar
one2three4 -> one2three4
multiple words/and some-punctuation -> multiple,words,and,some,punctuation
I thought that this combination would work:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
  </analyzer>
</fieldType>
The problem is that this results in the following for letter to number transitions:
one2three4 -> one,2,three,4
I have tried various combinations of WordDelimiterFilterFactory settings, but none have proven useful. Is there a filter or tokenizer that can handle what I require?
How about:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0"/>
That should prevent one2three4 from being split at the letter-to-number transitions.
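Put together, the whole field type might look like the sketch below. It reuses the names and attributes from the question and assumes that the WordDelimiterFilterFactory in your Solr version recognizes splitOnNumerics; that is worth checking against 1.3.0, since the attribute may only exist in later releases. The splitOnCaseChange="0" setting is an extra assumption, not something the question asked for. It is there so that camelCase words are left alone too, and you can drop it if you do want them split.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- First split the input on whitespace only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Then split each token on punctuation, but not on
         letter/number transitions (splitOnNumerics="0").
         splitOnCaseChange="0" is an assumption added for illustration. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            splitOnNumerics="0"
            splitOnCaseChange="0"/>
  </analyzer>
</fieldType>
```

With this, foo-bar should still be split into foo and bar at the hyphen, while one2three4 should stay a single term. The analysis page in the Solr admin interface is a quick way to confirm what the field actually produces.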
Source: https://stackoverflow.com/questions/3891054/how-can-i-set-up-solr-to-tokenize-on-whitespace-and-punctuation