Question
I have some items in my index (Solr 4.4) whose names look like Foobar 135g, where 135g is a weight. Searching for foobar or foobar 135 works, but when I search for the exact phrase foobar 135g, nothing is found.
I analysed the query on the "Analysis" page of the Solr admin panel. Everything looks good there: the fields are indexed correctly, the query is split correctly, and I get hits (indicated by the purple background on the tokens).
But there has to be a problem with the way I process the strings at index and/or query time. This is the field definition I'm using:
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
I'm using the two ReverseStringFilterFactory filters together with the EdgeNGramFilterFactory filters so that I can search for foob as well as for bar or obar (strings that appear at the end of an item name). At first I thought the problem had something to do with the WordDelimiterFilterFactory and its catenateWords option. But this option doesn't do anything with tokens that contain numbers (am I right?).
After reading the documentation (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters) I found generateNumberParts, whose default is 1. This leads to 135g being split into 135 and g. But as long as the preserveOriginal option is enabled, 135g is also indexed as a whole string. This is also shown in the Analysis panel of the admin interface.
Does anybody know what kind of filter, tokenizer... is causing this issue?
UPDATE
I've found out something interesting. When I debug the query for the search 135g, I get the following debug output:
<lst name="debug">
<str name="rawquerystring">name_texts:135g</str>
<str name="querystring">name_texts:135g</str>
<str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
<str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>
<lst name="explain"/>
<str name="QParser">LuceneQParser</str>
...
</lst>
I understand that, because of the solr.WordDelimiterFilterFactory mentioned earlier, the string gets split into these parts. But why is Solr converting it into a MultiPhraseQuery? I'm a little bit confused right now; I thought that every single token generated by the solr.WordDelimiterFilterFactory at query time would trigger a separate search (or at least an OR between the tokens).
Please, could someone clear this up for me? I'm kind of confused ;) How can I avoid this?
Answer 1:
It is the WordDelimiterFilterFactory. You should be able to see it in your admin panel under Analysis. To prevent the split, add the attribute splitOnNumerics="0".
Update:
Read more about it here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
solr.WordDelimiterFilterFactory
Creates solr.analysis.WordDelimiterFilter.
Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules:
splitOnNumerics="1" causes alphabet => number transitions to generate a new part [Solr 1.3]: "j2se" => "j" "2" "se". Default is true ("1"); set to 0 to turn off.
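As a rough sketch of what this suggestion would look like when applied to the field type from the question: the only change below is the added splitOnNumerics="0" attribute on both WordDelimiterFilterFactory entries, so that 135g is no longer split at the digit/letter boundary. Whether this resolves the search problem is not confirmed in this thread, and a change to the index analyzer only takes effect for documents that are reindexed afterwards.

```xml
<!-- Sketch only: the fieldType from the question with splitOnNumerics="0" added.
     All other settings are copied unchanged from the question. -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <!-- splitOnNumerics="0": do not start a new part at letter/number transitions -->
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <!-- same setting at query time so both sides produce matching tokens -->
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After such a change, the Analysis page should show 135g as a single token on both the index and the query side.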
Update 2
Based on your latest comment, I now understand what you meant. I took your field type definition and indexed your sentence on Solr 4.5.1, and I was able to search for test_mytext:"foobar 135g", test_mytext:foobar 135g, test_mytext:foobar, test_mytext:135g and test_mytext:135, where test_mytext is of the type you defined in your question above. So I do not know why you cannot find it in your own index. Make sure your field is defined something like this: <field name="text" type="mytext" indexed="true" stored="true"/>
Update 3
Here is my debug log with your field definition. I am not sure why you are seeing completely different processing:
Query => test_mytext:135g
"debug": {
  "rawquerystring": "test_mytext:135g",
  "querystring": "test_mytext:135g",
  "parsedquery": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "parsedquery_toString": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "explain": {
    "200": "\n0.8563627 = (MATCH) product of:\n 1.141817 = (MATCH) sum of:\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.4336574 = (MATCH) weight(test_mytext:135 in 1) [DefaultSimilarity], result of:\n 0.4336574 = score(doc=1,freq=3.0 = termFreq=3.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.94313055 = fieldWeight in 1, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.75 = coord(3/4)\n"
  },
I am using Solr 4.5.1.
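One setting that may explain why an unquoted query such as name_texts:135g is parsed as a MultiPhraseQuery in one setup and as separate OR-ed terms in another is the autoGeneratePhraseQueries attribute of solr.TextField. When it is true, the query parser builds a phrase query from the multiple tokens the analyzer produces for a single query word. Its default depends on the version attribute of schema.xml. That this attribute is the cause here is an assumption and is not confirmed anywhere in this thread. A minimal sketch of switching it off explicitly:

```xml
<!-- Sketch under an assumption: the field type from the question, with phrase
     auto-generation switched off explicitly. The analyzers stay as they are. -->
<fieldType name="text" class="solr.TextField" omitNorms="false"
           autoGeneratePhraseQueries="false">
  <!-- <analyzer type="index"> ... </analyzer> as in the question -->
  <!-- <analyzer type="query"> ... </analyzer> as in the question -->
</fieldType>
```

If the assumption holds, the parsedquery entry in the debug output for name_texts:135g should then list the individual terms, as in the debug log above, instead of a MultiPhraseQuery.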
Update 4
Then I noticed that you are using Solr 4.4.0. I took your exact field definition and phrase, ran a query, and it finds your result.
Query => name_texts:"135g"
Result:
<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">100</str>
    <str name="name_texts">Foobar 135g</str>
    <long name="_version_">1456487722571005952</long>
  </doc>
</result>
<lst name="debug">
<str name="rawquerystring">name_texts:"135g"</str>
<str name="querystring">name_texts:"135g"</str>
<str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
<str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>
Your processing looks correct, and it finds the result in my instance. At first I suspected something extra in your definition, but that does not seem to cause any issue in my local instance. The best way to investigate issues like this is to use the admin Analysis page and debug queries, which you are already doing. I cannot think of anything else, as I am unable to reproduce the problem. Do yourself a favor: take a clean instance of Solr, change only schema.xml to add your field definition, and index just this document through the admin panel (Documents) => {"id":"100","name_texts":"Foobar 135g"}. Then run this query: http://localhost:8983/solr/collection1/select?q=name_texts%3A%22135g%22&wt=xml&indent=true&debugQuery=true
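For reference, the same test document can also be written in Solr's XML update format instead of JSON. This is only a sketch: it assumes that schema.xml in the clean instance declares a name_texts field using the field type from the question, and that id is the unique key, as in the result shown above.

```xml
<!-- Sketch: the test document from above as an XML update message.
     Assumes name_texts is declared in schema.xml with the field type from the question. -->
<add>
  <doc>
    <field name="id">100</field>
    <field name="name_texts">Foobar 135g</field>
  </doc>
</add>
```

The document only becomes visible to searches after a commit.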
Source: https://stackoverflow.com/questions/20884338/solr-cant-search-for-numbers-mixed-with-characters