Solr - synonyms containing multiple words

[亡魂溺海] submitted on 2019-11-29 12:47:18

Question


Quick question: I don't know how to deal with synonyms that contain a space. I have the following config:

The Solr config file

<fieldType ... >
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            catenateWords="1"
            preserveOriginal="1"
            splitOnCaseChange="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateNumbers="1"
            catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30" side="front"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="70"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

My file: syn.txt

st., st => saint
istambul => istanbul
airport, apt => aéroport
NYC => New York
pt., pt => port
brussels => bruxelles

Everything was working fine except for this synonym:

"NYC => New York"

I did some research and found the following:

Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (ie: "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:

The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.

Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect.

This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term.

For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would not match the simple case of "seabiscuit" occurring in a document.

So I tried to change my config file and add my filters at index time, but it is not working.

Does anyone have any ideas?


Answer 1:


You are doing explicit mapping with =>.

The Solr documentation says

Explicit mappings match any token sequence on the LHS of "=>" and replace with all alternatives on the RHS. These types of mappings ignore the expand parameter in the schema.

So I am guessing that if you search for NYC you get nothing back, since it got replaced with New York at index time.

Instead, can you try declaring them as equivalent synonyms? i.e. like NYC, New York instead of NYC => New York.

Then I believe you can search for either of them and the result will be the same.
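
A minimal sketch of that change, assuming the query-side filter from the question stays exactly as it is and only the `syn.txt` entry changes. The entry is shown inside a comment because it lives in a separate file:

```xml
<!-- syn.txt, before (explicit mapping, one direction only):
       NYC => New York
     syn.txt, after (equivalent synonyms, expanded because expand="true"):
       NYC, New York
-->
<filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="true"/>
```

Because the right-hand side still contains two words, the query-time caveats quoted in the question may continue to apply, so this is worth checking against real queries rather than assuming it resolves the issue.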




Answer 2:


The problem is that Solr synonyms tend to cause issues when the first phrase contains fewer words than the second phrase. When this happens, tokens overflow into the positions of other tokens.

I have a workaround for this problem, but it requires using solr.SynonymFilterFactory twice, at both index and query time.

Like this:

<filter class="solr.SynonymFilterFactory" synonyms="multi_word_conversion.txt"
        ignoreCase="true" expand="true"/>

<filter class="solr.SynonymFilterFactory" synonyms="layor_two_syns.txt"
        ignoreCase="true" expand="true"/>

In the first filter you will have: New York => New_York

In the second filter: NYC => New_York

Now a search for New York will return results containing NYC, and vice versa.

On a final note: this method will not work unless it is applied at both index and query time.
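
To make the "both index and query time" requirement concrete, here is a rough sketch of how the two filters might sit in a field type. The field type name `text_syn` and the surrounding tokenizer and lowercase filter are assumptions for illustration, not part of the answer; the two synonym file names are the ones used above.

```xml
<!-- "text_syn" is a hypothetical name; adapt to your own schema. -->
<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Step 1: collapse the multi-word phrase (New York => New_York). -->
    <filter class="solr.SynonymFilterFactory" synonyms="multi_word_conversion.txt"
            ignoreCase="true" expand="true"/>
    <!-- Step 2: map the abbreviation onto the same token (NYC => New_York). -->
    <filter class="solr.SynonymFilterFactory" synonyms="layor_two_syns.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Same two filters, in the same order, so both sides produce the same token. -->
    <filter class="solr.SynonymFilterFactory" synonyms="multi_word_conversion.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="layor_two_syns.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that the query-side filter can only collapse "New York" if the query parser hands both words to the analyzer together (for example in a phrase query), which is the caveat quoted in the question.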




Answer 3:


About

st., st => saint

I think you should write it this way:

st. => saint
st => saint

About

NY => New York

I'm facing a similar issue and came to the conclusion that it happens because parsing (tokenization) is done BEFORE synonym replacement, which likely causes problems with multi-word synonyms. I found that it is possible to specify a tokenizer in the SynonymFilterFactory:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" /> 

I just tested it and got much better results, though not yet the expected ones. Strangely enough, adding KeywordTokenizerFactory seems to have a positive impact, while adding WhitespaceTokenizerFactory or StandardTokenizerFactory doesn't seem to change anything.

BTW, if not using shingles, this should already be fine.




Answer 4:


Building on Pr Shadoko's answer:

Look at how your analyzer works, e.g. with

http://localhost/solr/analysis/field?analysis.fieldvalue=EXAMPLE-KEYWORDS&q=EXAMPLE-KEYWORD%203&analysis.fieldname=EXAMPLEFIELD&analysis.showmatch=true

analysis/field is an out-of-the-box request handler (configured in solrconfig.xml). Its parameter list is in the handler's documentation. ("analysis.query" doesn't work for me, so I had to use "q".)
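
If the handler is not already available in your setup, a registration along these lines in solrconfig.xml is what older example configurations shipped with. Treat it as a sketch: the `startup="lazy"` attribute is optional, and newer Solr versions may register this handler implicitly, so check your own version first.

```xml
<!-- solrconfig.xml: exposes field analysis at the /analysis/field path used in the URL above. -->
<requestHandler name="/analysis/field"
                class="solr.FieldAnalysisRequestHandler"
                startup="lazy"/>
```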

Because the SynonymFilter parses (cuts up) the incoming text BEFORE matching any synonym, multi-word synonyms never get a hit. The trick is to give the SynonymFilter a tokenizer that doesn't actually split anything, the KeywordTokenizer:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />

Anyhow, this approach feels like a hack and I can't estimate the side-effects (scalability, ...) - so be careful!



Source: https://stackoverflow.com/questions/12217024/solr-synonyms-containing-multiple-words
