Multi-word synonym search in Solr

廉价感情. Submitted on 2019-12-04 17:21:11

Yes, sadly this is a well-known problem, caused by the way the Solr query parser splits on whitespace before analysis. Instead of seeing "spider" followed by "man" in the token stream, the analyzer sees each word on its own: just "spider" with nothing before or after it, and just "man" with nothing before or after it.

This is because most Solr query parsers treat a space as basically an "OR". The query is handled as "spider OR man", instead of the parser looking at the full text, analyzing it to generate synonyms, and then generating a query from the result.
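To make the effect concrete, here is a toy Python model (not Solr code) of the two orders of operation. The `PHRASE` entry and the function names are made up for illustration; the only point is that a multi-word entry can never match when each word is analyzed in isolation.

```python
# Toy model only: a single hypothetical multi-word synonym entry.
PHRASE = ("spider", "man")


def contains_phrase(tokens):
    """Return True if PHRASE occurs as consecutive tokens in `tokens`."""
    n = len(PHRASE)
    return any(tuple(tokens[i:i + n]) == PHRASE
               for i in range(len(tokens) - n + 1))


def whole_text_sees_phrase(query):
    # The analyzer receives the full token stream, so adjacency is visible.
    return contains_phrase(query.lower().split())


def per_word_sees_phrase(query):
    # Mimics splitting on whitespace first: each word is analyzed alone.
    return any(contains_phrase([word]) for word in query.lower().split())


print(whole_text_sees_phrase("spider man"))  # True
print(per_word_sees_phrase("spider man"))    # False
```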

For more background, there's this blog post

There's a large number of solutions to this problem, including the following:

  • hon-lucene-synonyms. This plugin runs an analyzer before generating an edismax query over multiple fields. It's a bit of a black box, and I've found it can generate complex query forms that cause odd performance and relevance bugs.
  • LucidWorks' autophrase query parser. By selectively autophrasing, this plugin lets you specify key phrases (spider man) that should not be broken into OR queries and that can have synonym expansion applied.
  • OpenSource Connections' Match query parser. Searches a single field using a query-specified analyzer that is run before the field is searched. Also searches multi-word synonyms as phrases. My favorite, but disclaimer: I'm the author :)
  • Rene Kriegler's Querqy. Querqy is a Solr plugin for query preprocessing rules. These rules can identify your key phrases and rewrite the query to a non-multi-term form.
  • Roll your own. Learn to write your own query parser plugin and handle the problem however you want.

My usual strategy for this kind of problem is to use the synonym filter not to expand a search to include all of the possible synonyms, but to normalize to a single form. I do this in both my index-time and query-time field analysis.

For example, with this line in my fieldType/analyzer block in schema.xml:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>

(Note the expand="false")
...and this line in my synonyms.txt:

spiderman, spider man, Mary Jane => peter

This way I make sure that any of these four values will be indexed and searched as "peter". For example, if the source document mentions "The Amazing Spider Man" it will be indexed as "The Amazing peter". When a user searches for "Mary Jane" it will search for "peter" instead, so it will match.

The important thing here is that because "Mary" is not one of the comma-separated synonyms, it won't be changed if it appears without "Jane" following. So searching for "Mary is amazing" will actually search for "Mary is amazing", and it will not match the document.
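The behavior described above can be approximated outside Solr. The following is a rough Python sketch, not the actual SynonymFilter implementation: `parse_rule` and `normalize` are hypothetical helpers, matching is case-insensitive to mirror ignoreCase="true", and the longest alternative is tried first.

```python
def parse_rule(line):
    """Parse 'a, b c, d => target' into {token tuple: target}, keys lowercased."""
    left, right = line.split("=>")
    target = right.strip()
    return {tuple(alt.lower().split()): target for alt in left.split(",")}


# The same rule as in synonyms.txt above.
RULES = parse_rule("spiderman, spider man, Mary Jane => peter")


def normalize(text, rules=RULES):
    """Replace every matching token sequence with its single normalized form."""
    tokens = text.split()
    longest = max(len(key) for key in rules)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so "spider man" wins over "spider".
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in rules:
                out.append(rules[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no rule starts here, keep the token as-is
            i += 1
    return out


print(normalize("The Amazing Spider Man"))  # ['The', 'Amazing', 'peter']
print(normalize("Mary Jane"))               # ['peter']
print(normalize("Mary is amazing"))         # ['Mary', 'is', 'amazing']
```

Because the same function would run at index time and at query time, "Mary Jane" and "Spider Man" end up as the same token, while "Mary" on its own is left untouched.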

One important detail is that I choose a normalized form (e.g. "peter") that is only one word. I could organize it this way:

peter, spiderman, spider man => Mary Jane

but because Mary Jane is two words, it may (depending on other features of my search) match the two words separately as well as together. By choosing a single-word form to normalize into, I make sure that my tokenizer won't try to break it up.

It's a known limitation within Solr / Lucene. Essentially, you would have to provide an alternative form of tokenization so that specific space-delimited words (i.e. phrases) are treated as single words. One way of achieving this is to do it client side, i.e. in the application that calls Solr: when indexing, keep a list of synonym phrases and find and replace those phrases with an alternative form (for example, by removing the spaces or replacing them with a delimiter that isn't treated as a token boundary).

E.g. if you have "Hello There" as a phrase you want to use in a synonym, then replace it with "HelloThere" when indexing.

Now in your synonyms.txt file you can have (for example):

Hi, HelloThere, Wotcha => Hello

Similarly, when you search, replace any occurrences of "Hello There" in the query string with "HelloThere", and they will then be matched as a synonym of Hello.
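A minimal sketch of that client-side step in Python might look like the following. The `PHRASES` table and the `join_phrases` name are assumptions for illustration; the same function would be applied to document text before indexing and to the query string before it is sent to Solr.

```python
import re

# Hypothetical phrase list kept by the client application:
# phrase as written -> single-token form used in synonyms.txt.
PHRASES = {"Hello There": "HelloThere"}


def join_phrases(text, phrases=PHRASES):
    """Replace each known phrase with its single-token form, ignoring case."""
    for phrase, joined in phrases.items():
        # Allow any run of whitespace between the words of the phrase.
        words = [re.escape(word) for word in phrase.split()]
        pattern = r"\b" + r"\s+".join(words) + r"\b"
        text = re.sub(pattern, joined, text, flags=re.IGNORECASE)
    return text


print(join_phrases("Well, hello   there everyone"))  # Well, HelloThere everyone
```

Running the identical replacement on both sides is what keeps the index and the queries consistent; if only one side is rewritten, the joined token will never match.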

Alternatively, you could use the AutoPhraseTokenFilter that LucidWorks created, available on GitHub. It works by buffering the token stream so that it can work out whether a combination of two or more sequential tokens matches one of the synonym phrases; if it doesn't, the first token is released as not being part of a phrase. I'm not sure how much overhead this adds, but it seems a good approach, and it would be nice to have by default in Solr as part of the SynonymFilter.
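The buffering idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the LucidWorks code: the `PHRASES` set, the `autophrase` name, the lowercasing, and the "_" separator are all assumptions, and the sketch emits a phrase as soon as it matches rather than looking for a longer one.

```python
from collections import deque

# Hypothetical phrase list; each phrase is a tuple of lowercase tokens.
PHRASES = {("spider", "man"), ("mary", "jane")}
# Every proper prefix of a phrase, used to decide whether to keep buffering.
PREFIXES = {phrase[:n] for phrase in PHRASES for n in range(1, len(phrase))}


def autophrase(tokens, sep="_"):
    """Yield tokens, merging any buffered run that matches a known phrase."""
    buffer = deque()
    for token in tokens:
        buffer.append(token.lower())
        while buffer:
            current = tuple(buffer)
            if current in PHRASES:
                yield sep.join(current)  # whole buffer is a phrase
                buffer.clear()
            elif current in PREFIXES:
                break                    # could still become a phrase, wait
            else:
                yield buffer.popleft()   # first token cannot start a phrase here
    while buffer:                        # flush whatever is left at the end
        yield buffer.popleft()


print(list(autophrase("The Amazing Spider Man".split())))
# ['the', 'amazing', 'spider_man']
print(list(autophrase(["Spider", "web"])))
# ['spider', 'web']
```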
