I had much success building my own little search with Elasticsearch in the background, but there is one thing I couldn't find in the documentation.
I'm indexing band names such as "The The", which the stopwords filter would remove entirely.
You can use the synonym filter to convert "The The" into a single token, e.g. thethe, which won't be removed by the stopwords filter.
First, configure the analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "syn" : {
               "synonyms" : [
                  "the the => thethe"
               ],
               "type" : "synonym"
            }
         },
         "analyzer" : {
            "syn" : {
               "filter" : [
                  "lowercase",
                  "syn",
                  "stop"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'
Then test it with the string "The The The Who":
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+The+Who&analyzer=syn'
{
   "tokens" : [
      {
         "end_offset" : 7,
         "position" : 1,
         "start_offset" : 0,
         "type" : "SYNONYM",
         "token" : "thethe"
      },
      {
         "end_offset" : 15,
         "position" : 3,
         "start_offset" : 12,
         "type" : "<ALPHANUM>",
         "token" : "who"
      }
   ]
}
"The The"
has been tokenized as "the the"
, and "The Who"
as "who"
because the preceding "the"
was removed by the stopwords filter.
Which brings us back to the question of whether we should include stopwords at all. You said:
I know I can ignore the stop words list completely
but this is not what I want since the results searching
for other bands like "the who" would explode.
What do you mean by that? Explode how? Index size? Performance?
Stopwords were originally introduced to improve search engine performance by removing common words which are likely to have little effect on the relevance of a query. However, we've come a long way since then. Our servers are capable of much more than they were back in the 80s.
Indexing stopwords won't have a huge impact on index size. For instance, indexing the word "the" means adding a single term to the index. You already have thousands of terms; indexing the stopwords as well won't make much difference to size or to performance.
Actually, the bigger problem is that "the" is very common and thus will have a low impact on relevance, so a search for "The The concert Madrid" will prefer Madrid over the other terms.
This can be mitigated by using a shingle filter, which would result in these tokens:
['the the','the concert','concert madrid']
While "the" may be common, "the the" isn't, and so will rank higher.
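As an aside, the shingle token filter emits the original single words as well as the two-word shingles by default, so the bigram-only list above assumes unigram output is switched off. A minimal sketch of such a filter, using the standard output_unigrams setting (the index name test_bigrams and the filter/analyzer name bigrams_only are illustrative, not part of the original setup):

curl -XPUT 'http://127.0.0.1:9200/test_bigrams/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "bigrams_only" : {
               "type" : "shingle",
               "output_unigrams" : false
            }
         },
         "analyzer" : {
            "bigrams_only" : {
               "filter" : [
                  "lowercase",
                  "bigrams_only"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

The mapping below simply uses the built-in shingle filter with its defaults, which also emits the unigrams; that is harmless here because the unshingled field is queried alongside it.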
You wouldn't query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (without stopwords) with a query against the shingled field.
We can use a multi-field to analyze the text field in two different ways:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
   "mappings" : {
      "test" : {
         "properties" : {
            "text" : {
               "fields" : {
                  "shingle" : {
                     "type" : "string",
                     "analyzer" : "shingle"
                  },
                  "text" : {
                     "type" : "string",
                     "analyzer" : "no_stop"
                  }
               },
               "type" : "multi_field"
            }
         }
      }
   },
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "no_stop" : {
               "stopwords" : "",
               "type" : "standard"
            },
            "shingle" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "shingle"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'
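To check what the shingle analyzer produces, you can run the same kind of _analyze request as before against it (output omitted; with the default shingle settings you should see the single words alongside the two-word shingles, e.g. the, the the, the concert, concert, concert madrid, madrid):

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+concert+Madrid&analyzer=shingle'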
Then use a multi_match query to query both versions of the field, giving the shingled version a higher boost (i.e. more relevance). In this example, text.shingle^2 means that we want to boost that field by a factor of 2:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
   "query" : {
      "multi_match" : {
         "fields" : [
            "text",
            "text.shingle^2"
         ],
         "query" : "the the concert madrid"
      }
   }
}
'
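To try this end to end, you could index a sample document, refresh, and rerun the query above (the document ID 1 and the text are just examples):

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '
{
   "text" : "The The concert in Madrid"
}
'

curl -XPOST 'http://127.0.0.1:9200/test/_refresh?pretty=1'

The document matches the plain text field on all the words (no stopwords are removed there), and the shingled field adds the higher-boosted match on "the the".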