Elastic synonym usage in aggregations

Question

Situation:

Elastic version used: 2.3.1

I have an Elasticsearch index configured like this:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "british,english",
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  }
}

This works well for queries: when I search with the term "english" or "queen", I also get the documents matching "british" and "monarch". But when I use a synonym term in a filter aggregation, it doesn't work. For example:

My index has 5 documents: 3 of them contain monarch and 2 of them contain queen.

POST /my_index/_search
{
  "size": 0,
  "query" : {
      "match" : {
         "status.synonym":{
            "query": "queen",
            "operator": "and"
         }
      }
   },
     "aggs" : {
        "status_terms" : {
            "terms" : { "field" : "status.synonym" }
        },
        "monarch_filter" : {
            "filter" : { "term": { "status.synonym": "monarch" } }
        }
    },
   "explain" : 0
}

The result produces:

Total hits: 5 doc count (as expected, great!)
Status terms: 5 doc count for queen (as expected, great!)
Monarch filter: 0 doc count

I have tried different synonym filter configurations:

queen,monarch
queen,monarch => queen
queen,monarch => queen,monarch

None of these changed the results. I was tempted to conclude that synonyms can only be used at query time, but if the terms aggregation works, why shouldn't the filter aggregation? Hence I think my synonym filter configuration is wrong. A more extensive synonym filter example can be found here.

QUESTION:

How do I use/configure synonyms in a filter aggregation?

Example to replicate the case above:

1. Create and configure the index:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "wlh,wellhead=>wellwell"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}

PUT my_index/_mapping/job
{
  "properties": {
    "title":{
      "type": "string",
      "analyzer": "my_synonyms"
    }
  }
}

2. Put two documents:

PUT my_index/job/1
{
    "title":"wellhead smth else"
}

PUT my_index/job/2
{
    "title":"wlh other stuff"
}

3. Execute a search on wlh, which should return 2 documents, with a terms aggregation that should show 2 documents for wellwell and a filter aggregation that should not have a count of 0:

POST my_index/_search
{
  "size": 0,
  "query" : {
      "match" : {
         "title":{
            "query": "wlh",
            "operator": "and"
         }
      }
   },
     "aggs" : {
        "wlhAggs" : {
            "terms" : { "field" : "title" }
        },
        "wlhFilter" : {
            "filter" : { "term": { "title": "wlh"     } }
        }
    },
   "explain" : 0
}

The result of this query is:

{
   "took": 8,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "wlhAggs": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "wellwell",
               "doc_count": 2
            },
            {
               "key": "else",
               "doc_count": 1
            },
            {
               "key": "other",
               "doc_count": 1
            },
            {
               "key": "smth",
               "doc_count": 1
            },
            {
               "key": "stuff",
               "doc_count": 1
            }
         ]
      },
      "wlhFilter": {
         "doc_count": 0
      }
   }
}

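And that's my problem: wlhFilter should have a doc count of at least 1.

Answer 1:

I'm short on time, so if needed I can elaborate a bit more later today or tomorrow. But the following should work (note that the mapping uses the "text" type and "fielddata": true, which are Elasticsearch 5.x syntax; on 2.x, use "type": "string" and drop "fielddata"):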

DELETE /my_index
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "british,english",
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_synonyms",
          "fielddata": true
        }
      }
    }
  }
}

POST my_index/test/1
{
  "title" : "the british monarch"
}

GET my_index/_search
{
  "query": {
    "match": {
      "title": "queen"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "title": "queen"
    }
  }, 
  "aggs": {
    "queen_filter": {
      "filter": {
        "term": {
          "title": "queen"
        }
      }
    },
    "monarch_filter": {
      "filter": {
        "term": {
          "title": "monarch"
        }
      }
    }
  }
}

Could you share the mapping you have defined for your status.synonym field?

EDIT: V2

Your filter returns 0 because a term filter in Elasticsearch never goes through an analysis phase. It is meant for exact matches against the tokens stored in the index.

The token 'wlh' in your filter is not translated to 'wellwell', so it is looked up as-is and does not occur in the inverted index: at index time, every 'wlh' was replaced by 'wellwell'. To achieve what you want, you will have to index the data into a separate field and adjust your filter accordingly.
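One way to check the index-time translation described above is the _analyze API. This is only a sketch, assuming the index my_index and the analyzer my_synonyms from the question already exist:

GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "wlh"
}

If the response lists wellwell as the only token, then 'wlh' itself was never written to the index for that field, and a term filter on 'wlh' has nothing to match.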

To index the data into a separate field, you could try something like:

DELETE my_index
PUT /my_index
{
  "settings": {
    "number_of_shards": 1, 
    "number_of_replicas": 0, 
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "wlh,wellhead=>wellwell"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "job": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "synonym": {
              "type": "string",
              "analyzer": "my_synonyms"
            }
          }
        }
      }
    }
  }
}

PUT my_index/job/1
{
    "title":"wellhead smth else"
}

PUT my_index/job/2
{
    "title":"wlh other stuff"
}

POST my_index/_search
{
  "size": 0,
  "query": {
    "match": {
      "title.synonym": {
        "query": "wlh",
        "operator": "and"
      }
    }
  },
  "aggs": {
    "wlhAggs": {
      "terms": {
        "field": "title.synonym"
      }
    },
    "wlhFilter": {
      "filter": {
        "term": {
          "title": "wlh"
        }
      }
    }
  }
}

Output:

{
  "aggregations": {
    "wlhAggs": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "wellwell",
          "doc_count": 2
        },
        {
          "key": "else",
          "doc_count": 1
        },
        {
          "key": "other",
          "doc_count": 1
        },
        {
          "key": "smth",
          "doc_count": 1
        },
        {
          "key": "stuff",
          "doc_count": 1
        }
      ]
    },
    "wlhFilter": {
      "doc_count": 1
    }
  }
}

Hope this helps!!

Answer 2:

With the help of @Byron Voorbach and his comments, this is my solution:

I created a separate field that uses the synonym analyzer, as opposed to having a property field (mainfield.property).
Most importantly, the problem was that my synonyms were contracted! I had, for example, british,english => uk. Changing that to british,english,uk solved my issue, and the filter aggregation now returns the right number of documents.
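As a rough sketch of the difference, assuming the filter name my_synonym_filter from the question: the contracted rule british,english => uk replaces both words with uk at index time, so a term filter on british finds nothing. The expanded form keeps all three tokens in the index:

"filter": {
  "my_synonym_filter": {
    "type": "synonym",
    "synonyms": [
      "british,english,uk"
    ]
  }
}

With the default expand behaviour, each of the three words is indexed as all three, so a term filter on any of them can match.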

Hope this helps someone, or at least points in the right direction.

Edit: Oh lord, praise the documentation! I completely fixed my issue with the Filters (with an S!) aggregation (link here). In the filters configuration I specified a match query and it worked! I ended up with something like this:

"aggs" : {
  "messages" : {
    "filters" : {
      "filters" : {
        "status" :  { "match" : { "cats.saurus" : "monarch" }},
        "country" : { "match" : { "cats.saurus" : "british" }}
      }
    }
  }
}
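The same idea should also work with the single filter aggregation, because a match query analyzes its input with the field's analyzer while a term query does not. A minimal sketch, assuming the reproduction index from the question, where title is analyzed with my_synonyms:

POST my_index/_search
{
  "size": 0,
  "aggs": {
    "wlhFilter": {
      "filter": {
        "match": { "title": "wlh" }
      }
    }
  }
}

Here 'wlh' is run through the synonym analyzer before the lookup, so it is searched as 'wellwell' and the bucket should count both documents.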

Source: https://stackoverflow.com/questions/46640993/elastic-synonym-usage-in-aggregations

Tags: ElasticSearch, filter, analyzer, synonym