Elasticsearch - How to get popular words list of documents

离开以前 2021-01-31 21:18

I have a temporary index with documents that I need to moderate. I want to group these documents by the words they contain.

For example, I have these documents:

2 Answers
  • 2021-01-31 21:36

    It might be because this question and the accepted answer are some years old, but now there is a better way.

    The accepted answer does not take into account the fact that the most common words are usually uninteresting, e.g. stopwords such as "the", "a", "in", "for" and so on.

    This is usually the case for fields that contain data of type text and not keyword.

    This is why Elasticsearch has an aggregation designed specifically for this purpose, the Significant Text Aggregation (significant_text).
    From the docs:

    • It is specifically designed for use on type text fields
    • It does not require field data or doc-values
    • It re-analyzes text content on-the-fly meaning it can also filter duplicate sections of noisy text that otherwise tend to skew statistics.

    It can, however, take longer than other kinds of queries, so it is best used after narrowing the data with a match query, or nested under a sampler aggregation.

    So, in your case you would send a query like this (leaving out the filtering/sampling):

    {
        "aggs": {
            "keywords": {
                "significant_text": {
                    "field": "myfield"
                }
            }
        }
    }
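
The request above leaves out the filtering and sampling step. As a rough sketch of how the pieces could fit together, the Python snippet below builds a request body that first narrows the documents with a match query, then wraps significant_text in a sampler aggregation. The helper name build_keywords_request, the aggregation name "sample", the search phrase, and the shard_size value of 100 are illustrative assumptions, not part of the original answer. The snippet only builds and prints the JSON body and does not contact a cluster.

```python
import json


def build_keywords_request(field, match_text, shard_size=100):
    """Build a search body: match query -> sampler -> significant_text.

    The function name, the "sample" aggregation name, and the default
    shard_size are hypothetical choices made for this sketch.
    """
    return {
        "size": 0,  # return aggregation results only, no hits
        "query": {"match": {field: match_text}},
        "aggs": {
            "sample": {
                # Limit the aggregation to the top-scoring docs on each shard.
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "keywords": {
                        "significant_text": {
                            "field": field,
                            # Ask Elasticsearch to skip repeated boilerplate text.
                            "filter_duplicate_text": True,
                        }
                    }
                },
            }
        },
    }


body = build_keywords_request("myfield", "some search phrase")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the index; the keyword buckets then appear under aggregations.sample.keywords in the response.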
    
  • 2021-01-31 21:42

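    A simple terms aggregation will meet your needs:

    (where mydata is the name of your field; note that search_type=count was removed in Elasticsearch 5.0, where "size": 0 in the request body serves the same purpose)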

    curl -XGET 'http://localhost:9200/test/data/_search?search_type=count&pretty' -d '{
      "query": {
        "match_all": {}
      },
      "aggs": {
        "mydata_agg": {
          "terms": {"field": "mydata"}
        }
      }
    }'
    

    will return:

    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 3,
        "max_score" : 0.0,
        "hits" : [ ]
      },
      "aggregations" : {
        "mydata_agg" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [ {
            "key" : "aaa",
            "doc_count" : 3
          }, {
            "key" : "fff",
            "doc_count" : 3
          }, {
            "key" : "bbb",
            "doc_count" : 2
          }, {
            "key" : "ccc",
            "doc_count" : 1
          }, {
            "key" : "ffffd",
            "doc_count" : 1
          }, {
            "key" : "eee",
            "doc_count" : 1
          }, {
            "key" : "hhh",
            "doc_count" : 1
          }, {
            "key" : "mmm",
            "doc_count" : 1
          }, {
            "key" : "xxx",
            "doc_count" : 1
          } ]
        }
      }
    }
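
To make the buckets in this response easier to read, here is a small local sketch of what doc_count means: the number of documents that contain a term at least once, not the total number of occurrences. It is an approximation in plain Python, not how Elasticsearch computes the result. The sample documents, the terms_buckets helper, the whitespace tokenization (standing in for the field's analyzer), and the alphabetical tie-breaking are all assumptions made for illustration.

```python
from collections import Counter


def terms_buckets(docs, size=10):
    """Return terms-aggregation-like buckets for a list of text documents."""
    counts = Counter()
    for text in docs:
        # set() so a term repeated within one document is counted once,
        # mirroring doc_count (documents per term, not occurrences).
        counts.update(set(text.lower().split()))
    # Highest doc_count first; ties broken alphabetically (an assumption
    # made here only to keep the output deterministic).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]


docs = ["aaa bbb aaa", "aaa ccc", "bbb aaa"]  # hypothetical documents
for bucket in terms_buckets(docs):
    print(bucket)
# {'key': 'aaa', 'doc_count': 3}
# {'key': 'bbb', 'doc_count': 2}
# {'key': 'ccc', 'doc_count': 1}
```

Note that "aaa" appears twice in the first document but still contributes only 1 to its doc_count.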
    