I have a temporary index with documents that I need to moderate. I want to group these documents by the words they contain.
For example, I have these documents:
It might be because this question and the accepted answer are some years old, but now there is a better way.
The accepted answer does not take into account the fact that the most common words are usually uninteresting, e.g. stopwords such as "the", "a", "in", "for" and so on.
This is usually the case for fields that contain data of type text and not keyword.
This is why Elasticsearch has an aggregation specifically for this purpose, called the Significant Text aggregation.
According to the docs, it is designed specifically for text fields.
It can, however, take longer than other kinds of queries, so it is suggested to use it only after filtering the data with a query (for example a match query), or to nest it under an aggregation of type sampler.
So, in your case you would send a query like this (leaving out the filtering/sampling):
{
  "aggs": {
    "keywords": {
      "significant_text": {
        "field": "myfield"
      }
    }
  }
}
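The filtering and sampling left out above could look roughly like the following. This is only a sketch: the helper name build_keyword_query, the aggregation name "sample", the search text and the shard_size of 100 are all placeholders chosen for illustration, and the script only builds and prints the request body rather than sending it anywhere.

```python
import json


def build_keyword_query(field, match_text, shard_size=100):
    """Build a search body: match query -> sampler -> significant_text.

    The helper name and the default shard_size are illustrative assumptions.
    """
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "query": {"match": {field: match_text}},
        "aggs": {
            "sample": {
                # limit the analysis to the top-scoring documents per shard
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "keywords": {"significant_text": {"field": field}}
                },
            }
        },
    }


body = build_keyword_query("myfield", "some search text")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request. With this nesting, the keywords buckets appear under the "sample" aggregation in the response.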
A simple terms aggregation will meet your needs (where mydata is the name of your field):
curl -XGET 'http://localhost:9200/test/data/_search?search_type=count&pretty' -d '{
  "query": {
    "match_all" : {}
  },
  "aggs" : {
    "mydata_agg" : {
      "terms": {"field" : "mydata"}
    }
  }
}'
will return:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "mydata_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "aaa",
        "doc_count" : 3
      }, {
        "key" : "fff",
        "doc_count" : 3
      }, {
        "key" : "bbb",
        "doc_count" : 2
      }, {
        "key" : "ccc",
        "doc_count" : 1
      }, {
        "key" : "ffffd",
        "doc_count" : 1
      }, {
        "key" : "eee",
        "doc_count" : 1
      }, {
        "key" : "hhh",
        "doc_count" : 1
      }, {
        "key" : "mmm",
        "doc_count" : 1
      }, {
        "key" : "xxx",
        "doc_count" : 1
      } ]
    }
  }
}
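To turn a response like this into groups for moderation, one option is to keep only the words that occur in more than one document. The sketch below works on a shortened copy of the response above. The helper name shared_terms and the min_docs threshold are assumptions made for illustration, not part of Elasticsearch.

```python
import json

# Shortened copy of the response shown above (only a few buckets kept).
response = json.loads("""
{
  "aggregations": {
    "mydata_agg": {
      "buckets": [
        {"key": "aaa", "doc_count": 3},
        {"key": "fff", "doc_count": 3},
        {"key": "bbb", "doc_count": 2},
        {"key": "ccc", "doc_count": 1}
      ]
    }
  }
}
""")


def shared_terms(resp, agg_name, min_docs=2):
    """Return {term: doc_count} for terms found in at least min_docs documents."""
    buckets = resp["aggregations"][agg_name]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets if b["doc_count"] >= min_docs}


shared = shared_terms(response, "mydata_agg")
print(shared)  # {'aaa': 3, 'fff': 3, 'bbb': 2}
```

Each key that survives the filter can then be used in a follow-up query on the same field to fetch the documents that belong to that group.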