Group by multiple columns in Elasticsearch with a specific result set

Asked by 遥遥无期 on 2021-01-22 00:44

I am new to ES and I have a specific requirement; my document looks like this:

{
    "_index" : "bidder_server_stats",
    "_type" : "doc",
    "_id"
1 Answer

  • Answered 2021-01-22 01:46

    You can achieve this with sub-aggregations. Starting from ES 6.1, the composite aggregation can also come in handy (although it is still experimental).

    The query might look like this:

    POST bidder_server_stats/doc/_search
    {
      "size": 0,
      "aggs": {
        "by bc": {
          "terms": {
            "field": "bc"
          },
          "aggs": {
            "by log_dt_st": {
              "terms": {
                "field": "log_dt_st"
              },
              "aggs": {
                "sum(avg_price)": {
                  "sum": {
                    "field": "avg_price"
                  }
                }
              }
            }
          }
        }
      }
    }
    

    And the response would look like this:

    {
      ...
      "aggregations": {
        "by bc": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "2513",
              "doc_count": 2,
              "by log_dt_st": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                  {
                    "key": 1528439776073,
                    "key_as_string": "2018-06-08T06:36:16.073Z",
                    "doc_count": 2,
                    "sum(avg_price)": {
                      "value": 16.800000190734863
                    }
                  }
                ]
              }
            },
            {
              "key": "2514",
              "doc_count": 1,
              "by log_dt_st": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                  {
                    "key": 1528439776073,
                    "key_as_string": "2018-06-08T06:36:16.073Z",
                    "doc_count": 1,
                    "sum(avg_price)": {
                      "value": 10
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
    

    A few points to consider:

    • bc should be mapped as keyword type (so that a terms aggregation can be run on it)
    • a terms aggregation returns only the top 10 buckets by default; you may be interested in the size and order options of this aggregation
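
    For instance, to return up to 100 bc buckets sorted by the summed price, the outer terms aggregation could be extended like this (a sketch only; the bucket count of 100 and the sub-aggregation name total_price are my own choices, not from the question):

    POST bidder_server_stats/doc/_search
    {
      "size": 0,
      "aggs": {
        "by bc": {
          "terms": {
            "field": "bc",
            "size": 100,
            "order": { "total_price": "desc" }
          },
          "aggs": {
            "total_price": {
              "sum": {
                "field": "avg_price"
              }
            }
          }
        }
      }
    }

    Note that order can reference a sub-aggregation by name, which is what sorts the buckets by the computed sum rather than by document count.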

    Update: Responding to the questions from the comments since it will improve the answer.

    Can we have some more fields in the result set without adding them to the aggregations?

    No, not directly. As with a SQL GROUP BY, every field returned must be either part of the grouping key or the result of an aggregate function.

    There are a few options to actually get more data alongside the aggregations:

    • the search results themselves (the hits part of the response);
    • the top_hits aggregation, which returns a few of the most relevant documents for each bucket.
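
    As an illustration, a top_hits sub-aggregation could carry extra fields into each bucket. This is a sketch only; the field server_name is a hypothetical example, not a field from the question:

    POST bidder_server_stats/doc/_search
    {
      "size": 0,
      "aggs": {
        "by bc": {
          "terms": {
            "field": "bc"
          },
          "aggs": {
            "sample_docs": {
              "top_hits": {
                "size": 1,
                "_source": ["server_name", "avg_price"]
              }
            }
          }
        }
      }
    }

    Each bucket then contains a small hits section with the requested _source fields of its top document.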

    How many sub-aggregations can we add?

    I cannot find any relevant documentation or configuration setting that would give a definitive answer. However, there is an index.max_docvalue_fields_search setting that defaults to 100 in Dynamic index settings. Since aggregations use doc_values, I'd say that around 100 bucket aggregations is a reasonable upper limit.

    I believe that the limitation here is the actual performance of your Elasticsearch cluster.

    Can we have all resulting fields in the same bucket?

    It can be done, but it might not be efficient. You may use the script mode of the terms aggregation. The query might look like this:

    POST bidder_server_stats/doc/_search
    {
      "size": 0,
      "aggs": {
        "via script": {
          "terms": {
            "script": {
              "source": "doc['bc'].value +':::'+ doc['log_dt_st'].value ",
              "lang": "painless"
            }
          },
          "aggs": {
            "sum(avg_price)": {
              "sum": {
                "field": "avg_price"
              }
            }
          }
        }
      }
    }
    

    And the result will look like the following:

    {
      ...
      "aggregations": {
        "via script": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "2513:::2018-06-08T06:36:16.073Z",
              "doc_count": 2,
              "sum(avg_price)": {
                "value": 16.800000190734863
              }
            },
            {
              "key": "2514:::2018-06-08T06:36:16.073Z",
              "doc_count": 1,
              "sum(avg_price)": {
                "value": 10
              }
            }
          ]
        }
      }
    }
    

    In order to perform this aggregation, Elasticsearch has to compute the bucket value for each document matching the query, which is the equivalent of a full scan in SQL. Plain field-based aggregations, in contrast, are more like index look-ups: they use the doc_values data representation, a data structure that makes such look-ups efficient.

    In some cases script buckets can be a solution, but their scope is quite limited. If you are interested in a script-based approach, there is also the scripted metric aggregation to consider.

    Hope that helps!

    Update: Since ES 6.1 this is possible with the composite aggregation

    The composite aggregation was added in Elasticsearch 6.1. As of 6.3 it is still marked as experimental (so the API might change, or this feature could be removed completely in the future).

    The query in this case would look like:

    POST bidder_server_stats/doc/_search
    {
      "size": 0,
      "aggs": {
        "my composite": {
          "composite": {
            "sources": [
              {
                "bc": {
                  "terms": {
                    "field": "bc"
                  }
                }
              },
              {
                "log_dt_st": {
                  "terms": {
                    "field": "log_dt_st"
                  }
                }
              }
            ]
          },
          "aggs": {
            "sum(avg_price)": {
              "sum": {
                "field": "avg_price"
              }
            }
          }
        }
      }
    }
    

    And the response:

    {
      "aggregations": {
        "my composite": {
          "after_key": {
            "bc": "2514",
            "log_dt_st": 1528439776073
          },
          "buckets": [
            {
              "key": {
                "bc": "2513",
                "log_dt_st": 1528439776073
              },
              "doc_count": 2,
              "sum(avg_price)": {
                "value": 16.800000190734863
              }
            },
            {
              "key": {
                "bc": "2514",
                "log_dt_st": 1528439776073
              },
              "doc_count": 1,
              "sum(avg_price)": {
                "value": 10
              }
            }
          ]
        }
      }
    }
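
    A practical advantage of the composite aggregation is pagination: the after_key from one response can be fed back via the after parameter to fetch the next page of buckets. A sketch, reusing the after_key values from the response above (the page size of 10 is an arbitrary choice):

    POST bidder_server_stats/doc/_search
    {
      "size": 0,
      "aggs": {
        "my composite": {
          "composite": {
            "size": 10,
            "after": {
              "bc": "2514",
              "log_dt_st": 1528439776073
            },
            "sources": [
              { "bc": { "terms": { "field": "bc" } } },
              { "log_dt_st": { "terms": { "field": "log_dt_st" } } }
            ]
          },
          "aggs": {
            "sum(avg_price)": {
              "sum": { "field": "avg_price" }
            }
          }
        }
      }
    }

    Repeating this until the response contains no buckets lets you iterate over all combinations, something a plain terms aggregation cannot do reliably.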
    