ElasticSearch Nest. better code for terms aggregation and its iteration

执笔经年 2021-01-17 04:31

I'd like to fetch a list of unique numeric user IDs in a given period.

Let's say the field is userId and the time field is startTime, I successful

1 answer
  •  梦毁少年i
    2021-01-17 04:47

    This approach may be fine for some data sets, but a couple of observations:

    1. The Cardinality Aggregation uses the HyperLogLog++ algorithm to approximate cardinality; the approximation can be exact for low-cardinality fields but becomes less accurate as cardinality grows.
    2. The Terms Aggregation may be computationally expensive for many terms, as each bucket needs to be built in memory and then serialized to the response.
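
    To make the first observation concrete: the Cardinality Aggregation accepts a precision_threshold setting (40000 is the highest supported value) that trades memory for accuracy. The sketch below assumes the same NEST version as the snippets in this answer; the Question class and the ApproximateUniqueAskers helper are hypothetical stand-ins, not part of the original answer.

    ```csharp
    using System;
    using Nest;

    // Hypothetical minimal document type; the real Question class is not shown in the answer.
    public class Question
    {
        public int? OwnerUserId { get; set; }
        public DateTime CreationDate { get; set; }
    }

    public static class CardinalitySketch
    {
        // Returns the approximate number of distinct OwnerUserId values in the date range.
        public static double? ApproximateUniqueAskers(IElasticClient client, DateTime from, DateTime to)
        {
            var response = client.Search<Question>(s => s
                .Size(0) // no hits needed, only the aggregation result
                .Query(q => q
                    .DateRange(c => c
                        .Field(p => p.CreationDate)
                        .GreaterThan(from)
                        .LessThan(to)
                    )
                )
                .Aggregations(a => a
                    .Cardinality("unique_user_count", c => c
                        .Field(f => f.OwnerUserId)
                        .PrecisionThreshold(40000) // counts below this are expected to be close to exact
                    )
                )
            );

            return response.Aggs.Cardinality("unique_user_count").Value;
        }
    }
    ```

    Even at the highest threshold the result remains an estimate once the true count is well above it, so it is better suited to sizing a request than to producing the final list of IDs.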

    You can probably skip the Cardinality Aggregation used to get the size, and simply pass int.MaxValue as the size for the Terms Aggregation. An alternative that would be slower is to scroll through all documents in the range, using source filtering to return only the field that you're interested in. I would expect the Scroll approach to put less pressure on the cluster, but I would recommend monitoring whichever approach you take.

    Here's a comparison of the two approaches on the Stack Overflow data set (taken June 2016, IIRC), looking at unique question askers between 2 years ago today and a year ago today.

    Terms Aggregation

    // Needs the System, System.Linq, Nest and Elasticsearch.Net namespaces.
    void Main()
    {
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    
        var connectionSettings = new ConnectionSettings(pool)
            .MapDefaultTypeIndices(d => d
                .Add(typeof(Question), NDC.StackOverflowIndex)
            );
    
    
        var client = new ElasticClient(connectionSettings);
    
        var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2);
        var yearAgo = DateTime.UtcNow.Date.AddYears(-1);
    
        var searchResponse = client.Search<Question>(s => s
            .Size(0)
            .Query(q => q
                .DateRange(c => c.Field(p => p.CreationDate)
                    .GreaterThan(twoYearsAgo)
                    .LessThan(yearAgo)
                )
            )
            .Aggregations(a => a
                .Terms("unique_users", c => c
                    .Field(f => f.OwnerUserId)
                    .Size(int.MaxValue)
                )
            )
        );
    
        var uniqueOwnerUserIds = searchResponse.Aggs.Terms("unique_users").Buckets.Select(b => b.KeyAsString).ToList();
    
        // 3.83 seconds
        // unique question askers: 795352
        Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}");
    }
    

    Scroll API

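    // Needs the System, System.Collections.Generic, System.Linq, Nest and Elasticsearch.Net namespaces.
    void Main()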
    {
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    
        var connectionSettings = new ConnectionSettings(pool)
            .MapDefaultTypeIndices(d => d
                .Add(typeof(Question), NDC.StackOverflowIndex)
            );
    
        var client = new ElasticClient(connectionSettings);
        var uniqueOwnerUserIds = new HashSet<int>();
    
        var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2);
        var yearAgo = DateTime.UtcNow.Date.AddYears(-1);
    
        var searchResponse = client.Search<Question>(s => s
            .Source(sf => sf
                .Include(ff => ff
                    .Field(f => f.OwnerUserId)
                )
            )
            .Size(10000)
            .Scroll("1m")
            .Query(q => q
                .DateRange(c => c
                    .Field(p => p.CreationDate)
                    .GreaterThan(twoYearsAgo)
                    .LessThan(yearAgo)
                )
            )
        );
    
        while (searchResponse.Documents.Any())
        {
            foreach (var document in searchResponse.Documents)
            {
                if (document.OwnerUserId.HasValue)
                    uniqueOwnerUserIds.Add(document.OwnerUserId.Value);
            }
    
            searchResponse = client.Scroll<Question>("1m", searchResponse.ScrollId);
        }
    
        client.ClearScroll(c => c.ScrollId(searchResponse.ScrollId));
    
        // 91.8 seconds
        // unique question askers: 795352
        Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}");
    }
    

    On this data set, the Terms Aggregation (3.83 seconds) is ~24 times faster than the Scroll API approach (91.8 seconds).
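
    The answer does not say how the timings were taken. As one possible way to reproduce such a comparison, the sketch below times an action with Stopwatch and computes the ratio from the two quoted figures; TimingSketch, Measure and Speedup are hypothetical names introduced only for illustration.

    ```csharp
    using System;
    using System.Diagnostics;

    public static class TimingSketch
    {
        // Runs the action once and returns the elapsed wall-clock time.
        public static TimeSpan Measure(Action action)
        {
            var stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();
            return stopwatch.Elapsed;
        }

        // Ratio of the slower duration to the faster one.
        public static double Speedup(TimeSpan slower, TimeSpan faster)
        {
            return slower.TotalSeconds / faster.TotalSeconds;
        }

        public static void Main()
        {
            // Figures quoted in the answer: 91.8 s for Scroll, 3.83 s for the Terms Aggregation.
            var ratio = Speedup(TimeSpan.FromSeconds(91.8), TimeSpan.FromSeconds(3.83));
            Console.WriteLine($"speedup: {ratio:F1}x");
        }
    }
    ```

    A single run is only a rough indication; caching and cluster load can shift either number, so repeated measurements would give a fairer comparison.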
