I'd like to fetch a list of unique numeric user IDs in a given period.
Let's say the field is userId and the time field is startTime
, I successful
This approach may be OK for some data sets, but a couple of observations:
You can probably skip the Cardinality Aggregation to get the size, and simply pass int.MaxValue as the size for the Terms Aggregation.
An alternative approach, less efficient in terms of speed, would be to scroll through all documents in the range, using source filtering to return only the field that you're interested in. I would expect the Scroll approach to put less pressure on the cluster, but I would recommend monitoring whichever approach you take.
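The client-side part of the Scroll approach, collecting IDs page by page into a set, can be looked at in isolation. The following is only a minimal sketch with no Elasticsearch calls: the class name ScrollDedupSketch, the helper CollectUniqueIds, and the in-memory "pages" are all hypothetical stand-ins for scroll responses, and the data is made up.

```csharp
using System;
using System.Collections.Generic;

public static class ScrollDedupSketch
{
    // Hypothetical helper: folds pages of nullable owner IDs into a set of unique IDs.
    public static HashSet<int> CollectUniqueIds(IEnumerable<int?[]> pages)
    {
        var uniqueIds = new HashSet<int>();
        foreach (var page in pages)
        {
            foreach (var ownerUserId in page)
            {
                // Skip documents with no owner; HashSet.Add ignores duplicates.
                if (ownerUserId.HasValue)
                    uniqueIds.Add(ownerUserId.Value);
            }
        }
        return uniqueIds;
    }

    public static void Main()
    {
        // Made-up data: user 1 appears on both pages, and one document has no owner.
        var pages = new[]
        {
            new int?[] { 1, 2, null },
            new int?[] { 1, 3 }
        };

        var unique = CollectUniqueIds(pages);
        Console.WriteLine($"unique IDs: {unique.Count}"); // prints "unique IDs: 3"
    }
}
```

Because the same user can show up on many pages, the set has to live on the client for the whole scroll, so memory use there grows with the number of unique IDs.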
Here's a comparison of the two approaches on the Stack Overflow data set (taken June 2016, IIRC), looking at unique question askers between 2 years ago today and a year ago today.
// Terms aggregation approach.
// LINQPad-style script: assumes the Nest and Elasticsearch.Net namespaces are imported,
// and that the Question type and NDC.StackOverflowIndex are defined elsewhere.
void Main()
{
    var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    var connectionSettings = new ConnectionSettings(pool)
        .MapDefaultTypeIndices(d => d
            .Add(typeof(Question), NDC.StackOverflowIndex)
        );

    var client = new ElasticClient(connectionSettings);

    var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2);
    var yearAgo = DateTime.UtcNow.Date.AddYears(-1);

    // Size(0): return no hits, only the aggregation results
    var searchResponse = client.Search<Question>(s => s
        .Size(0)
        .Query(q => q
            .DateRange(c => c
                .Field(p => p.CreationDate)
                .GreaterThan(twoYearsAgo)
                .LessThan(yearAgo)
            )
        )
        .Aggregations(a => a
            .Terms("unique_users", c => c
                .Field(f => f.OwnerUserId)
                .Size(int.MaxValue) // ask for every bucket rather than the top N
            )
        )
    );

    // Each bucket key is one distinct OwnerUserId
    var uniqueOwnerUserIds = searchResponse.Aggs.Terms("unique_users").Buckets.Select(b => b.KeyAsString).ToList();

    // 3.83 seconds
    // unique question askers: 795352
    Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}");
}
// Scroll API approach.
// LINQPad-style script: assumes the Nest and Elasticsearch.Net namespaces are imported,
// and that the Question type and NDC.StackOverflowIndex are defined elsewhere.
void Main()
{
    var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    var connectionSettings = new ConnectionSettings(pool)
        .MapDefaultTypeIndices(d => d
            .Add(typeof(Question), NDC.StackOverflowIndex)
        );

    var client = new ElasticClient(connectionSettings);

    // De-duplicate on the client: the same user can appear in many documents
    var uniqueOwnerUserIds = new HashSet<int>();

    var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2);
    var yearAgo = DateTime.UtcNow.Date.AddYears(-1);

    // Source filtering returns only OwnerUserId; the scroll context is kept alive for 1 minute
    var searchResponse = client.Search<Question>(s => s
        .Source(sf => sf
            .Include(ff => ff
                .Field(f => f.OwnerUserId)
            )
        )
        .Size(10000)
        .Scroll("1m")
        .Query(q => q
            .DateRange(c => c
                .Field(p => p.CreationDate)
                .GreaterThan(twoYearsAgo)
                .LessThan(yearAgo)
            )
        )
    );

    // Keep fetching pages until a page comes back empty
    while (searchResponse.Documents.Any())
    {
        foreach (var document in searchResponse.Documents)
        {
            if (document.OwnerUserId.HasValue)
                uniqueOwnerUserIds.Add(document.OwnerUserId.Value);
        }

        searchResponse = client.Scroll<Question>("1m", searchResponse.ScrollId);
    }

    // Release the scroll context on the cluster
    client.ClearScroll(c => c.ScrollId(searchResponse.ScrollId));

    // 91.8 seconds
    // unique question askers: 795352
    Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}");
}
In this test, the Terms aggregation (3.83 seconds) is ~24 times faster than the Scroll API approach (91.8 seconds).