100 million customers click 100 billion times on the pages of a few web sites (let\'s say 100 websites). And the click stream is available to you in a large dataset.
Usi
If you want it per webpage, then visitors.distinct()...
is inefficient. If there are a lot of visitors and a lot of webpages, then you're distincting over a huge number of (webpage, visitor)
combinations, which can overwhelm the memory.
Here is a another way:
visitors.groupByKey().map {
case (webpage, visitor_iterable)
=> (webpage, visitor_iterable.toArray.distinct.length)
}
This requires that the visitors to a single webpage fits in memory, so may not be best in all cases.