Efficient Count Distinct with Apache Spark

盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you in a large dataset.

Usi

8 Answers

走了就别回头了 (original poster)

2021-01-31 15:19
If you want the count per webpage, then visitors.distinct()... is inefficient. With many visitors and many webpages, you are running distinct over a huge number of (webpage, visitor) combinations, which can overwhelm memory.

Here is another way:
```
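// visitors: an RDD of (webpage, visitor) pairs
visitors.groupByKey().map {
  // Collect each page's visitors locally, then count the distinct ones
  case (webpage, visitor_iterable) =>
    (webpage, visitor_iterable.toArray.distinct.length)
}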
```
This requires that the visitors to a single webpage fit in memory, so it may not be the best choice in all cases.
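If holding every visitor of one page in memory is the concern, two standard `PairRDDFunctions` methods may be worth trying. The sketch below is only an illustration, assuming `visitors` is an `RDD[(String, String)]` of (webpage, visitor) pairs; the object name `DistinctVisitors` and both helper names are made up for this example. `aggregateByKey` still needs each page's distinct visitors to fit in memory, but it deduplicates within each partition before the shuffle instead of shipping every click. `countApproxDistinctByKey` gives up exactness for a HyperLogLog-based estimate that uses bounded memory per key.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the names are illustrative, not from the question.
object DistinctVisitors {

  // Exact count: fold visitors into one Set per webpage inside each partition,
  // then union the partial sets after the shuffle.
  // Each page's distinct visitors must still fit in memory.
  def exactCounts(visitors: RDD[(String, String)]): RDD[(String, Int)] =
    visitors
      .aggregateByKey(Set.empty[String])(_ + _, _ ++ _)
      .mapValues(_.size)

  // Approximate count: an estimate with bounded memory per key.
  // relativeSD is the target relative standard deviation (Spark's default is 0.05);
  // smaller values cost more memory.
  def approxCounts(visitors: RDD[(String, String)],
                   relativeSD: Double = 0.05): RDD[(String, Long)] =
    visitors.countApproxDistinctByKey(relativeSD)
}
```

With 100 websites as keys, `DistinctVisitors.approxCounts(visitors).collect()` would return at most 100 small (webpage, estimate) pairs to the driver. Whether the estimation error is acceptable depends on what the counts are used for.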