Efficient Count Distinct with Apache Spark


盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you as one large dataset.

Usi

8 answers

闹比i (original poster)

2021-01-31 15:15

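If data is an RDD of (site, visitor) pairs, then data.countApproxDistinctByKey(0.05) will give you an RDD of (site, count) pairs, where each count is an approximate number of distinct visitors for that site. The parameter is the target relative standard deviation of the estimate; it can be reduced to get more accuracy at the cost of more memory and processing.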
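A minimal sketch of that call, assuming a local Spark session and a tiny made-up click stream. The object name ApproxDistinctVisitors, the helper distinctVisitorsPerSite, and the sample site and visitor IDs are hypothetical and only for illustration; the exact count is included as a baseline that is practical only on small data.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ApproxDistinctVisitors {

  // Hypothetical helper: approximate number of distinct visitors per site.
  // relativeSD is the target relative standard deviation of the estimate;
  // smaller values are more accurate but use more memory per key.
  def distinctVisitorsPerSite(data: RDD[(String, String)],
                              relativeSD: Double): RDD[(String, Long)] =
    data.countApproxDistinctByKey(relativeSD)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("approx-distinct-visitors")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up click stream: one (site, visitor) pair per click.
    val data = sc.parallelize(Seq(
      ("site-a", "alice"), ("site-a", "bob"), ("site-a", "alice"),
      ("site-b", "carol"), ("site-b", "carol")
    ))

    val approx = distinctVisitorsPerSite(data, 0.05)

    // Exact baseline: deduplicate the pairs, then count one per remaining pair.
    val exact = data.distinct().mapValues(_ => 1L).reduceByKey(_ + _)

    // Print the approximate and exact counts side by side for each site.
    approx.join(exact).collect().foreach { case (site, (a, e)) =>
      println(s"$site approx=$a exact=$e")
    }

    spark.stop()
  }
}
```

On the real click stream the exact baseline would need to shuffle every distinct (site, visitor) pair, whereas the approximate version keeps only a small fixed-size sketch per site, which is why it suits 100 sites with many millions of visitors each.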
