100 million customers generate 100 billion clicks on the pages of a few web sites (let's say 100 websites), and the click stream is available to you as a large dataset.
Using Spark, how would you count the number of distinct customers in this click stream?
Spark 2.0 added approxCountDistinct to the DataFrame and SQL APIs:
https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#approxCountDistinct(org.apache.spark.sql.Column)
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("col_name", 0.1))
The second argument, 0.1, is the maximum relative standard deviation allowed in the estimate (the default is 0.05). Because the result is an approximation built from HyperLogLog sketches rather than an exact distinct count, you will see much better performance on large datasets.
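As a concrete illustration, here is a minimal, self-contained PySpark sketch applying this to the clickstream scenario above. The column names ("website", "customer_id") and the tiny in-memory sample are assumptions made for the example, not part of the original question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct, countDistinct

spark = SparkSession.builder.appName("approx-distinct-demo").getOrCreate()

# Tiny stand-in for the 100-billion-row click stream (hypothetical columns).
clicks = spark.createDataFrame(
    [("site_a", 1), ("site_a", 2), ("site_a", 1), ("site_b", 3)],
    ["website", "customer_id"],
)

# Approximate distinct customers per website; 0.1 is the maximum
# relative standard deviation (rsd) allowed for the estimate.
clicks.groupBy("website").agg(
    approx_count_distinct("customer_id", 0.1).alias("approx_unique_customers")
).show()

# Exact count for comparison -- far more expensive at scale, since it must
# shuffle and track every distinct value instead of a small sketch per group.
clicks.groupBy("website").agg(
    countDistinct("customer_id").alias("exact_unique_customers")
).show()

At clickstream scale, the approximate aggregation keeps only a fixed-size HyperLogLog sketch per group, which is why it is dramatically faster and lighter on memory than an exact count distinct.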