Efficient Count Distinct with Apache Spark

盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you in a large dataset.

Usi

8 answers
  •  孤独总比滥情好
    2021-01-31 15:29

    df.select(approx_count_distinct("col_name", 0.1))
    

    The second argument, 0.1, is the maximum relative standard deviation (rsd) allowed for the estimate. A looser bound means a smaller sketch, and on large datasets this is much faster than an exact distinct count.
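
    To make the accuracy trade-off concrete, here is a minimal pure-Python sketch of a plain HyperLogLog counter. Spark documents approx_count_distinct as based on HyperLogLog++, so this is only a simplified illustration of the idea, not Spark's code. The function name hll_estimate and the parameter p are made up for this example. With m = 2**p registers the typical relative error is about 1.04 / sqrt(m), so p = 7 (128 registers) lands near the 0.1 used above.

```python
import hashlib
import math


def hll_estimate(items, p=7):
    """Estimate the number of distinct items with a basic HyperLogLog sketch.

    Hypothetical helper for illustration only. p is the number of index bits;
    the sketch keeps m = 2**p small registers instead of the full set of values.
    """
    if p < 7 or p > 16:
        raise ValueError("this sketch assumes 7 <= p <= 16")
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Derive a 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                  # top p bits select a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        if rank > registers[idx]:
            registers[idx] = rank
    alpha = 0.7213 / (1 + 1.079 / m)         # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw
```

    Calling hll_estimate(user_ids, p=7) gives a coarse estimate from only 128 registers, while a larger p such as 12 tightens the estimate at the cost of more memory. That is the same kind of trade-off the rsd argument controls.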
