Suppose 100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you as one large dataset.
Use approx_count_distinct to estimate the number of distinct values:
df.select(approx_count_distinct("col_name", 0.1))
The second argument, 0.1, is the maximum relative standard deviation allowed in the estimate (Spark's default is 0.05). On a large dataset this performs much better than an exact distinct count.
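To see why a looser error bound is cheaper, it may help to look at the kind of sketch behind the function. Spark documents approx_count_distinct as being based on HyperLogLog++; the code below is a much simpler plain HyperLogLog in pure Python, not Spark's implementation. The names precision_for_rsd and hll_estimate are made up for this illustration, and SHA-1 is just a convenient stand-in hash.

```python
import hashlib
import math


def precision_for_rsd(rsd):
    """Smallest p such that 1.04 / sqrt(2**p) <= rsd (hypothetical helper)."""
    registers_needed = (1.04 / rsd) ** 2
    return max(4, math.ceil(math.log2(registers_needed)))


def hll_estimate(values, rsd=0.05):
    """Estimate the number of distinct values with a plain HyperLogLog sketch."""
    p = precision_for_rsd(rsd)
    m = 1 << p                      # number of registers
    registers = [0] * m

    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash
        index = h >> (64 - p)                       # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)            # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest`, counting from 1.
        rank = (64 - p) - rest.bit_length() + 1
        if rank > registers[index]:
            registers[index] = rank

    # Bias-correction constant from the original HyperLogLog description.
    if m == 16:
        alpha = 0.673
    elif m == 32:
        alpha = 0.697
    elif m == 64:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / m)

    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)
    return raw


# 100,000 simulated clicks from 20,000 distinct (made-up) customer ids.
clicks = ["customer_%d" % (i % 20000) for i in range(100000)]
print(precision_for_rsd(0.1), hll_estimate(clicks, rsd=0.1))
```

With an allowed error of 0.1 this sketch needs only 128 small registers no matter how many clicks it sees, which is why the approximate count scales so well. Tightening the error grows the register count roughly with 1 / rsd squared.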