100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you in a large dataset.
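If data is an RDD of (site, visitor) pairs, then data.countApproxDistinctByKey(0.05) will give you an RDD of (site, count) pairs, where count is the approximate number of distinct visitors for that site. The parameter is the relative standard deviation of the estimate (0.05 is roughly 5%); it can be reduced to get more accuracy at the cost of more memory and processing.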
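
As a minimal sketch of that call, assuming a local Spark session and a tiny in-memory sample standing in for the real click stream: the site names, visitor IDs, application name, and local[*] master below are made up for illustration, and only countApproxDistinctByKey comes from the answer itself.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DistinctVisitorsPerSite {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode for experimenting; a real job would run on a cluster.
    val spark = SparkSession.builder()
      .appName("distinct-visitors-per-site") // hypothetical name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample: one (site, visitor) pair per click.
    // Repeated pairs are repeat clicks by the same visitor on the same site.
    val data: RDD[(String, Long)] = sc.parallelize(Seq(
      ("site-a", 1L), ("site-a", 2L), ("site-a", 1L),
      ("site-b", 3L), ("site-b", 3L)
    ))

    // relativeSD = 0.05: estimates carry roughly 5% relative standard deviation.
    // A smaller value asks for a more precise (and more expensive) counter per key.
    val visitorsPerSite: RDD[(String, Long)] = data.countApproxDistinctByKey(0.05)

    // With about 100 sites the result is small enough to collect to the driver.
    visitorsPerSite.collect().foreach { case (site, count) =>
      println(s"$site: about $count distinct visitors")
    }

    spark.stop()
  }
}
```

In spark-shell, sc already exists, so only the lines from the data definition through the collect are needed. The counts are estimates, so on real data they should be read as being within a few percent of the true value rather than exact.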