100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you in a large dataset.
Usi
I've had to do similar things. One efficiency gain (not really Spark-specific) is to map your visitor IDs to lists of bytes rather than GUID Strings. That saves 4x the space, since two Chars hex-encode a single byte and each Char takes 2 bytes in a String.
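As a rough sketch of that conversion, assuming the visitor IDs arrive as canonical GUID strings (for example `123e4567-e89b-12d3-a456-426614174000`), something like the following would pack each one into its 16 raw bytes. The object and method names (`VisitorIdEncoding`, `guidToBytes`, `bytesToGuid`) are made up for illustration; only `java.util.UUID` and `java.nio.ByteBuffer` are real library classes.

```scala
import java.nio.ByteBuffer
import java.util.UUID

// Hypothetical helper, not part of Spark: packs GUID strings into raw bytes.
object VisitorIdEncoding {

  // Parse a canonical GUID string and return its 16 bytes as a List[Byte].
  // A List is used because, unlike Array[Byte], it compares by value,
  // which matters when the IDs are later de-duplicated.
  def guidToBytes(guid: String): List[Byte] = {
    val uuid = UUID.fromString(guid)
    ByteBuffer
      .allocate(16)
      .putLong(uuid.getMostSignificantBits)
      .putLong(uuid.getLeastSignificantBits)
      .array()
      .toList
  }

  // Inverse conversion, useful when the ID has to be displayed again.
  def bytesToGuid(bytes: List[Byte]): String = {
    val buf = ByteBuffer.wrap(bytes.toArray)
    val most = buf.getLong()
    val least = buf.getLong()
    new UUID(most, least).toString
  }
}
```

You would apply `guidToBytes` once while loading the click stream, so that every later shuffle moves 16 bytes per ID instead of the full string.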
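import org.apache.spark.rdd.RDD
// Inventing these custom types purely for this question - don't do this in real life!
type VisitorID = List[Byte]
type WebsiteID = Int
val visitors: RDD[(WebsiteID, VisitorID)] = ???
// Drop repeat clicks by the same visitor on the same site, then count the remaining pairs per site.
visitors.distinct().mapValues(_ => 1).reduceByKey(_ + _)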
Note you could also do:
visitors.distinct().map(_._1).countByValue()
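but this doesn't scale as well, because countByValue() returns the counts as a local Map on the driver rather than as a distributed RDD.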
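To make the de-duplicate-then-count logic concrete, here is a small sketch of the same idea on plain Scala collections, with no Spark involved. It is only meant to show what the result looks like on a tiny in-memory sample; `UniqueVisitorsSketch` and `uniqueVisitorsPerSite` are hypothetical names, and at real scale you would use the RDD version.

```scala
// Plain-collections illustration of "distinct, then count per key".
object UniqueVisitorsSketch {

  // clicks: one (websiteId, visitorId) pair per click, repeats allowed.
  // Returns the number of distinct visitors seen on each website.
  def uniqueVisitorsPerSite[V](clicks: Seq[(Int, V)]): Map[Int, Int] =
    clicks.distinct                // keep one pair per (site, visitor)
      .groupBy(_._1)               // bucket the remaining pairs by site
      .map { case (site, pairs) => site -> pairs.size }
}
```

For example, the clicks `Seq((1, "a"), (1, "a"), (1, "b"), (2, "a"))` contain two distinct visitors on site 1 and one on site 2.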