Efficient Count Distinct with Apache Spark

盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you as a large dataset.

Using the abstractions of Apache Spark, what is the most efficient way to count the distinct visitors per website?

8 Answers
  •  孤城傲影
    2021-01-31 15:18

    I've had to do similar things. One efficiency trick you can apply (which isn't really Spark-specific) is to map your visitor IDs to lists of bytes rather than GUID Strings: that saves 4x the space, since two Chars hex-encode a single byte, and each Char occupies 2 bytes in a String. A conversion sketch follows the code below.

    // A sketch against the RDD API; ??? stands in for your real input data
    import org.apache.spark.rdd.RDD

    // Inventing these custom types purely for this question - don't do this in real life!
    type VisitorID = List[Byte]
    type WebsiteID = Int

    val visitors: RDD[(WebsiteID, VisitorID)] = ???

    // Deduplicate (website, visitor) pairs, then count unique visitors per website
    visitors.distinct().mapValues(_ => 1).reduceByKey(_ + _)
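
    To show the byte mapping mentioned above, here's a minimal sketch of packing a hex GUID String into a VisitorID; the guidToBytes helper name is invented for illustration:

    import java.nio.ByteBuffer
    import java.util.UUID

    // Hypothetical helper: pack a 36-character GUID String (72 bytes of char
    // data) into the 16 raw bytes it actually encodes.
    def guidToBytes(guid: String): VisitorID = {
      val uuid = UUID.fromString(guid)
      val buf = ByteBuffer.allocate(16)
      buf.putLong(uuid.getMostSignificantBits)
      buf.putLong(uuid.getLeastSignificantBits)
      buf.array().toList
    }

    (List[Byte] is used here only to match the alias above; a plain Array[Byte] would break distinct(), since arrays compare by reference rather than by content.)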
    

    Note you could also do:

    visitors.distinct().map(_._1).countByValue()

    but this doesn't scale as well: countByValue is an action that collects the per-website counts into a Map on the driver, instead of leaving them distributed as an RDD.
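
    If an approximate answer is acceptable, Spark's built-in countApproxDistinctByKey (backed by HyperLogLog) shuffles compact sketches instead of every distinct (website, visitor) pair; a sketch, assuming roughly 1% relative error is tolerable:

    // Approximate distinct visitors per website; relativeSD = 0.01 targets
    // about 1% relative standard deviation on the counts.
    val approxCounts: RDD[(WebsiteID, Long)] =
      visitors.countApproxDistinctByKey(relativeSD = 0.01)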
