Efficient Count Distinct with Apache Spark

盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites). The click stream is available to you in a large dataset.

Usi

8 answers
  •  走了就别回头了
    2021-01-31 15:19

    If you want the count per webpage, then visitors.distinct()... is inefficient. With many visitors and many webpages, you are deduplicating a huge number of (webpage, visitor) combinations, which can overwhelm memory.

    Here is another way:

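    // Assumes visitors is a pair RDD of (webpage, visitor) records.
    visitors.groupByKey().map {
      case (webpage, visitorIterable) =>
        // Deduplicate this page's visitors in memory, then count them.
        (webpage, visitorIterable.toSet.size)
    }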
    

    This requires that the visitors to a single webpage fit in memory, so it may not be the best choice in all cases.
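
    To make the grouping step concrete, here is a minimal sketch of the same per-page logic on plain Scala collections rather than a Spark RDD. The object name, the sample clicks, and the `distinctVisitorsPerPage` helper are hypothetical and only for illustration; the sketch assumes each record is a (webpage, visitor) pair.

    ```scala
    // Local sketch on plain Scala collections (no Spark), with made-up sample data.
    object DistinctVisitorsSketch {
      // Hypothetical click records: (webpage, visitor)
      val clicks: Seq[(String, String)] = Seq(
        ("a.com", "alice"),
        ("a.com", "bob"),
        ("a.com", "alice"), // repeat click, must not be counted twice
        ("b.com", "alice")
      )

      // Group clicks by webpage, then deduplicate each page's visitors and count them.
      def distinctVisitorsPerPage(records: Seq[(String, String)]): Map[String, Int] =
        records
          .groupBy { case (webpage, _) => webpage }
          .map { case (webpage, pageClicks) =>
            (webpage, pageClicks.map(_._2).toSet.size)
          }

      def main(args: Array[String]): Unit =
        distinctVisitorsPerPage(clicks).foreach { case (webpage, n) =>
          println(s"$webpage -> $n")
        }
    }
    ```

    With the sample data, a.com gets 2 distinct visitors and b.com gets 1. The set built for each page is the part that has to fit in memory, which is where the limitation above comes from.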
