Efficient Count Distinct with Apache Spark


盖世英雄少女心 2021-01-31 14:46

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you in a large dataset.

Usi

8 answers

孤城傲影 (original poster)

2021-01-31 15:18
I've had to do similar things. One efficiency gain (that isn't really Spark-specific) is to map your visitor IDs to lists of bytes rather than GUID Strings. That saves 4x the space, since 2 Chars are the hex encoding of a single byte and each Char uses 2 bytes in a String.
```
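import org.apache.spark.rdd.RDD

// Inventing these custom types purely for this question - don't do this in real life!
type VisitorID = List[Byte]
type WebsiteID = Int

val visitors: RDD[(WebsiteID, VisitorID)] = ???

// Drop duplicate (website, visitor) pairs, then count the remaining visitors per website.
visitors.distinct().mapValues(_ => 1).reduceByKey(_ + _)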
```
Note you could also do:
```
visitors.distinct().map(_._1).countByValue()
```
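but this doesn't scale as well, since countByValue() returns all the counts to the driver as a local Map instead of keeping them in a distributed RDD.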
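
For what it's worth, the byte-encoding idea from the start of this answer could look something like the following. This is only a minimal sketch, assuming the visitor IDs arrive as canonical GUID strings (for example 123e4567-e89b-12d3-a456-426614174000). The names VisitorIdEncoding, guidToBytes and bytesToGuid are made up for illustration and are not part of Spark.

```scala
import java.nio.ByteBuffer
import java.util.UUID

// Hypothetical helpers (not a Spark API): convert between GUID strings and raw bytes.
object VisitorIdEncoding {

  // Pack a canonical GUID string into its 16 raw bytes (big-endian).
  // Returns a List[Byte] to match the VisitorID alias used in the answer.
  def guidToBytes(guid: String): List[Byte] = {
    val uuid = UUID.fromString(guid)
    ByteBuffer
      .allocate(16)
      .putLong(uuid.getMostSignificantBits)
      .putLong(uuid.getLeastSignificantBits)
      .array()
      .toList
  }

  // Inverse conversion, useful when the original string is needed for reporting.
  def bytesToGuid(bytes: List[Byte]): String = {
    require(bytes.length == 16, "a GUID is exactly 16 bytes")
    val buf = ByteBuffer.wrap(bytes.toArray)
    new UUID(buf.getLong(), buf.getLong()).toString
  }
}
```

You would apply guidToBytes in a map over the raw (website, GUID string) pairs before calling distinct(). The helper returns a List[Byte] rather than an Array[Byte] because arrays compare by reference on the JVM, so distinct() would not deduplicate them.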