Why doesn't Spark allow map-side combining with array keys?


I'm using Spark 1.3.1 and I'm curious why Spark doesn't allow using array keys for map-side combining. Here is the relevant piece of the combineByKey function:
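
(The snippet below is a reconstructed sketch of the Spark 1.3.x guard in org.apache.spark.rdd.PairRDDFunctions#combineByKey, not a verbatim quote:)

    // Spark rejects array keys up front, before any map-side work happens:
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("Default partitioner cannot partition array keys.")
      }
    }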
1 Answer

    Basically, for the same reason the default partitioner cannot partition array keys.

    A Scala Array is just a wrapper around a Java array, and its hashCode is identity-based rather than depending on its content:

    scala> val x = Array(1, 2, 3)
    x: Array[Int] = Array(1, 2, 3)
    
    scala> val h = x.hashCode
    h: Int = 630226932
    
    scala> x(0) = -1
    
    scala> x.hashCode == h
    res3: Boolean = true
    
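    An identity-based hashCode is exactly what breaks partitioning. HashPartitioner derives the target partition from key.hashCode, roughly like this (a simplified sketch of Spark's implementation, not the exact source):

    // Simplified sketch of HashPartitioner.getPartition. Because the
    // partition comes from key.hashCode, two arrays with identical content
    // can land in different partitions, and a map-side hash map keyed the
    // same way could never merge their values.
    def getPartition(key: Any, numPartitions: Int): Int = {
      val mod = key.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod  // keep the result non-negative
    }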

    It also means that two arrays with exactly the same content are not equal:

    scala> x
    res4: Array[Int] = Array(-1, 2, 3)
    
    scala> val y = Array(-1, 2, 3)
    y: Array[Int] = Array(-1, 2, 3)
    
    scala> y == x
    res5: Boolean = false
    

    As a result, arrays cannot be used as meaningful keys. If you're not convinced, just check what happens when you use an Array as a key for a Scala Map:

    scala> Map(Array(1) -> 1, Array(1) -> 2)
    res7: scala.collection.immutable.Map[Array[Int],Int] = Map(Array(1) -> 1, Array(1) -> 2)
    
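    The flip side is that you cannot get a value back out with a freshly built array, because lookup uses the same identity-based hashCode and equals (a quick REPL check):

    scala> val m = Map(Array(1) -> 1)
    m: scala.collection.immutable.Map[Array[Int],Int] = Map(Array(1) -> 1)

    scala> m.get(Array(1))
    res8: Option[Int] = None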

    If you want to use a collection as a key, you should use an immutable data structure such as a Vector or a List:

    scala> Map(Array(1).toVector -> 1, Array(1).toVector -> 2)
    res15: scala.collection.immutable.Map[Vector[Int],Int] = Map(Vector(1) -> 2)
    
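    The same fix works inside Spark itself: convert the array keys to an immutable sequence before the byKey operation. A hypothetical spark-shell example (assumes sc and makes up a small RDD of array keys):

    scala> val rdd = sc.parallelize(Seq((Array(1, 2), 1), (Array(1, 2), 2)))
    rdd: org.apache.spark.rdd.RDD[(Array[Int], Int)] = ParallelCollectionRDD[0] at parallelize at <console>:21

    scala> rdd.map { case (k, v) => (k.toVector, v) }.reduceByKey(_ + _).collect()
    res16: Array[(Vector[Int], Int)] = Array((Vector(1, 2),3))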

    See also:

    • SI-1607
    • How does HashPartitioner work?
    • A list as a key for PySpark's reduceByKey
