Why doesn't Spark allow map-side combining with array keys?

慢半拍i 2021-01-05 09:21

I'm using Spark 1.3.1 and I'm curious why Spark doesn't allow array keys with map-side combining. A piece of the combineByKey function:



        
1 answer
  • 2021-01-05 09:56

    Basically for the same reason the default partitioner cannot partition array keys.

    A Scala Array is just a wrapper around a Java array, and its hashCode doesn't depend on its content:

    scala> val x = Array(1, 2, 3)
    x: Array[Int] = Array(1, 2, 3)
    
    scala> val h = x.hashCode
    h: Int = 630226932
    
    scala> x(0) = -1
    
    scala> x.hashCode() == h
    res3: Boolean = true
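
To connect this to partitioning, here is a minimal sketch of how a hash-based partitioner typically picks a partition from a key's hashCode. The names `nonNegativeMod` and `partitionFor` are hypothetical helpers written for this illustration, not Spark's own code. Because an array's hash code is identity-based, two arrays with the same content may end up in different partitions, while equal Vectors always land in the same one.

```scala
object ArrayKeyPartitioning {
  // Hypothetical helper: map any hash code (including negative ones)
  // to an index in [0, numPartitions).
  def nonNegativeMod(hash: Int, numPartitions: Int): Int = {
    val raw = hash % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Hypothetical stand-in for a hash partitioner: only the key's hashCode matters.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val numPartitions = 8
    val a = Array(1, 2, 3)
    val b = Array(1, 2, 3)

    // Identity-based hash codes: nothing forces these two indices to agree.
    println(s"array a -> ${partitionFor(a, numPartitions)}, " +
      s"array b -> ${partitionFor(b, numPartitions)}")

    // Content-based hash codes: equal Vectors always get the same index.
    println(s"vector a -> ${partitionFor(a.toVector, numPartitions)}, " +
      s"vector b -> ${partitionFor(b.toVector, numPartitions)}")
  }
}
```

The array indices printed here vary from run to run, which is exactly the problem: records that logically share a key would not be guaranteed to meet in the same partition.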
    

    It means that two arrays with exactly the same content are not equal:

    scala> x
    res4: Array[Int] = Array(-1, 2, 3)
    
    scala> val y = Array(-1, 2, 3)
    y: Array[Int] = Array(-1, 2, 3)
    
    scala> y == x
    res5: Boolean = false
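
For contrast, the following sketch shows the content-based alternatives that do exist for arrays. It only uses standard library calls (`sameElements`, `java.util.Arrays.equals`, `java.util.Arrays.hashCode`, `toVector`); the object name is made up for the example. None of these are what `==` or `hashCode` use on an Array, which is why hash-based collections still treat the two arrays as different keys.

```scala
import java.util.Arrays

object ArrayContentEquality {
  def main(args: Array[String]): Unit = {
    val x = Array(-1, 2, 3)
    val y = Array(-1, 2, 3)

    // Default equality on arrays compares references, not elements.
    println(x == y)                                    // false

    // Element-wise comparisons have to be requested explicitly.
    println(x.sameElements(y))                         // true
    println(Arrays.equals(x, y))                       // true

    // Likewise, a content-based hash is available but is not the default.
    println(Arrays.hashCode(x) == Arrays.hashCode(y))  // true

    // Converting to an immutable collection gives content-based == and hashCode.
    println(x.toVector == y.toVector)                  // true
  }
}
```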
    

    As a result, arrays cannot be used as meaningful keys. If you're not convinced, just check what happens when you use an Array as a key in a Scala Map:

    scala> Map(Array(1) -> 1, Array(1) -> 2)
    res7: scala.collection.immutable.Map[Array[Int],Int] = Map(Array(1) -> 1, Array(1) -> 2)
    

    If you want to use a collection as a key, you should use an immutable data structure such as a Vector or a List, which compare and hash by content.

    scala> Map(Array(1).toVector -> 1, Array(1).toVector -> 2)
    res15: scala.collection.immutable.Map[Vector[Int],Int] = Map(Vector(1) -> 2)
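
The same effect can be seen in a rough simulation of map-side combining. This is a sketch under simplifying assumptions: `combineLocally` is a hypothetical helper that sums values per key inside a single hash map, which is only an approximation of what a combiner does within one partition, not Spark's implementation.

```scala
import scala.collection.mutable

object MapSideCombineSketch {
  // Hypothetical stand-in for a map-side combiner: sum the values of all
  // records whose keys are equal, using a hash map for the lookup.
  def combineLocally[K](records: Seq[(K, Int)]): Map[K, Int] = {
    val combined = mutable.HashMap.empty[K, Int]
    for ((key, value) <- records) {
      combined(key) = combined.getOrElse(key, 0) + value
    }
    combined.toMap
  }

  def main(args: Array[String]): Unit = {
    val arrayKeyed = Seq(Array(1) -> 1, Array(1) -> 2)
    // Convert each key to a Vector so that equal content means equal key.
    val vectorKeyed = arrayKeyed.map { case (k, v) => (k.toVector, v) }

    // Array keys never match each other, so nothing gets combined.
    println(combineLocally(arrayKeyed).size)  // 2

    // Vector keys match by content, so the two values are summed.
    println(combineLocally(vectorKeyed))      // Map(Vector(1) -> 3)
  }
}
```

In a Spark job the analogous fix would be to convert the keys before aggregating, for example `rdd.map { case (k, v) => (k.toVector, v) }.reduceByKey(_ + _)`, where `rdd` is assumed to be an existing pair RDD with array keys.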
    

    See also:

    • SI-1607
    • How does HashPartitioner work?
    • A list as a key for PySpark's reduceByKey