How to get keys and values from MapType column in SparkSQL DataFrame

南旧 2020-12-01 12:57

I have data in a parquet file which has 2 fields: object_id: String and alpha: Map<>.

It is read into a data frame in Spark SQL. How do I get the keys and values of the alpha column?

2 Answers
  • 2020-12-01 13:33

    And if you are in PySpark, I just found an easy implementation:

    from pyspark.sql.functions import map_keys
    
    alphaDF.select(map_keys("ALPHA").alias("keys")).show()
    

    You can check the details in the PySpark API documentation for map_keys.

  • 2020-12-01 13:37

    Spark >= 2.3

    You can simplify the process using the map_keys function:

    import org.apache.spark.sql.functions.map_keys
    

    There is also map_values function, but it won't be directly useful here.
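
    For illustration, a minimal sketch of what the two functions return, assuming a SparkSession in scope with spark.implicits._ imported (the toy column names here are my own):

    import org.apache.spark.sql.functions.{map_keys, map_values}
    
    // Hypothetical one-row DataFrame with a simple map column
    val toy = Seq((1, Map("a" -> 10, "b" -> 20))).toDF("id", "m")
    
    toy.select(map_keys($"m"), map_values($"m")).show()
    // Output will look something like:
    // +-----------+-------------+
    // |map_keys(m)|map_values(m)|
    // +-----------+-------------+
    // |     [a, b]|     [10, 20]|
    // +-----------+-------------+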

    Spark < 2.3

    The general method can be expressed in a few steps. First, the required imports:

    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.Row
    import spark.implicits._  // for toDF and the $ column syntax
    

    and example data:

    val ds = Seq(
      (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
      (2, Map("foo" -> (3, "c"))),
      (3, Map("bar" -> (4, "d")))
    ).toDF("id", "alpha")
    

    To extract the keys we can use a UDF (Spark < 2.3):

    val map_keys = udf[Seq[String], Map[String, Row]](_.keys.toSeq)
    
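    Applied to the example data, a quick sketch of the intermediate result (array rendering may differ slightly across Spark versions):

    ds.select(map_keys($"alpha").alias("keys")).show()
    // +----------+
    // |      keys|
    // +----------+
    // |[foo, bar]|
    // |     [foo]|
    // |     [bar]|
    // +----------+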

    or, on Spark >= 2.3, the built-in function:

    import org.apache.spark.sql.functions.map_keys
    
    val keysDF = ds.select(map_keys($"alpha"))
    

    Find distinct ones:

    val distinctKeys = keysDF.as[Seq[String]].flatMap(identity).distinct
      .collect.sorted
    
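    With the example data this yields, after sorting:

    // distinctKeys: Array[String] = Array(bar, foo)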

    You can also generalize key extraction with explode:

    import org.apache.spark.sql.functions.explode
    
    val distinctKeys = ds
      // Flatten the map column into key, value columns
      .select(explode($"alpha"))
      .select($"key")
      .as[String].distinct
      .collect.sorted
    
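    Since explode exposes both key and value columns, the same pattern also reaches the values; a sketch (with the example data the value column holds structs):

    ds.select(explode($"alpha"))
      .select($"value")
      .show()
    // +------+
    // | value|
    // +------+
    // |[1, a]|
    // |[2, b]|
    // |[3, c]|
    // |[4, d]|
    // +------+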

    And select:

    ds.select($"id" +: distinctKeys.map(x => $"alpha".getItem(x).alias(x)): _*)
    
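    For the example data, the result would look something like this (struct rendering varies across Spark versions):

    // +---+------+------+
    // | id|   bar|   foo|
    // +---+------+------+
    // |  1|[2, b]|[1, a]|
    // |  2|  null|[3, c]|
    // |  3|[4, d]|  null|
    // +---+------+------+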