I have a row from a data frame and I want to convert it to a Map[String, Any] that maps column names to the values in the row for that column.
Is there an easy way to do this?
Let's say you have a row without structure information, and the column headers as an array.
val rdd = sc.parallelize( Seq(Row("test1", "val1"), Row("test2", "val2"), Row("test3", "val3"), Row("test4", "val4")) )
rdd.collect.foreach(println)
val sparkFieldNames = Array("col1", "col2")
val mapRDD = rdd.map(r => sparkFieldNames.zip(r.toSeq).toMap)
mapRDD.collect.foreach(println)
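The core of this approach is plain Scala: `zip` pairs the header array with the row's values by position. A minimal sketch without Spark (the `Seq` here is a stand-in for `row.toSeq`):

```scala
// Stand-in for a Spark Row's values, as returned by row.toSeq.
val rowValues: Seq[Any] = Seq("test1", "val1")
val fieldNames = Array("col1", "col2")

// zip pairs each column name with the value at the same index,
// and toMap turns the resulting pairs into a Map[String, Any].
val rowMap: Map[String, Any] = fieldNames.zip(rowValues).toMap
// rowMap: Map(col1 -> test1, col2 -> val1)
```

If the header array is shorter than the row, `zip` silently drops the extra values, so make sure the two line up.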
You can use getValuesMap:
val df = Seq((1, 2.0, "a")).toDF("A", "B", "C")
val row = df.first
To get Map[String, Any]:
row.getValuesMap[Any](row.schema.fieldNames)
// res19: Map[String,Any] = Map(A -> 1, B -> 2.0, C -> a)
Or you can get Map[String, AnyVal] for this simple case, since the values are not complex objects:
row.getValuesMap[AnyVal](row.schema.fieldNames)
// res20: Map[String,AnyVal] = Map(A -> 1, B -> 2.0, C -> a)
Note: the value type returned by getValuesMap can be labelled as any type, so you cannot rely on it to figure out what data types you actually have; you need to keep track of the real types yourself from the beginning.
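Because the compiler takes that type parameter on faith, recovering the real types from a Map[String, Any] needs a runtime check. A small plain-Scala sketch (the map literal stands in for the result of getValuesMap[Any]):

```scala
// The kind of map getValuesMap[Any] returns for the example row above.
val values: Map[String, Any] = Map("A" -> 1, "B" -> 2.0, "C" -> "a")

// Pattern matching recovers the runtime type; the static type is just Any.
def describe(v: Any): String = v match {
  case i: Int    => s"Int($i)"
  case d: Double => s"Double($d)"
  case s: String => s"String($s)"
  case other     => s"Other($other)"
}

val described = values.map { case (k, v) => k -> describe(v) }
// described: Map(A -> Int(1), B -> Double(2.0), C -> String(a))
```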
Let's say you have a DataFrame with these columns:
[time (TimestampType), col1 (DoubleType), col2 (DoubleType)]
You can do something like this:
val modifiedRdd = df.rdd.map { row =>
  val doubleObject = row.getValuesMap[Double](Seq("col1", "col2"))
  val timeObject = Map("time" -> row.getAs[java.sql.Timestamp]("time"))
  doubleObject ++ timeObject
}
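The ++ merges the two maps; when a key appears in both, the entry from the right-hand map wins. A quick plain-Scala illustration of that behavior (the values here are made up for the example):

```scala
// Two maps with one overlapping key, col2.
val doubles = Map("col1" -> 1.5, "col2" -> 2.5)
val extras: Map[String, Any] = Map("time" -> "2021-01-01 00:00:00", "col2" -> 99.0)

// ++ is right-biased: col2 takes its value from extras.
val merged: Map[String, Any] = doubles ++ extras
// merged: Map(col1 -> 1.5, col2 -> 99.0, time -> 2021-01-01 00:00:00)
```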
You can convert your DataFrame to an RDD and use a simple map function: build the Map from the header names inside the map function, then call collect at the end.
val fn = df.schema.fieldNames
val maps = df.rdd.map(row => fn.map(field => field -> row.getAs[Any](field)).toMap).collect()