Question
Getting started with Spark, I would like to know how to flatMap or explode a DataFrame.
It was created using df.groupBy("columName").count and has the following structure when I collect it:
[[Key1, count], [Key2, count2]]
But I would rather have something like:
Map(bar -> 1, foo -> 1, awesome -> 1)
What is the right tool to achieve something like this? flatMap, explode, or something else?
Context: I want to use spark-jobserver. It only seems to provide meaningful results (e.g. a working JSON serialization) when I supply the data in the latter form.
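For reference, here is a minimal, self-contained sketch of the setup described above; the column name columName comes from the question, while the sample values (bar, foo, awesome) are hypothetical, taken from the desired output:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("counts").getOrCreate()
import spark.implicits._

// One row per word; grouping and counting yields one (key, count) row per distinct value
val df = Seq("bar", "foo", "awesome").toDF("columName")
val counted = df.groupBy("columName").count()

// collect() returns Array[Row]; each Row prints like [bar,1], matching the structure above
counted.collect().foreach(println)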
Answer 1:
I'm assuming you're calling collect or collectAsList on the DataFrame? That would return an Array[Row] / List[Row].
If so, the easiest way to transform these into maps is to use the underlying RDD, map its records into key-value tuples, and call collectAsMap:
val counted = df.groupBy("columName").count()
// replace "keyColumn" and "valueColumn" with your actual column names
// (here: "columName" and "count", since groupBy(...).count() names its aggregate column "count")
val result = counted.rdd.map(r => (r.getAs[String]("keyColumn"), r.getAs[Long]("valueColumn"))).collectAsMap()
result has type Map[String, Long], as expected.
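Applied to the hypothetical example from the question, with "columName" as the key column and Spark's default aggregate column name "count" as the value column, this looks like:

// collectAsMap is defined on pair RDDs, so map each Row to a (key, value) tuple first
val result = counted.rdd
  .map(r => (r.getAs[String]("columName"), r.getAs[Long]("count")))
  .collectAsMap()

println(result)  // Map(bar -> 1, foo -> 1, awesome -> 1); entry order may vary

Since groupBy guarantees distinct keys, no entries are lost when collapsing the rows into a Map.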
Source: https://stackoverflow.com/questions/36541565/spark-flattening-out-dataframes