Spark Scala: get an array of type String from multiple columns

Submitted by 三世轮回 on 2019-12-08 13:46:14

Question


I am using Spark with Scala.

Imagine the input: a DataFrame with several columns holding 0/1 flags (the original question included an image of it).

I would like to know how to get the following output: a new column, accumulator, of type Array[String], containing for each row the names of the columns whose value is nonzero (the original question showed this as an image as well).

In my real DataFrame I have more than 3 columns; in fact, several thousand columns.

How can I proceed to get my desired output?


Answer 1:


You can use the array function and map over a sequence of columns:

import org.apache.spark.sql.functions.{array, col, udf, when}

val tmp = array(df.columns.map(c => when(col(c) =!= 0, c)): _*)

where

when(col(c) =!= 0, c)

yields the column name when the column's value is nonzero, and null otherwise.
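For the three example columns used later in this answer (apple, orange, kiwi), the mapped expression expands to roughly the following; this is shown only for illustration:

// expansion of array(df.columns.map(...): _*) for the example columns
array(
  when(col("apple") =!= 0, "apple"),
  when(col("orange") =!= 0, "orange"),
  when(col("kiwi") =!= 0, "kiwi")
)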

Then use a UDF to filter out the nulls:

// drop null entries, leaving only the names of the nonzero columns
val dropNulls = udf((xs: Seq[String]) => xs.flatMap(Option(_)))
df.withColumn("accumulator", dropNulls(tmp))

So with example data:

val df = Seq((1, 0, 1), (0, 1, 1), (1, 0, 0)).toDF("apple", "orange", "kiwi")
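To reproduce the intermediate result shown below, the array expression can first be materialized as a column; the column name tmp here is only for illustration:

df.withColumn("tmp", tmp).show(false)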

you first get:

+-----+------+----+--------------------+
|apple|orange|kiwi|                 tmp|
+-----+------+----+--------------------+
|    1|     0|   1| [apple, null, kiwi]|
|    0|     1|   1|[null, orange, kiwi]|
|    1|     0|   0| [apple, null, null]|
+-----+------+----+--------------------+

and finally:

+-----+------+----+--------------+
|apple|orange|kiwi|   accumulator|
+-----+------+----+--------------+
|    1|     0|   1| [apple, kiwi]|
|    0|     1|   1|[orange, kiwi]|
|    1|     0|   0|       [apple]|
+-----+------+----+--------------+
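Putting the pieces together, here is a minimal end-to-end sketch of the approach above; it assumes an existing SparkSession named spark (for example in the spark-shell):

import org.apache.spark.sql.functions.{array, col, udf, when}
import spark.implicits._  // needed for toDF

val df = Seq((1, 0, 1), (0, 1, 1), (1, 0, 0)).toDF("apple", "orange", "kiwi")

// for every column, keep its name when the value is nonzero, otherwise null
val tmp = array(df.columns.map(c => when(col(c) =!= 0, c)): _*)

// drop the nulls so only the matching column names remain
val dropNulls = udf((xs: Seq[String]) => xs.flatMap(Option(_)))

df.withColumn("accumulator", dropNulls(tmp)).show(false)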


Source: https://stackoverflow.com/questions/40021282/spark-scala-get-an-array-of-type-string-from-multiple-columns
