Question
I have a dataframe with two columns: listA, stored as Seq[String], and valB, stored as String. I want to create a third column, valC, of Int type, whose value is 1 if valB is present in listA and 0 otherwise.
I tried doing the following:
val dfWithAdditionalColumn = df.withColumn("valC", when($"listA".contains($"valB"), 1).otherwise(0))
But Spark failed to execute this and gave the following error:
cannot resolve 'contains('listA', 'valB')' due to data type mismatch: argument 1 requires string type, however, 'listA' is of array type.;
How do I use an array-type column value in a CASE statement?
Thanks, Devj
Answer 1:
You can write a simple UDF that checks whether the element is present in the array:
import org.apache.spark.sql.functions.udf

// Returns 1 if col2 (the array) contains col1 (the element), 0 otherwise
val arrayContains = udf((col1: Int, col2: Seq[Int]) => if (col2.contains(col1)) 1 else 0)
Then call it, passing the columns in the matching order:
df.withColumn("hasAInB", arrayContains($"a", $"b")).show()
+---+---------+-------+
| a| b|hasAInB|
+---+---------+-------+
| 1| [1, 2]| 1|
| 2|[2, 3, 4]| 1|
| 3| [1, 4]| 0|
+---+---------+-------+
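Adapted to the types in the question (listA: Seq[String], valB: String), a minimal self-contained sketch might look like the following; the UDF name containsVal and the sample rows are illustrative, not from the original answer:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("containsExample").getOrCreate()
import spark.implicits._

// 1 if the array column contains the value column, 0 otherwise (null-safe)
val containsVal = udf { (list: Seq[String], v: String) =>
  if (list != null && list.contains(v)) 1 else 0
}

val df = Seq(
  (Seq("x", "y"), "x"),
  (Seq("y", "z"), "a")
).toDF("listA", "valB")

df.withColumn("valC", containsVal($"listA", $"valB")).show()
// valC is 1 for the first row ("x" is in [x, y]) and 0 for the second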
Answer 2:
You should use array_contains:
import org.apache.spark.sql.functions.{expr, when}

df.withColumn("valC", when(expr("array_contains(listA, valB)"), 1).otherwise(0))
Source: https://stackoverflow.com/questions/46220883/how-to-use-array-type-column-value-in-case-statement