Merge two Spark SQL columns of type Array[String] into a new Array[String] column

旧巷少年郎 2021-01-05 15:01

I have two columns in a Spark SQL DataFrame; each entry in either column is an array of strings.

val ngramDataFrame = Seq(
  (Seq("curious", "bought"), Seq("iwa", "was", "asj"))
).toDF("filtered_words", "ngrams_array")

How can I merge the two array columns into a single new column of type Array[String]?

2 Answers
  •  花落未央
    2021-01-05 15:38

    In Spark 2.4 or later you can use concat (if you want to keep duplicates):

    import org.apache.spark.sql.functions._

    ngramDataFrame.withColumn(
      "full_array", concat($"filtered_words", $"ngrams_array")
    ).show
    
    +--------------------+---------------+--------------------+
    |      filtered_words|   ngrams_array|          full_array|
    +--------------------+---------------+--------------------+
    |[curious, bought,...|[iwa, was, asj]|[curious, bought,...|
    +--------------------+---------------+--------------------+
    

    or array_union (if you want to drop duplicates):

    ngramDataFrame.withColumn(
      "full_array",
       array_union($"filtered_words", $"ngrams_array")
    )
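
    To make the difference concrete, here is a small self-contained comparison where the two arrays overlap. The sample values "a", "b", "c" and the column names x and y are my own illustration, not from the question; it assumes the functions import from above and a SparkSession named spark:

    import spark.implicits._   // provides toDF on Seq and the $ column syntax

    Seq((Seq("a", "b"), Seq("b", "c"))).toDF("x", "y")
      .select(
        concat($"x", $"y").as("keep_duplicates"),        // [a, b, b, c]
        array_union($"x", $"y").as("drop_duplicates")    // [a, b, c]
      ).show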
    

    These can also be composed from the other higher-order functions, for example

    ngramDataFrame.withColumn(
       "full_array",
       flatten(array($"filtered_words", $"ngrams_array"))
    )
    

    with duplicates, and

    ngramDataFrame.withColumn(
       "full_array",
       array_distinct(flatten(array($"filtered_words", $"ngrams_array")))
    )
    

    without.

    On a side note, you shouldn't use WrappedArray when working with ArrayType columns. Instead, you should expect the guaranteed interface, which is Seq. So the udf should use a function with the following signature:

    (Seq[String], Seq[String]) => Seq[String]
    

    Please refer to the SQL Programming Guide for details.
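
    As a sketch of what that looks like in practice (the name mergeArrays and the simple ++ merge are my own illustration, not part of the answer):

    import org.apache.spark.sql.functions.udf

    // Takes Seq[String] parameters rather than WrappedArray, as recommended above.
    val mergeArrays = udf((a: Seq[String], b: Seq[String]) => a ++ b)

    ngramDataFrame.withColumn(
      "full_array", mergeArrays($"filtered_words", $"ngrams_array")
    ).show

    On Spark 2.4+ the built-in functions above are still preferable, since a udf is opaque to the Catalyst optimizer.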
