How to zip two array columns in Spark SQL

南方客 2020-11-30 14:30

I have a Pandas dataframe. I tried to first join two columns containing string values into a list and then, using zip, join each element of the list with '_'. My d

3 Answers
  • 2020-11-30 14:53

    You can also use a UDF to zip the split array columns:

    df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2']) 
    +-----------+-----------+
    |col1       |col2       |
    +-----------+-----------+
    |abc,def,ghi|1.0,2.0,3.0|
    +-----------+-----------+ ## Hope this is how your dataframe is
    
    from pyspark.sql import functions as F
    from pyspark.sql.types import *
    
    def concat_udf(*args):
        return ['_'.join(x) for x in zip(*args)]
    
    udf1 = F.udf(concat_udf,ArrayType(StringType()))
    df = df.withColumn('col3',udf1(F.split(df.col1,','),F.split(df.col2,',')))
    df.show(1,False)
    +-----------+-----------+---------------------------+
    |col1       |col2       |col3                       |
    +-----------+-----------+---------------------------+
    |abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
    +-----------+-----------+---------------------------+
    
  • 2020-11-30 14:59

    For Spark 2.4+, this can be done with the zip_with function alone, which zips and concatenates in a single step:

    from pyspark.sql.functions import expr
    df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
    

    The higher-order function takes 2 arrays to merge, element-wise, using a lambda function (x, y) -> concat(x, '_', y).
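    To make the element-wise behaviour concrete, here is a minimal plain-Python sketch of what the expression above computes. It is only a model, not Spark code: the helper name zip_with_concat is hypothetical, and it assumes the documented Spark semantics that zip_with pads the shorter array with nulls and that concat returns null when any argument is null.

```python
from itertools import zip_longest


def zip_with_concat(xs, ys, sep="_"):
    """Plain-Python model of zip_with(xs, ys, (x, y) -> concat(x, sep, y))."""
    out = []
    # zip_longest pads the shorter list with None, mirroring the null padding
    # that zip_with applies before calling the lambda.
    for x, y in zip_longest(xs, ys, fillvalue=None):
        # Model of concat: the result is null (None) if either side is null.
        out.append(None if x is None or y is None else f"{x}{sep}{y}")
    return out


print(zip_with_concat(["abc", "def", "ghi"], ["1.0", "2.0", "3.0"]))
# ['abc_1.0', 'def_2.0', 'ghi_3.0']
print(zip_with_concat(["abc", "def", "ghi"], ["1.0", "2.0"]))
# ['abc_1.0', 'def_2.0', None]
```

    Note the second call: a Python UDF built on zip would instead stop at the shorter array and return two elements, so the two approaches only agree when both arrays have the same length.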

  • 2020-11-30 15:09

    A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:

    pyspark.sql.functions.arrays_zip(*cols)

    Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
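    As a rough illustration of that description, the sketch below models arrays_zip in plain Python, representing each struct as a dict keyed by column name. The helper arrays_zip_model is hypothetical and is not part of PySpark; it assumes that shorter arrays are padded with nulls, which is how Spark's arrays_zip treats arrays of unequal length.

```python
from itertools import zip_longest


def arrays_zip_model(**named_arrays):
    """Plain-Python model of arrays_zip: the N-th dict holds all N-th values."""
    names = list(named_arrays)
    # Pad shorter arrays with None so every struct has one entry per column.
    rows = zip_longest(*named_arrays.values(), fillvalue=None)
    return [dict(zip(names, row)) for row in rows]


zipped = arrays_zip_model(
    column_1=["abc", "def", "ghi"],
    column_2=["1.0", "2.0", "3.0"],
)
print(zipped[0])    # {'column_1': 'abc', 'column_2': '1.0'}
print(len(zipped))  # 3
```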

    So if you already have two arrays:

    from pyspark.sql.functions import split
    
    df = (spark
        .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
        .toDF("column_1", "column_2")
        .withColumn("column_1", split("column_1", r"\s*,\s*"))
        .withColumn("column_2", split("column_2", r"\s*,\s*")))
    

    You can then apply arrays_zip to the result:

    from pyspark.sql.functions import arrays_zip
    
    df_zipped = df.withColumn(
      "zipped", arrays_zip("column_1", "column_2")
    )
    
    df_zipped.select("zipped").show(truncate=False)
    
    +------------------------------------+
    |zipped                              |
    +------------------------------------+
    |[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
    +------------------------------------+
    

    Now, to combine the results, you can use transform (see "How to use transform higher-order function?" and "TypeError: Column is not iterable - How to iterate over ArrayType()?"):

    from pyspark.sql.functions import expr
    
    df_zipped_concat = df_zipped.withColumn(
        "zipped_concat",
        expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
    )
    
    df_zipped_concat.select("zipped_concat").show(truncate=False)
    
    +---------------------------+
    |zipped_concat              |
    +---------------------------+
    |[abc_1.0, def_2.0, ghi_3.0]|
    +---------------------------+
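    The transform step can be pictured with the plain-Python sketch below. The helpers transform_model and concat_ws_model are hypothetical stand-ins, not PySpark functions, and the null in the last struct is made-up sample data added to show one assumption worth knowing: concat_ws skips null arguments, whereas concat would return null for that element.

```python
def concat_ws_model(sep, *values):
    """Model of concat_ws: join the non-null values with the separator."""
    return sep.join(v for v in values if v is not None)


def transform_model(structs, fn):
    """Model of transform: apply fn to every element of the array."""
    return [fn(s) for s in structs]


# Structs as produced by arrays_zip, modelled here as dicts.
zipped = [
    {"column_1": "abc", "column_2": "1.0"},
    {"column_1": "def", "column_2": "2.0"},
    {"column_1": "ghi", "column_2": None},  # illustrative null value
]

result = transform_model(
    zipped, lambda x: concat_ws_model("_", x["column_1"], x["column_2"])
)
print(result)  # ['abc_1.0', 'def_2.0', 'ghi']
```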
    

    Note:

    The higher-order function transform and the collection function arrays_zip were both introduced in Apache Spark 2.4.
