Convert PySpark dataframe column from list to string

Asked 2021-01-01 17:44

I have this PySpark dataframe:

+----+--------------------+
|uuid|            test_123|
+----+--------------------+
|   1|[test, test2, test3]|
|   2|[test4, test, test6]|
|   3|[test6, test9, t55o]|
+----+--------------------+

The test_123 column holds an array of strings. How can I convert it to a single comma-separated string per row?

2 Answers
  • 2021-01-01 18:09

    You can create a udf that joins the array/list elements and then apply it to the test_123 column:

    from pyspark.sql.functions import udf, col
    
    # join the array elements with a comma; udf returns StringType by default
    join_udf = udf(lambda x: ",".join(x))
    df.withColumn("test_123", join_udf(col("test_123"))).show()
    
    +----+----------------+
    |uuid|        test_123|
    +----+----------------+
    |   1|test,test2,test3|
    |   2|test4,test,test6|
    |   3|test6,test9,t55o|
    +----+----------------+
    

    The initial data frame is created from:

    from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType
    
    schema = StructType([StructField("uuid", IntegerType(), True),
                         StructField("test_123", ArrayType(StringType(), True), True)])
    rdd = sc.parallelize([[1, ["test", "test2", "test3"]],
                          [2, ["test4", "test", "test6"]],
                          [3, ["test6", "test9", "t55o"]]])
    df = spark.createDataFrame(rdd, schema)
    
    df.show()
    +----+--------------------+
    |uuid|            test_123|
    +----+--------------------+
    |   1|[test, test2, test3]|
    |   2|[test4, test, test6]|
    |   3|[test6, test9, t55o]|
    +----+--------------------+
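
    Note that if test_123 can be null for some rows, ",".join(None) raises a TypeError inside the udf. A minimal null-safe sketch (the join_udf_safe name is mine, not part of the original answer):

    # hypothetical variant that passes null arrays through instead of failing
    join_udf_safe = udf(lambda x: ",".join(x) if x is not None else None)
    df.withColumn("test_123", join_udf_safe(col("test_123"))).show()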
    
  • 2021-01-01 18:24

    While you can use a UserDefinedFunction, it is very inefficient, because the data has to be serialized to Python and back for every row. It is better to use the built-in concat_ws function instead:

    from pyspark.sql.functions import concat_ws
    
    # concat_ws joins the array elements with the separator entirely inside the JVM
    df.withColumn("test_123", concat_ws(",", "test_123")).show()
    
    +----+----------------+
    |uuid|        test_123|
    +----+----------------+
    |   1|test,test2,test3|
    |   2|test4,test,test6|
    |   3|test6,test9,t55o|
    +----+----------------+
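
    On Spark 2.4 and later, the built-in array_join function does the same thing; a minimal sketch, assuming that version is available:

    from pyspark.sql.functions import array_join
    
    # array_join(col, delimiter) also concatenates array elements without a Python udf
    df.withColumn("test_123", array_join("test_123", ",")).show()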
    