combine text from multiple rows in pyspark

无人及你 2020-12-05 21:52

I created a PySpark dataframe using the following code

testlist = [
             {"category": "A", "name": "A1"},
             {"category": "A", "name": "A2"},
             {"category": "B", "name": "B1"},
             {"category": "B", "name": "B2"}
]
# build the DataFrame (assumes an active SparkSession named `spark`)
spark_df = spark.createDataFrame(testlist)


        
2 Answers
  • 2020-12-05 22:19

    One option is to use pyspark.sql.functions.collect_list() as the aggregate function.

    from pyspark.sql.functions import collect_list
    grouped_df = spark_df.groupby('category').agg(collect_list('name').alias("name"))
    

    This will collect the values for name into a list and the resultant output will look like:

    grouped_df.show()
    #+---------+---------+
    #|category |name     |
    #+---------+---------+
    #|A        |[A1, A2] |
    #|B        |[B1, B2] |
    #+---------+---------+
    

    Update 2019-06-10: If you want your output as a concatenated string, you can use pyspark.sql.functions.concat_ws to join the values of the collected list, which is better than using a udf:

    from pyspark.sql.functions import concat_ws
    
    grouped_df.withColumn("name", concat_ws(", ", "name")).show()
    #+---------+-------+
    #|category |name   |
    #+---------+-------+
    #|A        |A1, A2 |
    #|B        |B1, B2 |
    #+---------+-------+
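
    If you prefer, the aggregation and the concatenation can be combined into a single step by wrapping collect_list inside concat_ws. A minimal sketch, assuming the same spark_df as above:

    from pyspark.sql.functions import collect_list, concat_ws

    # collect the names per category and join them with ", " in one aggregation
    spark_df.groupby('category').agg(
        concat_ws(", ", collect_list('name')).alias("name")
    ).show()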
    

    Original Answer: If you want your output as a concatenated string, you'd have to use a udf. For example, you can first do the groupBy() as above and then apply a udf to join the collected list:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # udf that joins the collected list of names into a single string
    concat_list = udf(lambda lst: ", ".join(lst), StringType())
    
    grouped_df.withColumn("name", concat_list("name")).show()
    #+---------+-------+
    #|category |name   |
    #+---------+-------+
    #|A        |A1, A2 |
    #|B        |B1, B2 |
    #+---------+-------+
    
  • 2020-12-05 22:30

    Another option is to use reduceByKey on the underlying RDD to join the values per key:

    >>> df.rdd.reduceByKey(lambda x,y: x+','+y).toDF().show()
    +---+-----+
    | _1|   _2|
    +---+-----+
    |  A|A1,A2|
    |  B|B1,B2|
    +---+-----+
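
    The result gets the default column names _1 and _2; if you want the original names back, you can pass them to toDF(). A minimal sketch, assuming the same df:

    >>> df.rdd.reduceByKey(lambda x, y: x + ',' + y).toDF(["category", "name"]).show()
    +--------+-----+
    |category| name|
    +--------+-----+
    |       A|A1,A2|
    |       B|B1,B2|
    +--------+-----+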
    