Split column of list into multiple columns in the same PySpark dataframe

Submitted by 只愿长相守 on 2020-01-02 07:19:47

Question


I have the following dataframe, which contains two columns:

The 1st column holds column names.

The 2nd column holds a list of quantile values.

+--------------------+--------------------+
|              Column|            Quantile|
+--------------------+--------------------+
|                rent|[4000.0, 4500.0, ...|
|     is_rent_changed|[0.0, 0.0, 0.0, 0...|
|               phone|[7.022372888E9, 7...|
|          Area_house|[1000.0, 1000.0, ...|
|       bedroom_count|[1.0, 1.0, 1.0, 1...|
|      bathroom_count|[1.0, 1.0, 1.0, 1...|
|    maintenance_cost|[0.0, 0.0, 0.0, 0...|
|            latitude|[12.8217605, 12.8...|
|            Max_rent|[9000.0, 10000.0,...|
|                Beds|[2.0, 2.0, 2.0, 2...|
|                Area|[1000.0, 1000.0, ...|
|            Avg_Rent|[3500.0, 4000.0, ...|
|      deposit_amount|[0.0, 0.0, 0.0, 0...|
|          commission|[0.0, 0.0, 0.0, 0...|
|        monthly_rent|[0.0, 0.0, 0.0, 0...|
|is_min_rent_guara...|[0.0, 0.0, 0.0, 0...|
|min_guarantee_amount|[0.0, 0.0, 0.0, 0...|
|min_guarantee_dur...|[1.0, 1.0, 1.0, 1...|
|        furnish_cost|[0.0, 0.0, 0.0, 0...|
|  owner_furnish_part|[0.0, 0.0, 0.0, 0...|
+--------------------+--------------------+

How do I split the second column into multiple columns while preserving the same dataset?

I can access the values using:

univar_df10.select("Column", univar_df10.Quantile[0],univar_df10.Quantile[1],univar_df10.Quantile[2]).show()

+--------------------+-------------+-------------+------------+
|              Column|  Quantile[0]|  Quantile[1]| Quantile[2]|
+--------------------+-------------+-------------+------------+
|                rent|       4000.0|       4500.0|      5000.0|
|     is_rent_changed|          0.0|          0.0|         0.0|
|               phone|7.022372888E9|7.042022842E9|7.07333021E9|
|          Area_house|       1000.0|       1000.0|      1000.0|
|       bedroom_count|          1.0|          1.0|         1.0|
|      bathroom_count|          1.0|          1.0|         1.0|
|    maintenance_cost|          0.0|          0.0|         0.0|
|            latitude|   12.8217605|   12.8490502|   12.863517|
|            Max_rent|       9000.0|      10000.0|     11500.0|
|                Beds|          2.0|          2.0|         2.0|
|                Area|       1000.0|       1000.0|      1000.0|
|            Avg_Rent|       3500.0|       4000.0|      4125.0|
|      deposit_amount|          0.0|          0.0|         0.0|
|          commission|          0.0|          0.0|         0.0|
|        monthly_rent|          0.0|          0.0|         0.0|
|is_min_rent_guara...|          0.0|          0.0|         0.0|
|min_guarantee_amount|          0.0|          0.0|         0.0|
|min_guarantee_dur...|          1.0|          1.0|         1.0|
|        furnish_cost|          0.0|          0.0|         0.0|
|  owner_furnish_part|          0.0|          0.0|         0.0|
+--------------------+-------------+-------------+------------+
only showing top 20 rows

I want my new dataframe to split my 2nd column of lists into multiple columns, like the dataset above. Thanks in advance.


Answer 1:


Assuming (your question has been flagged for closure as unclear) that your issue is that the lists in your Quantile column may be long, so that building the respective select command by hand is inconvenient, here is a solution that passes a list comprehension as the argument to select:

spark.version
# u'2.2.1'

# make some toy data
from pyspark.sql import Row
df = spark.createDataFrame([Row([0,45,63,0,0,0,0]),
                            Row([0,0,0,85,0,69,0]),
                            Row([0,89,56,0,0,0,0])],
                            ['features'])

df.show()
# result:
+-----------------------+
|features               |
+-----------------------+
|[0, 45, 63, 0, 0, 0, 0]|
|[0, 0, 0, 85, 0, 69, 0]|
|[0, 89, 56, 0, 0, 0, 0]|
+-----------------------+

# get the length of your lists, if you don't know it already (here is 7):
length = len(df.select('features').take(1)[0][0])
length
# 7

df.select([df.features] + [df.features[i] for i in range(length)]).show()
# result:
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|            features|features[0]|features[1]|features[2]|features[3]|features[4]|features[5]|features[6]|  
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|[0, 45, 63, 0, 0,...|          0|         45|         63|          0|          0|          0|          0| 
|[0, 0, 0, 85, 0, ...|          0|          0|          0|         85|          0|         69|          0|
|[0, 89, 56, 0, 0,...|          0|         89|         56|          0|          0|          0|          0|
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

So, in your case,

univar_df10.select([univar_df10.Column] + [univar_df10.Quantile[i] for i in range(length)])

should do the job, after you have calculated length as

length = len(univar_df10.select('Quantile').take(1)[0][0])


Source: https://stackoverflow.com/questions/49650907/split-column-of-list-into-multiple-columns-in-the-same-pyspark-dataframe
