Adding a List element as a column to existing pyspark dataframe


Question


I have a list, lists = [0,1,2,3,5,6,7]. The values are not sequential. I have a PySpark dataframe with 9 columns.

+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|               date|ftt (°c)|rtt (°c)|fbt (°c)|rbt (°c)|fmt (°c)|rmt (°c)|fmhhumidityunit|index|Diff|
+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|2019-02-01 05:29:47|     NaN|     NaN|     NaN|     NaN|     NaN|     NaN|            NaN|    0| NaN|
|2019-02-01 05:29:17|     NaN|     NaN|     NaN|     NaN|     NaN|     NaN|            NaN|    1| NaN|

I need to add my list as a column to my existing dataframe. My list is not in order, so I am not able to use a udf. Is there a way to do it? Please help me; I want it to look like this:

+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+-----+
|               date|ftt (°c)|rtt (°c)|fbt (°c)|rbt (°c)|fmt (°c)|rmt (°c)|fmhhumidityunit|index|Diff|lists|
+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+-----+
|2019-02-01 05:29:47|     NaN|     NaN|     NaN|     NaN|     NaN|     NaN|            NaN|    0| NaN|    0|
|2019-02-01 05:29:17|     NaN|     NaN|     NaN|     NaN|     NaN|     NaN|            NaN|    1| NaN|    1|

Answer 1:


I'm not sure whether this is what you were expecting or something else. If the number of list items and dataframe rows must be the same, here is a simple approach.

For a given sample dataframe with three columns:

l = [(1, 'DEF', 33), (2, 'KLM', 22), (3, 'ABC', 32), (4, 'XYZ', 77)]
df = spark.createDataFrame(l, ['id', 'value', 'age'])

Let's say we have this list:

lists = [5, 6, 7, 8]

We can create an RDD from this list, zip it with the dataframe's RDD, and apply a map function over the result:

listrdd = sc.parallelize(lists)

# Python 3 removed tuple unpacking in lambdas, so unpack the (Row, value) pair by index.
newdf = df.rdd.zip(listrdd).map(lambda pair: list(pair[0]) + [pair[1]]).toDF(["id", "Value", "age", "List_element"])

>>> ziprdd=df.rdd.zip(listrdd)
>>> ziprdd.take(50)
[(Row(id=1, value=u'DEF', age=33), 5), (Row(id=2, value=u'KLM', age=22), 6), (Row(id=3, value=u'ABC', age=32), 7), (Row(id=4, value=u'XYZ', age=77), 8)]

The zip function returns key-value pairs in which the first element holds data from the first RDD and the second element holds data from the second RDD. I convert the first element (the Row) to a list and concatenate the second element onto it.

It's dynamic and works for any number of columns, but the number of list elements and dataframe rows has to be the same.

>>> newdf.show()
+---+-----+----+------------+
| id|Value| age|List_element|
+---+-----+----+------------+
|  1|  DEF|  33|           5|
|  2|  KLM|  22|           6|
|  3|  ABC|  32|           7|
|  4|  XYZ|  77|           8|
+---+-----+----+------------+
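
Since the row and element counts must match, a small guard before zipping can help; this is just a sketch, and note that df.count() triggers an extra Spark job:

# Fail fast if the list length and dataframe row count differ,
# since rdd.zip() would otherwise raise an error mid-job.
assert df.count() == len(lists), "list length and row count must match"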

Note: Both RDDs must have the same partition count for the zip method to work; otherwise you will get an error:

ValueError: Can only zip with RDD which has the same number of partitions
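
One way to line the partition counts up is to create the list RDD with the dataframe's own partition count. This is only a sketch: zip additionally assumes each pair of partitions holds the same number of elements, which parallelize does not strictly guarantee for arbitrary data.

# Match the list RDD's partition count to the dataframe's underlying RDD.
listrdd = sc.parallelize(lists, df.rdd.getNumPartitions())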



Answer 2:


You can join two DataFrames, like this:

# Build df2 with an 'index' column matching the main df, plus the list values.
df2 = spark.createDataFrame(list(enumerate(lists)), ['index', 'lists'])
df = df.join(df2, on=['index']).drop('index')

df2 will contain the columns you wish to add to the main df.
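
Unlike the zip approach in Answer 1, a join does not depend on matching partition counts or on how rows are distributed across partitions; the trade-off is that both DataFrames need a shared key column (here, index) to join on.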



Source: https://stackoverflow.com/questions/58188495/adding-a-list-element-as-a-column-to-existing-pyspark-dataframe
