Question
Here is the code to create a pyspark.sql DataFrame:
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]]), columns=['a','b','c'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)
So that sparkdf looks like:
 a  b  c
 1  2  3
 4  5  6
 7  8  9
10 11 12
Now I would like to add a numpy array (or even a list) as a new column:
new_col = np.array([20,20,20,20])
But the standard way
sparkdf = sparkdf.withColumn('newcol', new_col)
fails. Probably a UDF is the way to go, but I don't know how to create a UDF that assigns a different value to each DataFrame row, i.e. one that iterates through new_col. I have looked at other pyspark and pyspark.sql questions but couldn't find a solution. Also, I need to stay within pyspark.sql, so a Scala solution won't work. Thanks!
Answer 1:
Assuming that the data frame is sorted to match the order of values in the array, you can zip the RDDs and rebuild the data frame as follows:
n = sparkdf.rdd.getNumPartitions()

# Parallelize and cast to plain integer (np.int64 won't work)
new_col = sc.parallelize(np.array([20,20,20,20]), n).map(int)

def process(pair):
    # Merge the original Row (as a dict) with the new column value
    row_dict = pair[0].asDict()
    row_dict["new_col"] = pair[1]
    return row_dict

rdd = (sparkdf
    .rdd            # Extract RDD
    .zip(new_col)   # Zip with new col
    .map(process))  # Add new column

sqlContext.createDataFrame(rdd)  # Rebuild data frame
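Note that RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, which is why new_col is parallelized into n partitions above.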
You can also use joins:
new_col = sqlContext.createDataFrame(
    list(zip(range(1, 5), [20] * 4)),
    ("rn", "new_col"))

sparkdf.registerTempTable("df")
sparkdf_indexed = sqlContext.sql(
    # Make sure we have a specific order and add a row number
    "SELECT row_number() OVER (ORDER BY a, b, c) AS rn, * FROM df")

(sparkdf_indexed
    .join(new_col, new_col.rn == sparkdf_indexed.rn)
    .drop(new_col.rn))
but the window function component is not scalable (a global ORDER BY without PARTITION BY moves all rows to a single partition) and should be avoided with larger datasets.
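If you still need a row index on a larger dataset, a minimal sketch of an alternative (not part of the original answer, and assuming the existing row order of sparkdf is the order you want) is to number the rows with RDD.zipWithIndex, which avoids the global sort; the index starts at 0, so the lookup table uses range(4):

from pyspark.sql import Row

# Number rows in their current order using zipWithIndex (0-based)
indexed = (sparkdf.rdd
    .zipWithIndex()  # produces (Row, index) pairs
    .map(lambda pair: Row(rn=pair[1], **pair[0].asDict())))
sparkdf_indexed = sqlContext.createDataFrame(indexed)

new_col = sqlContext.createDataFrame(
    list(zip(range(4), [20] * 4)),
    ("rn", "new_col"))

(sparkdf_indexed
    .join(new_col, new_col.rn == sparkdf_indexed.rn)
    .drop(new_col.rn))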
Of course, if all you need is a column with a single constant value, you can simply use lit:
import pyspark.sql.functions as f
sparkdf.withColumn("new_col", f.lit(20))
but I assume that is not the case.
Source: https://stackoverflow.com/questions/31930364/how-do-you-add-a-numpy-array-as-a-new-column-to-a-pyspark-sql-dataframe