Question
I'm new to Python and PySpark. I have a dataframe in PySpark like the following:
## +---+---+------+
## | x1| x2| x3 |
## +---+---+------+
## | 0| a | 13.0|
## | 2| B | -33.0|
## | 1| B | -63.0|
## +---+---+------+
I have an array: arr = [10, 12, 13]
I want to create a column x4 in the dataframe such that it should have the corresponding values from the list based on the values of x1 as indices. The final dataset should look like:
## +---+---+------+-----+
## | x1| x2| x3 | x4 |
## +---+---+------+-----+
## | 0| a | 13.0| 10 |
## | 2| B | -33.0| 13 |
## | 1| B | -63.0| 12 |
## +---+---+------+-----+
I have tried using the following code to achieve so:
df.withColumn("x4", lit(arr[col('x1')])).show()
However, I am getting an error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Is there any way I can achieve this efficiently?
Answer 1:
Since you are effectively joining the indices of your array against your original DataFrame, one approach is to convert the array into a DataFrame, generate `row_number() - 1`
(which becomes your index column), and then join the two DataFrames together.
from pyspark.sql import Row
# Create original DataFrame `df`
df = spark.createDataFrame(
    [(0, "a", 13.0), (2, "B", -33.0), (1, "B", -63.0)], ("x1", "x2", "x3"))
df.createOrReplaceTempView("df")
# Create column "x4"
row = Row("x4")
# Take the array
arr = [10, 12, 13]
# Convert Array to RDD, and then create DataFrame
rdd = sc.parallelize(arr)
df2 = rdd.map(row).toDF()
df2.createOrReplaceTempView("df2")
# Create indices via row number
df3 = spark.sql("SELECT (row_number() OVER (ORDER by x4))-1 as indices, * FROM df2")
df3.createOrReplaceTempView("df3")
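As a plain-Python sketch (separate from the Spark code above), the `row_number() OVER (ORDER BY x4) - 1` step simply pairs each value with its zero-based rank, which is what `enumerate` over the sorted values does. Note that this only reproduces the list's original positions because `arr` happens to be sorted ascending; an unsorted array would get scrambled indices.

```python
arr = [10, 12, 13]

# (indices, x4) pairs -- the rows df3 would contain after the window query.
# ORDER BY x4 ranks by value, so we enumerate over the sorted list.
df3_rows = [(i, v) for i, v in enumerate(sorted(arr))]
print(df3_rows)  # [(0, 10), (1, 12), (2, 13)]
```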
Now that you have the two DataFrames `df` and `df3`, you can run the SQL query below to join them together.
SELECT a.x1, a.x2, a.x3, b.x4 FROM df a JOIN df3 b ON b.indices = a.x1
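To sanity-check the join logic without a Spark session, here is a hypothetical pure-Python equivalent of that query: build the `(indices, x4)` mapping and look up each row's `x1` in it.

```python
arr = [10, 12, 13]
rows = [(0, "a", 13.0), (2, "B", -33.0), (1, "B", -63.0)]

# The ON b.indices = a.x1 join condition, expressed as a dict lookup.
x4_by_index = dict(enumerate(sorted(arr)))
joined = [(x1, x2, x3, x4_by_index[x1]) for (x1, x2, x3) in rows]
print(joined)
# [(0, 'a', 13.0, 10), (2, 'B', -33.0, 13), (1, 'B', -63.0, 12)]
```

This matches the desired `x4` column in the question's expected output.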
Note: there is also a good reference answer on adding columns to DataFrames.
Source: https://stackoverflow.com/questions/40609845/create-a-column-in-a-pyspark-dataframe-using-a-list-whose-indices-are-present-in