Question
I have a dataframe and would like to add columns to it based on values from a list.
The list of values will vary from 3 to 50 entries. I'm new to PySpark and I'm trying to append these values as new (empty) columns to my df.
I've seen code recommended for adding [one column][1] to a dataframe, but not multiple columns from a list.
mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId', 'ConformedLeaseRecoveryTypeName', 'ConformedLeaseStatusName', 'ConformedLeaseTypeName']
My code below only appends one column.
for new_col in mylist:
    new = datasetMatchedDomains.withColumn(new_col, f.lit(0))
new.show()
[1]: https://stackoverflow.com/questions/48164206/pyspark-adding-a-column-from-a-list-of-values-using-a-udf
Answer 1:
We can also use a list comprehension with .select to add new columns to the dataframe.
Example:
#sample dataframe
df.show()
#+---+-----+---+---+----+
#| _1| _2| _3| _4| _5|
#+---+-----+---+---+----+
#| |12343| |9 | 0|
#+---+-----+---+---+----+
mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId', 'ConformedLeaseRecoveryTypeName', 'ConformedLeaseStatusName', 'ConformedLeaseTypeName']
from pyspark.sql.functions import col, lit

cols = [col(col_name) for col_name in df.columns] + [lit(0).name(col_name) for col_name in mylist]
#in case you want to cast the new fields:
cols = [col(col_name) for col_name in df.columns] + [lit(0).cast("string").name(col_name) for col_name in mylist]
#adding new columns and selecting existing columns
df.select(cols).show()
#+---+-----+---+---+----+----------------------------+----------------------+--------------------+------------------------------+------------------------+----------------------+
#| _1| _2| _3| _4| _5|ConformedLeaseRecoveryTypeId|ConformedLeaseStatusId|ConformedLeaseTypeId|ConformedLeaseRecoveryTypeName|ConformedLeaseStatusName|ConformedLeaseTypeName|
#+---+-----+---+---+----+----------------------------+----------------------+--------------------+------------------------------+------------------------+----------------------+
#| |12343| |9 | 0| 0| 0| 0| 0| 0| 0|
#+---+-----+---+---+----+----------------------------+----------------------+--------------------+------------------------------+------------------------+----------------------+
Answer 2:
You can just go through a list in a loop, updating your df:
for col_name in mylist:
    datasetMatchedDomains = datasetMatchedDomains.withColumn(col_name, lit(0))
Interesting follow-up: if that works, try doing it with reduce :)
P.S. Regarding your edit: withColumn does not modify the original DataFrame; it returns a new one every time, which you were overwriting on each loop iteration.
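To make the reduce suggestion concrete: since withColumn returns a new DataFrame, the loop above is really a fold, and functools.reduce expresses it in one call. The sketch below illustrates the pattern with a tiny stand-in class (Frame and with_column are made up for illustration; they are not PySpark APIs) so it runs without a Spark session; the final comment shows what the equivalent PySpark call would look like.

```python
from functools import reduce

# Minimal stand-in for a DataFrame, to illustrate the fold pattern.
# Like PySpark's withColumn, with_column returns a NEW object rather
# than mutating in place (Frame/with_column are hypothetical names).
class Frame:
    def __init__(self, columns):
        self.columns = list(columns)

    def with_column(self, name, value=0):
        # Return a fresh Frame with one extra column appended.
        return Frame(self.columns + [name])

mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId',
          'ConformedLeaseTypeId']

# Fold the column list over the frame: each step feeds the previous
# result into the next with_column call, so nothing gets overwritten.
df = Frame(['_1', '_2'])
result = reduce(lambda acc, c: acc.with_column(c), mylist, df)
print(result.columns)
# → ['_1', '_2', 'ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId']

# The same fold in PySpark would be:
# reduce(lambda acc, c: acc.withColumn(c, lit(0)), mylist, datasetMatchedDomains)
```

Because reduce threads the accumulator through every step, it avoids the bug in the original question, where each iteration started over from the unmodified dataframe.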
Source: https://stackoverflow.com/questions/61757408/pyspark-adding-columns-from-a-list