Question
I have a dataframe and would like to add columns to it based on values from a list.
The list of values will vary from 3 to 50 entries. I'm new to PySpark and I'm trying to append these values as new (empty) columns to my df.
I've seen code recommended for adding [one column][1] to a dataframe, but not multiple columns from a list.
mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId', 'ConformedLeaseRecoveryTypeName', 'ConformedLeaseStatusName', 'ConformedLeaseTypeName']
My code below only appends one column.
for new_col in mylist:
    new = datasetMatchedDomains.withColumn(new_col, f.lit(0))
new.show()
[1]: https://stackoverflow.com/questions/48164206/pyspark-adding-a-column-from-a-list-of-values-using-a-udf
Answer 1:
We can also use a list comprehension with .select to add new columns to the dataframe.
Example:
#sample dataframe
df.show()
#+---+-----+---+---+----+
#| _1| _2| _3| _4| _5|
#+---+-----+---+---+----+
#| |12343| |9 | 0|
#+---+-----+---+---+----+
mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId', 'ConformedLeaseRecoveryTypeName', 'ConformedLeaseStatusName', 'ConformedLeaseTypeName']
from pyspark.sql.functions import col, lit

cols = [col(col_name) for col_name in df.columns] + [lit(0).name(col_name) for col_name in mylist]
#in case you want to cast the new fields:
cols = [col(col_name) for col_name in df.columns] + [lit(0).cast("string").name(col_name) for col_name in mylist]
#adding new columns and selecting existing columns
df.select(cols).show()
#+---+-----+---+---+----+----------------------------+----------------------+--------------------+------------------------------+------------------------+----------------------+
#| _1| _2| _3| _4| _5|ConformedLeaseRecoveryTypeId|ConformedLeaseStatusId|ConformedLeaseTypeId|ConformedLeaseRecoveryTypeName|ConformedLeaseStatusName|ConformedLeaseTypeName|
#+---+-----+---+---+----+----------------------------+----------------------+--------------------+------------------------------+------------------------+----------------------+
#| |12343| |9 | 0| 0| 0| 0| 0| 0| 0|
#+---+-----+---+---+----+----------------------------+----------------------+--------------------+------------------------------+------------------------+----------------------+
Answer 2:
You can just go through a list in a loop, updating your df:
for col_name in mylist:
    datasetMatchedDomains = datasetMatchedDomains.withColumn(col_name, lit(0))
Interesting follow-up: if that works, try doing it with reduce :)
P.S. Regarding your edit: withColumn does not modify the original DataFrame; it returns a new one every time, which you were overwriting on each loop iteration.
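To make the reduce suggestion concrete: since withColumn returns a new DataFrame, the loop above is really a fold, and functools.reduce expresses it in one call. The sketch below illustrates the pattern with a tiny stand-in class (Frame and with_column are made up for illustration; they are not PySpark APIs) so it runs without a Spark session; the final comment shows what the equivalent PySpark call would look like.

```python
from functools import reduce

# Minimal stand-in for a DataFrame, to illustrate the fold pattern.
# Like PySpark's withColumn, with_column returns a NEW object rather
# than mutating in place (Frame/with_column are hypothetical names).
class Frame:
    def __init__(self, columns):
        self.columns = list(columns)

    def with_column(self, name, value=0):
        # Return a fresh Frame with one extra column appended.
        return Frame(self.columns + [name])

mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId',
          'ConformedLeaseTypeId']

# Fold the column list over the frame: each step feeds the previous
# result into the next with_column call, so nothing gets overwritten.
df = Frame(['_1', '_2'])
result = reduce(lambda acc, c: acc.with_column(c), mylist, df)
print(result.columns)
# → ['_1', '_2', 'ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId']

# The same fold in PySpark would be:
# reduce(lambda acc, c: acc.withColumn(c, lit(0)), mylist, datasetMatchedDomains)
```

Because reduce threads the accumulator through every step, it avoids the bug in the original question, where each iteration started over from the unmodified dataframe.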
Source: https://stackoverflow.com/questions/61757408/pyspark-adding-columns-from-a-list