Question
I need to be able to compare two dataframes using multiple columns.
PySpark attempt
I decided to filter the reference dataframe by one level (comparing reference_df.PrimaryLookupAttributeName to df1.LeaseStatus).
How can I iterate over primaryLookupAttributeName_List and avoid hardcoding LeaseStatus?
Next, I get the PrimaryLookupAttributeValue values from the reference table into a list, compare them to df1, and output a new dataframe with the found/matched values.
I hardcoded FOUND because I'm not sure how to make it print the corresponding matched value from primaryAttributeValue_List.
I tried unpacking with *primaryAttributeValue_List, but got TypeError: when() takes 2 positional arguments but 34 were given
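That TypeError comes from the unpacking itself: when() accepts exactly two positional arguments (a condition and a value), so *primaryAttributeValue_List passes every list element as a separate argument. A minimal plain-Python sketch of the same failure (the when stub here is hypothetical, standing in for pyspark.sql.functions.when, and the list is a toy subset):

```python
# Stand-in with the same 2-positional-argument shape as pyspark.sql.functions.when
def when(condition, value):
    return (condition, value)

primaryAttributeValue_List = ['Archive', 'Open', 'Draft Returned']  # 34 items in the real case

try:
    # Unpacking passes each element as its own positional argument
    when(*primaryAttributeValue_List)
except TypeError as e:
    print(e)  # when() takes 2 positional arguments but 3 were given
```

With 34 values in the real list, the message reads "... but 34 were given".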
#list of Attribute names to compare and match to filter reference table & df1
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']
#filter reference table
AttributeLookup = filterDomainItemLookUp2.where((filterDomainItemLookUp2.PrimaryLookupAttributeName == "LeaseStatus"))
display(AttributeLookup)
# get PrimaryLookupAttributeValue values from reference table in a dictionary to compare them to df1.
primaryAttributeValue_List = [ p.PrimaryLookupAttributeValue for p in AttributeLookup.select('PrimaryLookupAttributeValue').distinct().collect() ]
primaryAttributeValue_List # list of values; varies by filter
Out: ['Archive',
'Pending Security Deposit',
'Partially Abandoned',
'Revision Contract Review',
'Open',
'Draft Accounting In Review',
'Draft Returned']
# compare df1 to PrimaryLookupAttributeValue
output = dataset_standardFalse2.withColumn('ConformedLeaseStatusName', f.when(dataset_standardFalse2['LeaseStatus'].isin(primaryAttributeValue_List), "FOUND").otherwise("TBD"))
display(output)
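The lookup being asked for can be sketched in plain Python (hypothetical toy rows, not the actual dataframes): build a dict keyed by (attribute name, attribute value) from the reference rows, then look each df1 cell up in it instead of emitting a hardcoded FOUND:

```python
# Hypothetical toy versions of the two tables
reference_rows = [
    ("LeaseStatus", "Archive", "Expired"),
    ("LeaseStatus", "Terminated", "Terminated"),
    ("LeaseRecoveryType", "Gross", "Gross"),
]
df1_rows = [
    {"LeaseStatus": "Archive", "LeaseRecoveryType": "Gross"},
    {"LeaseStatus": "Open", "LeaseRecoveryType": "Net"},
]
attribute_names = ["LeaseStatus", "LeaseRecoveryType"]

# (PrimaryLookupAttributeName, PrimaryLookupAttributeValue) -> OutputItemNameByValue
lookup = {(name, value): output for name, value, output in reference_rows}

for row in df1_rows:
    for attr in attribute_names:  # iterate the list; no hardcoded 'LeaseStatus'
        row["Matched[{}]".format(attr)] = lookup.get((attr, row[attr]), "TBD")

print(df1_rows[0]["Matched[LeaseStatus]"])  # Expired
```

The answer below does the same thing inside Spark with create_map, so the lookup runs on the executors instead of in driver-side Python.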
reference_df
+--------------+----------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-------------------+-----------------------+---------------------------+-----------------------------+-------------------+---------------------+---------------------+
|SourceSystemId|SourceSystemName| Portfolio|DomainId| DomainName| PrimaryLookupEntity|PrimaryLookupAttributeName|SecondaryLookupAttributeName|StandardDomainMapId|StandardDomainItemMapId|PrimaryLookupAttributeValue|SecondaryLookupAttributeValue|OutputItemIdByValue|OutputItemCodeByValue|OutputItemNameByValue|
+--------------+----------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-------------------+-----------------------+---------------------------+-----------------------------+-------------------+---------------------+---------------------+
| 4| ABC123|ALL_PORTFOLIOS| 100022|LeaseRecoveryType|ABC123_FF_Leases.csv| LeaseRecoveryType | | 9| 329| Gross-modified| | 15| | Modified Gross|
| 4| ABC123|ALL_PORTFOLIOS| 100022|LeaseRecoveryType|ABC123_FF_Leases.csv| LeaseRecoveryType| | 9| 330| Gross | | 11| | Gross|
| 4| ABC123|ALL_PORTFOLIOS| 100022|LeaseRecoveryType|ABC123_FF_Leases.csv| LeaseRecoveryType| | 9| 331| Gross w/base year| | 18| | Modified Gross|
| 4| ABC123|ALL_PORTFOLIOS| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | 10| 1872| Abandoned| | 10| | Active|
| 4| ABC123|ALL_PORTFOLIOS| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | 10| 332| Terminated| | 10| | Terminated|
| 4| ABC123|ALL_PORTFOLIOS| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | 10| 1873| Archive| | 11| | Expired|
| 4| ABC123|ALL_PORTFOLIOS| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | 10| 333| Draft | | 10| | Pending|
+--------------+----------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-------------------+-----------------------+---------------------------+-----------------------------+-------------------+---------------------+---------------------+
df1
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+
|SourceSystemName|       Portfolio|SourceLeaseID|SourcePropertyID|LeaseStatus|LeaseRecoveryType|    LeaseType| PortfolioRule|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+
|          ABC123|HumABC Portfolio|         1814|            1865| Terminated|            Gross|Expense Lease|ALL_PORTFOLIOS|
|          ABC123|HumABC Portfolio|         1508|            1866|    Archive|Gross w/base year|Expense Lease|ALL_PORTFOLIOS|
|          ABC123|HumABC Portfolio|         1826|            1875|     Draft |   Gross-modified|Expense Lease|ALL_PORTFOLIOS|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+
expected outcome
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
|SourceSystemName|       Portfolio|SourceLeaseID|SourcePropertyID|LeaseStatus|LeaseRecoveryType|    LeaseType| PortfolioRule|Matched[LeaseStatus]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
|          ABC123|HumABC Portfolio|         1814|            1865| Terminated|            Gross|Expense Lease|ALL_PORTFOLIOS|                               Terminated|                                          Gross|
|          ABC123|HumABC Portfolio|         1508|            1866|    Archive|Gross w/base year|Expense Lease|ALL_PORTFOLIOS|                                  Expired|                                 Modified Gross|
|          ABC123|HumABC Portfolio|         1826|            1875|     Draft |   Gross-modified|Expense Lease|ALL_PORTFOLIOS|                                  Pending|                                 Modified Gross|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
outcome w/ pySpark attempt
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
|SourceSystemName|       Portfolio|SourceLeaseID|SourcePropertyID|LeaseStatus|LeaseRecoveryType|    LeaseType| PortfolioRule|Matched[LeaseStatus]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
|          ABC123|HumABC Portfolio|         1814|            1865| Terminated|            Gross|Expense Lease|ALL_PORTFOLIOS|                                    FOUND|                                            TBD|
|          ABC123|HumABC Portfolio|         1508|            1866|    Archive|Gross w/base year|Expense Lease|ALL_PORTFOLIOS|                                    FOUND|                                            TBD|
|          ABC123|HumABC Portfolio|         1826|            1875|     Draft |   Gross-modified|Expense Lease|ALL_PORTFOLIOS|                                    FOUND|                                            TBD|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
Answer 1:
From my understanding, you can create a map based on columns from reference_df (I assumed this is not a very big dataframe):
map_key = concat_ws('\0', PrimaryLookupAttributeName, PrimaryLookupAttributeValue)
map_value = OutputItemNameByValue
and then use this mapping to get the corresponding values in df1:
from itertools import chain
from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map
d = reference_df.agg(collect_set(array(concat_ws('\0','PrimaryLookupAttributeName','PrimaryLookupAttributeValue'), 'OutputItemNameByValue')).alias('m')).first().m
#[['LeaseStatus\x00Abandoned', 'Active'],
# ['LeaseRecoveryType\x00Gross-modified', 'Modified Gross'],
# ['LeaseStatus\x00Archive', 'Expired'],
# ['LeaseStatus\x00Terminated', 'Terminated'],
# ['LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'],
# ['LeaseStatus\x00Draft', 'Pending'],
# ['LeaseRecoveryType\x00Gross', 'Gross']]
mappings = create_map([lit(i) for i in chain.from_iterable(d)])
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']
df1.select("*", *[ mappings[concat_ws('\0', lit(c), col(c))].alias("Matched[{}]OutputItemNameByValue".format(c)) for c in primaryLookupAttributeName_List ]).show()
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|SourceSystemName|...|Matched[LeaseType]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|Matched[LeaseStatus]OutputItemNameByValue|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
| ABC123|...| null| Gross| Terminated|
| ABC123|...| null| Modified Gross| Expired|
| ABC123|...| null| Modified Gross| Pending|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
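create_map expects a flat, alternating key/value sequence, and chain.from_iterable is what turns the collected [key, value] pairs into that shape. In plain Python (a toy subset mirroring the collected d above):

```python
from itertools import chain

# Collected [key, value] pairs, in the same shape as `d` above (toy subset)
d = [['LeaseStatus\x00Archive', 'Expired'],
     ['LeaseRecoveryType\x00Gross', 'Gross']]

# Flatten the pairs into the alternating key1, value1, key2, value2, ... list
# that create_map consumes
flat = list(chain.from_iterable(d))
print(flat)
# ['LeaseStatus\x00Archive', 'Expired', 'LeaseRecoveryType\x00Gross', 'Gross']
```

Each element is then wrapped in lit() so create_map builds a literal map column.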
UPDATE: to set column names using information retrieved through the reference_df dataframe:
# a list of domains to retrieve
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']
# mapping from domain names to column names: using `reference_df`.`TargetAttributeForName`
NEWprimaryLookupAttributeName_List = dict(reference_df.filter(reference_df['DomainName'].isin(primaryLookupAttributeName_List)).agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)
test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in NEWprimaryLookupAttributeName_List.items()])
Note-1: it is better to loop through primaryLookupAttributeName_List so the order of the columns is preserved, and in case any entry in primaryLookupAttributeName_List is missing from the dictionary, we can set a default column name, e.g. Unknown-<col>. In the old method, columns with missing entries are simply discarded.
test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List])
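The dict.get(c, default) fallback in Note-1 behaves like this on its own (toy mapping; the real one is built from reference_df):

```python
# Toy domain-name -> target-column mapping; the real one comes from reference_df
NEWprimaryLookupAttributeName_List = {'LeaseStatus': 'ConformedLeaseStatusName'}
primaryLookupAttributeName_List = ['LeaseType', 'LeaseStatus']

# Missing entries fall back to a visible default instead of being dropped
names = [NEWprimaryLookupAttributeName_List.get(c, "Unknown-{}".format(c))
         for c in primaryLookupAttributeName_List]
print(names)  # ['Unknown-LeaseType', 'ConformedLeaseStatusName']
```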
Note-2: per comments, to overwrite the existing column names (untested):
(1) use select:
test = dataset_standardFalse2.select([c for c in dataset_standardFalse2.columns if c not in NEWprimaryLookupAttributeName_List.values()] + [ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List]).show()
(2) use reduce (not recommended if the List is very long):
from functools import reduce
df_new = reduce(lambda d, c: d.withColumn(NEWprimaryLookupAttributeName_List.get(c, "Unknown-{}".format(c)), mappings[concat_ws('\0', lit(c), col(c))]), primaryLookupAttributeName_List, dataset_standardFalse2)
reference: PySpark create mapping from a dict
Source: https://stackoverflow.com/questions/61823544/pyspark-mapping-multiple-columns