Question
I need to be able to compare two dataframes using multiple columns.
PySpark attempt
I decided to filter the reference dataframe by one level (comparing reference_df.PrimaryLookupAttributeName to df1.LeaseStatus).
How can I iterate over primaryLookupAttributeName_List and avoid hardcoding LeaseStatus?
Next, I get the PrimaryLookupAttributeValue values from the reference table into a list, compare them to df1, and output a new dataframe with the found/matched values.
I hardcoded FOUND because I'm not sure how to make it print the corresponding matched value from primaryAttributeValue_List.
I tried unpacking with *primaryAttributeValue_List, but got TypeError: when() takes 2 positional arguments but 34 were given
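That TypeError comes from the unpacking itself: when() accepts exactly two positional arguments (a condition and a value), so *primaryAttributeValue_List passes every list element as a separate argument. A minimal plain-Python sketch of the same failure (the when stub here is hypothetical, standing in for pyspark.sql.functions.when, and the list is a toy subset):

```python
# Stand-in with the same 2-positional-argument shape as pyspark.sql.functions.when
def when(condition, value):
    return (condition, value)

primaryAttributeValue_List = ['Archive', 'Open', 'Draft Returned']  # 34 items in the real case

try:
    # Unpacking passes each element as its own positional argument
    when(*primaryAttributeValue_List)
except TypeError as e:
    print(e)  # when() takes 2 positional arguments but 3 were given
```

With 34 values in the real list, the message reads "... but 34 were given".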
#list of Attribute names to compare and match to filter reference table & df1
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']
#filter reference table
AttributeLookup = filterDomainItemLookUp2.where((filterDomainItemLookUp2.PrimaryLookupAttributeName == "LeaseStatus"))
display(AttributeLookup)
# get PrimaryLookupAttributeValue values from reference table in a dictionary to compare them to df1.
primaryAttributeValue_List = [ p.PrimaryLookupAttributeValue for p in AttributeLookup.select('PrimaryLookupAttributeValue').distinct().collect() ]
primaryAttributeValue_List # list of values; varies by filter
Out: ['Archive',
'Pending Security Deposit',
'Partially Abandoned',
'Revision Contract Review',
'Open',
'Draft Accounting In Review',
'Draft Returned']
# compare df1 to PrimaryLookupAttributeValue
output = dataset_standardFalse2.withColumn('ConformedLeaseStatusName', f.when(dataset_standardFalse2['LeaseStatus'].isin(primaryAttributeValue_List), "FOUND").otherwise("TBD"))
display(output)
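The lookup being asked for can be sketched in plain Python (hypothetical toy rows, not the actual dataframes): build a dict keyed by (attribute name, attribute value) from the reference rows, then look each df1 cell up in it instead of emitting a hardcoded FOUND:

```python
# Hypothetical toy versions of the two tables
reference_rows = [
    ("LeaseStatus", "Archive", "Expired"),
    ("LeaseStatus", "Terminated", "Terminated"),
    ("LeaseRecoveryType", "Gross", "Gross"),
]
df1_rows = [
    {"LeaseStatus": "Archive", "LeaseRecoveryType": "Gross"},
    {"LeaseStatus": "Open", "LeaseRecoveryType": "Net"},
]
attribute_names = ["LeaseStatus", "LeaseRecoveryType"]

# (PrimaryLookupAttributeName, PrimaryLookupAttributeValue) -> OutputItemNameByValue
lookup = {(name, value): output for name, value, output in reference_rows}

for row in df1_rows:
    for attr in attribute_names:  # iterate the list; no hardcoded 'LeaseStatus'
        row["Matched[{}]".format(attr)] = lookup.get((attr, row[attr]), "TBD")

print(df1_rows[0]["Matched[LeaseStatus]"])  # Expired
```

The answer below does the same thing inside Spark with create_map, so the lookup runs on the executors instead of in driver-side Python.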
reference_df
+--------------+----------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-------------------+-----------------------+---------------------------+-----------------------------+-------------------+---------------------+---------------------+
|SourceSystemId|SourceSystemName| Portfolio|DomainId| DomainName| PrimaryLookupEntity|PrimaryLookupAttributeName|SecondaryLookupAttributeName|StandardDomainMapId|StandardDomainItemMapId|PrimaryLookupAttributeValue|SecondaryLookupAttributeValue|OutputItemIdByValue|OutputItemCodeByValue|OutputItemNameByValue|
+--------------+----------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-------------------+-----------------------+---------------------------+-----------------------------+-------------------+---------------------+---------------------+
| 4| ABC123|ALL_PORTFOLIOS| 100022|LeaseRecoveryType|ABC123_FF_Leases.csv| LeaseRecoveryType | | 9| 329| Gross-modified| | 15| | Modified Gross|
| 4| ABC123|ALL_PORTFOLIOS| 100022|LeaseRecoveryType|ABC123_FF_Leases.csv| LeaseRecoveryType| | 9| 330| Gross | | 11| | Gross|
| 4| ABC123|ALL_PORTFOLIOS| 100022|LeaseRecoveryType|ABC123_FF_Leases.csv| LeaseRecoveryType| | 9| 331| Gross w/base year| | 18| | Modified Gross|
| 4| ABC123|ALL_PORTFOLIOS| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | 10| 1872| Abandoned| | 10| | Active|
| 4| ABC123|ALL_PORTFOLIOS| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | 10| 332| Terminated| | 10| | Terminated|
| 4| ABC123|ALL_PORTFOLIOS| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | 10| 1873| Archive| | 11| | Expired|
| 4| ABC123|ALL_PORTFOLIOS| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | 10| 333| Draft | | 10| | Pending|
+--------------+----------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-------------------+-----------------------+---------------------------+-----------------------------+-------------------+---------------------+---------------------+
df1
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+
|SourceSystemName|       Portfolio|SourceLeaseID|SourcePropertyID|LeaseStatus|LeaseRecoveryType|    LeaseType| PortfolioRule|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+
|          ABC123|HumABC Portfolio|         1814|            1865| Terminated|            Gross|Expense Lease|ALL_PORTFOLIOS|
|          ABC123|HumABC Portfolio|         1508|            1866|    Archive|Gross w/base year|Expense Lease|ALL_PORTFOLIOS|
|          ABC123|HumABC Portfolio|         1826|            1875|     Draft |   Gross-modified|Expense Lease|ALL_PORTFOLIOS|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+
expected outcome
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
|SourceSystemName|       Portfolio|SourceLeaseID|SourcePropertyID|LeaseStatus|LeaseRecoveryType|    LeaseType| PortfolioRule|Matched[LeaseStatus]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
|          ABC123|HumABC Portfolio|         1814|            1865| Terminated|            Gross|Expense Lease|ALL_PORTFOLIOS|                               Terminated|                                          Gross|
|          ABC123|HumABC Portfolio|         1508|            1866|    Archive|Gross w/base year|Expense Lease|ALL_PORTFOLIOS|                                  Expired|                                 Modified Gross|
|          ABC123|HumABC Portfolio|         1826|            1875|     Draft |   Gross-modified|Expense Lease|ALL_PORTFOLIOS|                                  Pending|                                 Modified Gross|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
outcome w/ pySpark attempt
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
|SourceSystemName|       Portfolio|SourceLeaseID|SourcePropertyID|LeaseStatus|LeaseRecoveryType|    LeaseType| PortfolioRule|Matched[LeaseStatus]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
|          ABC123|HumABC Portfolio|         1814|            1865| Terminated|            Gross|Expense Lease|ALL_PORTFOLIOS|                                    FOUND|                                            TBD|
|          ABC123|HumABC Portfolio|         1508|            1866|    Archive|Gross w/base year|Expense Lease|ALL_PORTFOLIOS|                                    FOUND|                                            TBD|
|          ABC123|HumABC Portfolio|         1826|            1875|     Draft |   Gross-modified|Expense Lease|ALL_PORTFOLIOS|                                    FOUND|                                            TBD|
+----------------+----------------+-------------+----------------+-----------+-----------------+-------------+--------------+-----------------------------------------+-----------------------------------------------+
Answer 1:
From my understanding, you can create a map based on columns from reference_df (I assumed this is not a very big dataframe):
map_key = concat_ws('\0', PrimaryLookupAttributeName, PrimaryLookupAttributeValue)
map_value = OutputItemNameByValue
and then use this mapping to get the corresponding values in df1:
from itertools import chain
from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map
d = reference_df.agg(collect_set(array(concat_ws('\0','PrimaryLookupAttributeName','PrimaryLookupAttributeValue'), 'OutputItemNameByValue')).alias('m')).first().m
#[['LeaseStatus\x00Abandoned', 'Active'],
# ['LeaseRecoveryType\x00Gross-modified', 'Modified Gross'],
# ['LeaseStatus\x00Archive', 'Expired'],
# ['LeaseStatus\x00Terminated', 'Terminated'],
# ['LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'],
# ['LeaseStatus\x00Draft', 'Pending'],
# ['LeaseRecoveryType\x00Gross', 'Gross']]
mappings = create_map([lit(i) for i in chain.from_iterable(d)])
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']
df1.select("*", *[ mappings[concat_ws('\0', lit(c), col(c))].alias("Matched[{}]OutputItemNameByValue".format(c)) for c in primaryLookupAttributeName_List ]).show()
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|SourceSystemName|...|Matched[LeaseType]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|Matched[LeaseStatus]OutputItemNameByValue|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
| ABC123|...| null| Gross| Terminated|
| ABC123|...| null| Modified Gross| Expired|
| ABC123|...| null| Modified Gross| Pending|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
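create_map expects a flat, alternating key/value sequence, and chain.from_iterable is what turns the collected [key, value] pairs into that shape. In plain Python (a toy subset mirroring the collected d above):

```python
from itertools import chain

# Collected [key, value] pairs, in the same shape as `d` above (toy subset)
d = [['LeaseStatus\x00Archive', 'Expired'],
     ['LeaseRecoveryType\x00Gross', 'Gross']]

# Flatten the pairs into the alternating key1, value1, key2, value2, ... list
# that create_map consumes
flat = list(chain.from_iterable(d))
print(flat)
# ['LeaseStatus\x00Archive', 'Expired', 'LeaseRecoveryType\x00Gross', 'Gross']
```

Each element is then wrapped in lit() so create_map builds a literal map column.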
UPDATE: to set column names using information retrieved through the reference_df dataframe:
# a list of domains to retrieve
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']
# mapping from domain names to column names: using `reference_df`.`TargetAttributeForName`
NEWprimaryLookupAttributeName_List = dict(reference_df.filter(reference_df['DomainName'].isin(primaryLookupAttributeName_List)).agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)
test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in NEWprimaryLookupAttributeName_List.items()])
Note-1: it is better to loop through primaryLookupAttributeName_List so the order of the columns is preserved, and in case any entry in primaryLookupAttributeName_List is missing from the dictionary, we can set a default column name, e.g. Unknown-<col>. In the old method, columns with missing entries are simply discarded.
test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List])
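The dict.get(c, default) fallback in Note-1 behaves like this on its own (toy mapping; the real one is built from reference_df):

```python
# Toy domain-name -> target-column mapping; the real one comes from reference_df
NEWprimaryLookupAttributeName_List = {'LeaseStatus': 'ConformedLeaseStatusName'}
primaryLookupAttributeName_List = ['LeaseType', 'LeaseStatus']

# Missing entries fall back to a visible default instead of being dropped
names = [NEWprimaryLookupAttributeName_List.get(c, "Unknown-{}".format(c))
         for c in primaryLookupAttributeName_List]
print(names)  # ['Unknown-LeaseType', 'ConformedLeaseStatusName']
```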
Note-2: per comments, to overwrite the existing column names (untested):
(1) use select:
test = dataset_standardFalse2.select([c for c in dataset_standardFalse2.columns if c not in NEWprimaryLookupAttributeName_List.values()] + [ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List]).show()
(2) use reduce (not recommended if the List is very long):
from functools import reduce
df_new = reduce(lambda d, c: d.withColumn(NEWprimaryLookupAttributeName_List.get(c, "Unknown-{}".format(c)), mappings[concat_ws('\0', lit(c), col(c))]), primaryLookupAttributeName_List, dataset_standardFalse2)
reference: PySpark create mapping from a dict
Source: https://stackoverflow.com/questions/61823544/pyspark-mapping-multiple-columns