Question
My question is related to my previous one: How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks.
I have worked out part of it, but am now stuck on another problem.
I have a small pyspark dataframe like :
df1:
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|topic| termIndices| termWeights| terms|
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| 0| [3, 155, 108, 67, 239, 4, 72, 326, 128, 189]|[0.023463344607734377, 0.011772322769900843, 0....|[cell, apoptosis, uptake, loss, transcription, ...|
| 1| [16, 8, 161, 86, 368, 153, 18, 214, 21, 222]|[0.013057307487199429, 0.011453455929929763, 0....|[therapy, cancer, diet, lung, marker, sensitivi...|
| 2| [0, 1, 124, 29, 7, 2, 84, 299, 22, 90]|[0.03979063871841061, 0.026593954837078836, 0.0...|[group, expression, performance, use, disease, ...|
| 3| [204, 146, 74, 240, 152, 384, 55, 250, 238, 92]|[0.009305626056223443, 0.008840730657888991, 0....|[pattern, chemotherapy, mass, the amount, targe...|
It has fewer than 100 rows and is very small. Each term has a termWeight value in the "termWeights" column.
I have another large pyspark dataframe (50+ GB) like:
df2:
+------+--------------------------------------------------+
|r_id| tokens|
+------+--------------------------------------------------+
| 0|[The human KCNJ9, Kir, GIRK3, member, potassium...|
| 1|[BACKGROUND, the treatment, breast, cancer, the...|
| 2|[OBJECTIVE, the relationship, preoperative atri...|
For each row in df2, I need to find the best matching terms in df1, i.e. the terms with the highest termWeights among all topics.
Finally, I need a df like:
r_id | tokens | topic (the topic in df1 that has the highest sum of termWeights among all topics)
I have defined a UDF (based on df2), but it cannot access the columns of df1. I have thought about using a "cross join" of df1 and df2, but I do not need to join each row of df2 with each row of df1. I only need to keep all columns of df2 and add one column, "topic", holding the topic whose terms (from df1) match the tokens of that df2 row with the highest sum of termWeights.
I am not sure how to implement this logic with pyspark.sql.functions.udf.
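For reference, a tiny stand-in for the two dataframes (made-up values; termIndices omitted since it does not take part in the matching) can be built like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy stand-in for df1: one row per topic with its terms and termWeights (hypothetical values)
df1 = spark.createDataFrame(
    [(0, [0.02, 0.01], ["cell", "apoptosis"]),
     (1, [0.013, 0.011], ["therapy", "cancer"])],
    ["topic", "termWeights", "terms"])

# toy stand-in for df2: one row per document with its tokens (hypothetical values)
df2 = spark.createDataFrame(
    [(0, ["The", "cell", "cancer"]),
     (1, ["BACKGROUND", "the", "treatment"])],
    ["r_id", "tokens"])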
Answer 1:
IIUC, you can try something like the following (I split the processing flow into 4 steps; Spark 2.4+ is required):
Step-1: convert all df2.tokens to lowercase so we can do text comparison:
from pyspark.sql.functions import expr, desc, row_number, broadcast
df2 = df2.withColumn('tokens', expr("transform(tokens, x -> lower(x))"))
Step-2: left-join df2 with df1 using arrays_overlap
df3 = df2.join(broadcast(df1), expr("arrays_overlap(terms, tokens)"), "left")
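Side note: arrays_overlap(a, b) is true when the two arrays share at least one non-null element, which is what makes it usable as the join condition. A quick illustration with made-up literal arrays:
# illustration only, with hypothetical literal arrays
spark.sql("SELECT arrays_overlap(array('cell','cancer'), array('the','cell','line')) AS overlaps").show()
# +--------+
# |overlaps|
# +--------+
# |    true|
# +--------+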
Step-3: use the aggregate function to calculate matched_sum_of_weights from terms, termWeights and tokens:
df4 = df3.selectExpr(
    "r_id",
    "tokens",
    "topic",
    """
      aggregate(
        /* find all terms+termWeights which are shown in tokens array */
        filter(arrays_zip(terms, termWeights), x -> array_contains(tokens, x.terms)),
        0D,
        /* get the sum of all termWeights from the matched terms */
        (acc, y) -> acc + y.termWeights
      ) as matched_sum_of_weights
    """)
Step-4: for each r_id, find the row with the highest matched_sum_of_weights using a Window function, and keep only the rows having row_number == 1:
from pyspark.sql import Window
w1 = Window.partitionBy('r_id').orderBy(desc('matched_sum_of_weights'))
df_new = df4.withColumn('rn', row_number().over(w1)).filter('rn=1').drop('rn', 'matched_sum_of_weights')
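As a quick sanity check on the toy df1/df2 sketched in the question: r_id 0 should end up with topic 0, since its matched weight 0.02 (cell) beats topic 1's 0.011 (cancer), while r_id 1 has no overlapping terms and keeps a null topic from the left join.
df_new.show(truncate=False)
# expected on the toy data: r_id 0 -> topic 0, r_id 1 -> topic null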
Alternative: if the size of df1 is not very large, this can be handled without the join/Window partitioning. The code below only outlines the idea; you should adjust it based on your actual data:
from pyspark.sql.functions import expr, when, coalesce, array_contains, lit, struct
# create a dict from df1 with topic as key and list of termWeights+terms as value
d = df1.selectExpr("string(topic)", "arrays_zip(termWeights,terms) as terms").rdd.collectAsMap()
# skip this if the text comparison is case-sensitive; you might do the same to df1 as well
df2 = df2.withColumn('tokens', expr("transform(tokens, x -> lower(x))"))
# save the column names of the original df2
cols = df2.columns
# iterate through all items of d (or df1) and add one new column to df2 per topic, whose value is
# a struct containing `sum_of_weights`, `topic` and `has_match` (whether any of the terms matched)
for x, y in d.items():
    df2 = df2.withColumn(x,
        struct(
            sum([when(array_contains('tokens', t.terms), t.termWeights).otherwise(0) for t in y]).alias('sum_of_weights'),
            lit(x).alias('topic'),
            coalesce(*[when(array_contains('tokens', t.terms), 1) for t in y]).isNotNull().alias('has_match')
        )
    )
# create a new array containing all new columns (topics), and find array_max
# from items with `has_match == true`, and then retrieve the `topic` field
df_new = df2.selectExpr(
    *cols,
    f"array_max(filter(array({','.join(map('`{}`'.format, d.keys()))}), x -> x.has_match)).topic as topic"
)
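For reference, on the toy df1 sketched in the question, the collected dict d would look roughly like the structure shown in the comments below, and df_new keeps the original df2 columns plus the new topic column:
print(d)
# e.g. {'0': [Row(termWeights=0.02, terms='cell'), Row(termWeights=0.01, terms='apoptosis')],
#       '1': [Row(termWeights=0.013, terms='therapy'), Row(termWeights=0.011, terms='cancer')]}
df_new.show(truncate=False)
# expected columns: r_id, tokens, topic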
Source: https://stackoverflow.com/questions/63769895/perform-a-user-defined-function-on-a-column-of-a-large-pyspark-dataframe-based-o