I have a PySpark DataFrame:
Example:
text                         | name      | original_name
-----------------------------|-----------|--------------
HELLOWORLD2019THISISGOOGLE   | WORLD2019 | WORLD_2019
NATUREISVERYGOODFOROURHEALTH | null      | null
THESUNCONTAINVITAMIND        | VITAMIND  | VITAMIN_D
BECARETOOURHEALTHISVITAMIND  | OURHEALTH | OUR_ / HEALTH
One simple solution is to use a join between the original DataFrame and a derived DataFrame containing the search columns (name together with original_name). Since the join condition can be satisfied by multiple rows, we group by the original columns after the join and collect the matches into a list.
Here is a detailed example for your input:
from pyspark.sql.functions import col, struct, collect_list

data = [
("HELLOWORLD2019THISISGOOGLE", "WORLD2019", "WORLD_2019"),
("NATUREISVERYGOODFOROURHEALTH", None, None),
("THESUNCONTAINVITAMIND", "VITAMIND", "VITAMIN_D"),
("BECARETOOURHEALTHISVITAMIND", "OURHEALTH", "OUR_ / HEALTH")
]
df = spark.createDataFrame(data, ["text", "name", "original_name"])
# create a new DF with the search words
# original_name is what we want in the final list, so we select it too (wrapped in a struct)
search_df = df.select(struct(col("name"), col("original_name")).alias("search_match"))
# join on df.text contains search_df.name
df_join = df.join(search_df, df.text.contains(search_df["search_match.name"]), "left")
# group by original columns and collect matches in a list
df_join.groupBy("text", "name", "original_name") \
    .agg(collect_list(col("search_match.original_name")).alias("new_column")) \
    .show(truncate=False)
Output:
+----------------------------+---------+-------------+--------------------------+
|text |name |original_name|new_column |
+----------------------------+---------+-------------+--------------------------+
|HELLOWORLD2019THISISGOOGLE |WORLD2019|WORLD_2019 |[WORLD_2019] |
|THESUNCONTAINVITAMIND |VITAMIND |VITAMIN_D |[VITAMIN_D] |
|NATUREISVERYGOODFOROURHEALTH|null |null |[OUR_ / HEALTH] |
|BECARETOOURHEALTHISVITAMIND |OURHEALTH|OUR_ / HEALTH|[VITAMIN_D, OUR_ / HEALTH]|
+----------------------------+---------+-------------+--------------------------+
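Two follow-up notes. Rows whose text contains none of the names are still kept by the left join and simply end up with an empty list, because collect_list ignores the null produced by the unmatched row. And if your real data can yield the same match more than once (e.g. the same name appears in several source rows), you can swap collect_list for collect_set to deduplicate; a minimal sketch, reusing df_join from above:
from pyspark.sql.functions import collect_set

# same grouping as before; collect_set drops duplicate matches
# (like collect_list, it also ignores nulls from unmatched left-join rows)
df_join.groupBy("text", "name", "original_name") \
    .agg(collect_set(col("search_match.original_name")).alias("new_column")) \
    .show(truncate=False)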