Extract string from text in PySpark

清歌不尽 2021-01-06 23:31

I have a PySpark DataFrame:

Example:

text                  |   name   |   original_name 
------------------         


        
1 answer
  • 2021-01-06 23:45

    One simple solution is to join the original DataFrame with a derived DataFrame that holds only the search terms (the name column together with its original_name). Because several search terms can satisfy the join condition for a single text, group by the original columns after the join and collect the matches into a list.

    Here is a detailed example for your input:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, collect_list, struct

    # reuse the active session, or create one when running as a script
    spark = SparkSession.builder.getOrCreate()

    data = [
        ("HELLOWORLD2019THISISGOOGLE", "WORLD2019", "WORLD_2019"),
        ("NATUREISVERYGOODFOROURHEALTH", None, None),
        ("THESUNCONTAINVITAMIND", "VITAMIND", "VITAMIN_D"),
        ("BECARETOOURHEALTHISVITAMIND", "OURHEALTH", "OUR_ / HEALTH")
    ]
    df = spark.createDataFrame(data, ["text", "name", "original_name"])

    # create a new DF with the search words
    # original_name is what we want in the final list, so select it too
    search_df = df.select(struct(col("name"), col("original_name")).alias("search_match"))

    # left join: keep every row of df, matching where df.text contains search_df.name
    df_join = df.join(search_df, df.text.contains(search_df["search_match.name"]), "left")

    # group by the original columns and collect the matches in a list
    df_join.groupBy("text", "name", "original_name")\
        .agg(collect_list(col("search_match.original_name")).alias("new_column"))\
        .show(truncate=False)
    

    Output (row order and the order of elements inside each list are not guaranteed):

    +----------------------------+---------+-------------+--------------------------+
    |text                        |name     |original_name|new_column                |
    +----------------------------+---------+-------------+--------------------------+
    |HELLOWORLD2019THISISGOOGLE  |WORLD2019|WORLD_2019   |[WORLD_2019]              |
    |THESUNCONTAINVITAMIND       |VITAMIND |VITAMIN_D    |[VITAMIN_D]               |
    |NATUREISVERYGOODFOROURHEALTH|null     |null         |[OUR_ / HEALTH]           |
    |BECARETOOURHEALTHISVITAMIND |OURHEALTH|OUR_ / HEALTH|[VITAMIN_D, OUR_ / HEALTH]|
    +----------------------------+---------+-------------+--------------------------+
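To sanity-check the result without a Spark session, the join-and-collect logic above can be modelled in plain Python. This is only a rough sketch: the helper name `expected_matches` is made up for illustration, and the lists are sorted here for a deterministic result, which `collect_list` itself does not promise.

```python
# Same sample rows as in the answer: (text, name, original_name)
data = [
    ("HELLOWORLD2019THISISGOOGLE", "WORLD2019", "WORLD_2019"),
    ("NATUREISVERYGOODFOROURHEALTH", None, None),
    ("THESUNCONTAINVITAMIND", "VITAMIND", "VITAMIN_D"),
    ("BECARETOOURHEALTHISVITAMIND", "OURHEALTH", "OUR_ / HEALTH"),
]


def expected_matches(rows):
    """Model 'left join on text contains name' followed by collect_list.

    Hypothetical helper, not part of PySpark.
    """
    # Search terms; rows with a null name never match anything,
    # mirroring Spark where contains(NULL) evaluates to NULL.
    search = [(name, original) for _, name, original in rows if name is not None]

    result = {}
    for row in rows:
        text = row[0]
        # Collect original_name for every search term found inside the text.
        # Sorted only to make the result deterministic in this sketch.
        result[row] = sorted(original for name, original in search if name in text)
    return result


if __name__ == "__main__":
    for row, matches in expected_matches(data).items():
        print(row[0], matches)
```

Each key of the returned dict is one original row, so the mapping corresponds to grouping by text, name and original_name in the Spark version.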
    