How to loop through pandas df column, finding if string contains any string from a separate pandas df column?

旧时难觅i 2021-01-29 00:46

I have two pandas DataFrames in Python. DF A contains a column of what are basically sentence-length strings.

2 Answers
  • 2021-01-29 01:29

    You can iterate over the rows of a DataFrame with the iterrows() method. You can try this:

    import pandas as pd

    # Dataframes definition
    df_1 = pd.DataFrame({"sentence": ["this is from france and spain", "this is from france", "this is from germany"], "other": [15, 12, 33]})
    df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})
    
    
    # Create the new dataframe. Note that "other_column" is filled with the
    # "other" values from df_1, not with the "other_column" values from df_2.
    df_3 = pd.DataFrame(columns=["sentence", "other_column", "country"])
    count = 0
    
    # Iterate through the dataframes, first through the country dataframe and inside through the sentence one.
    for _, row in df_2.iterrows():
        country = row.country
    
        for _, row_2 in df_1.iterrows():
            # Plain substring test: the country name may appear anywhere in the sentence
            if country in row_2.sentence:
                df_3.loc[count] = (row_2.sentence, row_2.other, country)
                count += 1
    

    So the output is:

    sentence                            other_column    country
    0   this is from france and spain   15              spain
    1   this is from france and spain   15              france
    2   this is from france             12              france
    3   this is from germany            33              germany
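
    A variation on the loop above, as a rough sketch: collect the matches in a plain list and build the DataFrame once at the end, since enlarging a DataFrame row by row with .loc tends to get expensive as it grows. The names records and df_3 are only illustrative, and the third column is called "other" here (an assumption on my part) because it holds the values from df_1.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "sentence": ["this is from france and spain", "this is from france", "this is from germany"],
    "other": [15, 12, 33],
})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})

# Collect one dict per (country, sentence) match instead of writing to a DataFrame in the loop.
records = []
for country in df_2["country"]:
    for row in df_1.itertuples(index=False):
        if country in row.sentence:  # same plain substring test as in the loop above
            records.append({"sentence": row.sentence, "other": row.other, "country": country})

# Build the result in one step from the collected records.
df_3 = pd.DataFrame(records, columns=["sentence", "other", "country"])
print(df_3)
```

    This should give the same four rows as the output above, in the same order.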
    
  • 2021-01-29 01:31

    There's no need for a loop here. Looping over a dataframe row by row is slow, and pandas and NumPy provide optimized, vectorized methods for almost all problems of this kind.

    In this case, for your first problem, you are looking for Series.str.extract:

    dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")
    
               sentenceCol  other column country
    0  this is from france            15  france
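
    One caveat with building the pattern by a plain '|'.join: a country name can match inside a longer word, and any regex metacharacters in the names are interpreted as regex. A hedged sketch of a stricter pattern, using re.escape and \b word boundaries; the sample rows here ("nigeria" / "niger") are made up for illustration and are not from the question:

```python
import re
import pandas as pd

# Hypothetical sample data, chosen so that "niger" is a substring of "nigeria".
dfa = pd.DataFrame({"sentenceCol": ["this is from france", "a trip to nigeria"],
                    "other column": [15, 20]})
dfb = pd.DataFrame({"country": ["france", "niger"]})

# Escape each name, then require word boundaries on both sides of the match.
pattern = r"\b(" + "|".join(map(re.escape, dfb["country"])) + r")\b"

# expand=False returns a Series for a single capture group; rows without a match get NaN.
dfa["country"] = dfa["sentenceCol"].str.extract(pattern, expand=False)
print(dfa)
```

    Note that str.extract only returns the first match per row, which is why the second problem needs extractall.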
    

    For your second problem, you need Series.str.extractall, followed by DataFrame.drop_duplicates and to_numpy:

    dfa['country'] = (
        dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
            .drop_duplicates()
            .to_numpy()
    )
    
                         sentenceCol  other column country
    0  this is from france and spain            15  france
    1  this is from france and spain            15   spain
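
    To see why drop_duplicates and to_numpy are needed here, it may help to look at what extractall returns on its own. A small sketch using the same dataframes as in this answer; the variable name matches is just for illustration:

```python
import pandas as pd

dfa = pd.DataFrame({"sentenceCol": ["this is from france and spain"] * 2,
                    "other column": [15] * 2})
dfb = pd.DataFrame({"country": ["france", "spain"]})

matches = dfa["sentenceCol"].str.extractall(f"({'|'.join(dfb['country'])})")

# One row per match, indexed by (original row label, "match" number).
print(matches)
print(list(matches.index.names))

# After dropping duplicates, two rows remain, which happens to equal len(dfa).
print(matches.drop_duplicates())
```

    As far as I can tell, the positional assignment via to_numpy only lines up because the number of distinct matches equals the number of rows in dfa; with other data the lengths would differ and the assignment would fail.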
    

    Edit

    Or, if the rows of sentenceCol are not duplicated, the extracted values have to be combined into a single row. For that we use GroupBy.agg:

    dfa['country'] = (
        dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
            .groupby(level=0)
            .agg(', '.join)
            .to_numpy()
    )
    
                         sentenceCol  other column        country
    0  this is from france and spain            15  france, spain
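
    If some sentences contain no country at all, extractall simply has no rows for them, so a positional to_numpy assignment would no longer match the length of dfa. A possible workaround, sketched under the assumption that NaN is acceptable for unmatched rows: assign the aggregated Series directly so pandas aligns it on the index. The "italy" row is invented for the example.

```python
import pandas as pd

# Hypothetical data: the second sentence mentions no country from dfb.
dfa = pd.DataFrame({"sentenceCol": ["this is from france and spain", "this is from italy"],
                    "other column": [15, 9]})
dfb = pd.DataFrame({"country": ["france", "spain"]})

pattern = f"({'|'.join(dfb['country'])})"

# Select the single capture-group column (named 0) and join matches per original row.
joined = (
    dfa["sentenceCol"].str.extractall(pattern)[0]
        .groupby(level=0)
        .agg(", ".join)
)

# Assigning the Series (not .to_numpy()) aligns on the index,
# so the row without a match receives NaN.
dfa["country"] = joined
print(dfa)
```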
    

    Edit2

    To duplicate the original rows, we join the dataframe back to our extraction:

    extraction = (
        dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
            .rename(columns={0: 'country'})
    )
    
    dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)
    
      country                    sentenceCol  other column
    0  france  this is from france and spain            15
    1   spain  this is from france and spain            15
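
    An alternative way to get one row per match, offered as a sketch rather than a replacement for the join above: Series.str.findall collects all matches per row into a list, and DataFrame.explode turns each list element into its own row. The merge at the end, which pulls in dfb's other columns, assumes dfb has an "other_column" as in the first answer's df_2.

```python
import pandas as pd

dfa = pd.DataFrame({"sentenceCol": ["this is from france and spain", "this is from germany"],
                    "other column": [15, 33]})
# Assumed shape of dfb, borrowed from the other answer's df_2.
dfb = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})

pattern = "|".join(dfb["country"])

# findall yields a list of matches per sentence; explode gives each match its own row.
exploded = (
    dfa.assign(country=dfa["sentenceCol"].str.findall(pattern))
       .explode("country")
       .reset_index(drop=True)
)

# Bring in the remaining dfb columns for each matched country.
result = exploded.merge(dfb, on="country", how="left")
print(result)
```

    Matches come out in the order they appear in each sentence, so the first sentence yields france and then spain.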
    

    Dataframes used:

    import pandas as pd
    
    dfa = pd.DataFrame({'sentenceCol':['this is from france and spain']*2,
                       'other column':[15]*2})
    
    dfb = pd.DataFrame({'country':['france', 'spain']})
    