How to loop through pandas df column, finding if string contains any string from a separate pandas df column?

旧时难觅i 2021-01-29 00:46

I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.

2 Answers
  •  孤城傲影
    2021-01-29 01:31

    There's no need for a loop here. Looping over a DataFrame is slow, and pandas and numpy provide optimized methods for almost all problems of this kind.

    In this case, for your first problem, you are looking for Series.str.extract:

    dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")
    
               sentenceCol  other column country
    0  this is from france            15  france
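
    One caveat with the pattern above: the joined names are interpreted as a regular expression, so a name containing characters such as "." or "(" would not be matched literally. The sketch below is one possible way to guard against that, using re.escape on each name. The sample rows (including "u.s.") are hypothetical and not from the question, and expand=False is used only so that str.extract returns a Series.

```python
import re

import pandas as pd

# Hypothetical sample data, not taken from the original question.
dfa = pd.DataFrame({"sentenceCol": ["this is from france",
                                    "shipped to u.s. today",
                                    "no country here"]})
dfb = pd.DataFrame({"country": ["france", "u.s."]})

# Escape each name so regex metacharacters such as "." match literally.
pattern = "(" + "|".join(re.escape(name) for name in dfb["country"]) + ")"

# expand=False returns a Series; rows without a match get NaN.
dfa["country"] = dfa["sentenceCol"].str.extract(pattern, expand=False)
print(dfa)
```

    Rows that mention none of the names end up with NaN in the new column rather than raising an error.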
    

    For your second problem, you need Series.str.extractall, followed by DataFrame.drop_duplicates and to_numpy on the result:

    dfa['country'] = (
        dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
            .drop_duplicates()
            .to_numpy()
    )
    
                         sentenceCol  other column country
    0  this is from france and spain            15  france
    1  this is from france and spain            15   spain
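
    To see why drop_duplicates matters here, it may help to look at what str.extractall returns before any post-processing. The sketch below uses the same dataframes as this answer and only inspects the intermediate result.

```python
import pandas as pd

dfa = pd.DataFrame({"sentenceCol": ["this is from france and spain"] * 2,
                    "other column": [15] * 2})
dfb = pd.DataFrame({"country": ["france", "spain"]})

pattern = f"({'|'.join(dfb['country'])})"
matches = dfa["sentenceCol"].str.extractall(pattern)

# The index has two levels: the original row label and a "match" counter.
print(matches.index.names)

# Each of the two identical sentences yields both countries: four rows in total.
print(matches)

# Dropping duplicate values leaves two rows, which is the length of dfa.
print(matches.drop_duplicates())
```

    The assignment in this answer works because the deduplicated result happens to have as many rows as dfa. With other data the lengths may differ, in which case the assignment would fail.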
    

    Edit

    Or, if the rows of your sentenceCol are not duplicated, we have to collect the extracted values into a single row per sentence. We use GroupBy.agg:

    dfa['country'] = (
        dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
            .groupby(level=0)
            .agg(', '.join)
            .to_numpy()
    )
    
                         sentenceCol  other column        country
    0  this is from france and spain            15  france, spain
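
    If some sentences contain no country at all, the grouped result has fewer rows than dfa, and assigning it through to_numpy would no longer line up. A possible variation, sketched below under that assumption, assigns the grouped Series directly so that pandas aligns it on the index. The second sentence is hypothetical data added for illustration.

```python
import pandas as pd

# Hypothetical data: the second sentence mentions no country from dfb.
dfa = pd.DataFrame({"sentenceCol": ["this is from france and spain",
                                    "nothing to see here"],
                    "other column": [15, 20]})
dfb = pd.DataFrame({"country": ["france", "spain"]})

pattern = f"({'|'.join(dfb['country'])})"

# Column 0 holds the single capture group; group on the original row label.
joined = (
    dfa["sentenceCol"].str.extractall(pattern)[0]
        .groupby(level=0)
        .agg(", ".join)
)

# Assigning a Series aligns on dfa's index, so unmatched rows become NaN.
dfa["country"] = joined
print(dfa)
```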
    

    Edit2

    To duplicate the original rows, we join the dataframe back to our extraction:

    extraction = (
        dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
            .rename(columns={0: 'country'})
    )
    
    dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)
    
      country                    sentenceCol  other column
    0  france  this is from france and spain            15
    1   spain  this is from france and spain            15
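
    A different route to the same one-row-per-country shape, offered only as an alternative sketch and not as part of the approach above, is Series.str.findall followed by DataFrame.explode. It starts from a single, non-duplicated sentence row and keeps the original column order.

```python
import pandas as pd

dfa = pd.DataFrame({"sentenceCol": ["this is from france and spain"],
                    "other column": [15]})
dfb = pd.DataFrame({"country": ["france", "spain"]})

# No capture group here: str.findall returns a list of full matches per row.
pattern = "|".join(dfb["country"])

result = (
    dfa.assign(country=dfa["sentenceCol"].str.findall(pattern))
       .explode("country")          # one output row per list element
       .reset_index(drop=True)
)
print(result)
```

    The other columns are repeated for every matched country, so no join is needed afterwards.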
    

    Dataframes used:

    import pandas as pd

    dfa = pd.DataFrame({'sentenceCol':['this is from france and spain']*2,
                       'other column':[15]*2})
    
    dfb = pd.DataFrame({'country':['france', 'spain']})
    
