How to loop through pandas df column, finding if string contains any string from a separate pandas df column?


I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.

2 Answers

清歌不尽
2021-01-29 01:29

You can iterate through a dataframe with the method iterrows(). You can try this:

import pandas as pd

# Dataframes definition
df_1 = pd.DataFrame({"sentence": ["this is from france and spain", "this is from france", "this is from germany"], "other": [15, 12, 33]})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})


# Create the new dataframe
df_3 = pd.DataFrame(columns=["sentence", "other_column", "country"])
count = 0

# Iterate over the countries (outer loop) and, for each one, over the sentences (inner loop).
for _, row in df_2.iterrows():
    country = row.country

    for _, row_2 in df_1.iterrows():
        # Substring test: keep the pair when the country name occurs in the sentence.
        if country in row_2.sentence:
            # "other_column" in df_3 holds the value from df_1's "other" column.
            df_3.loc[count] = (row_2.sentence, row_2.other, country)
            count += 1


So the output is:

    sentence                        other_column    country
0   this is from france and spain   15              spain
1   this is from france and spain   15              france
2   this is from france             12              france
3   this is from germany            33              germany
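
If the nested loop becomes a bottleneck, the same pairing can be sketched without iterrows(). This is only one possible alternative, not part of the answer above: it assumes pandas 1.2 or later (for how="cross") and keeps df_1's original column name "other" instead of renaming it.

```python
import pandas as pd

df_1 = pd.DataFrame({"sentence": ["this is from france and spain", "this is from france", "this is from germany"], "other": [15, 12, 33]})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})

# Pair every country with every sentence (country-major order, like the nested loop).
pairs = df_2[["country"]].merge(df_1, how="cross")

# Keep only the pairs where the country occurs as a substring of the sentence.
mask = [country in sentence for country, sentence in zip(pairs["country"], pairs["sentence"])]
df_3 = pairs.loc[mask, ["sentence", "other", "country"]].reset_index(drop=True)
```

The cross join materializes len(df_1) * len(df_2) rows, so this trades memory for speed and is best suited to small lookup tables.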

                                                                        
                                                        
            
            
              
                
孤城傲影
2021-01-29 01:31

There's no need for a loop here. Looping over a dataframe is slow, and pandas and NumPy provide optimized, vectorized methods for almost all problems like this one.

In this case, for your first problem, you are looking for Series.str.extract:

dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")

           sentenceCol  other column country
0  this is from france            15  france
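
One caveat with building the pattern by plain joining is that any regex metacharacter in a country name would be interpreted as regex syntax. A minimal sketch of a more defensive variant, assuming hypothetical sample data (the "u.s.a" entry and the unmatched sentence are made up for illustration), escapes each name and asks for a Series back with expand=False:

```python
import re

import pandas as pd

# Hypothetical sample data, including a name with regex metacharacters and a sentence with no match.
dfa = pd.DataFrame({"sentenceCol": ["this is from france", "made in u.s.a", "no match here"]})
dfb = pd.DataFrame({"country": ["france", "u.s.a"]})

# Escape each name so characters such as "." are matched literally.
pattern = "(" + "|".join(map(re.escape, dfb["country"])) + ")"

# expand=False returns a Series for a single capture group; unmatched rows become NaN.
dfa["country"] = dfa["sentenceCol"].str.extract(pattern, expand=False)
```

Note that str.extract only returns the first match in each sentence.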




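For your second problem, you need Series.str.extractall with DataFrame.drop_duplicates & to_numpy. This works here because the sentence is repeated once per country, so the number of distinct matches equals the number of rows: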

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .drop_duplicates()
        .to_numpy()
)

                     sentenceCol  other column country
0  this is from france and spain            15  france
1  this is from france and spain            15   spain




Edit

Or, if the sentences in sentenceCol are not duplicated, we have to collapse the extracted values into a single row per sentence. We use GroupBy.agg:

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .groupby(level=0)
        .agg(', '.join)
        .to_numpy()
)

                     sentenceCol  other column        country
0  this is from france and spain            15  france, spain
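
Assigning through to_numpy() requires every sentence to contain at least one match, otherwise the lengths differ. A hedged sketch of a variant that relies on index alignment instead, using hypothetical sample data with one unmatched sentence, could look like this:

```python
import pandas as pd

# Hypothetical sample data: the second sentence contains no country at all.
dfa = pd.DataFrame({"sentenceCol": ["this is from france and spain", "nothing relevant", "this is from spain"],
                    "other column": [15, 20, 30]})
dfb = pd.DataFrame({"country": ["france", "spain"]})

pattern = f"({'|'.join(dfb['country'])})"

# extractall returns one row per match; column 0 holds the single capture group.
matches = dfa["sentenceCol"].str.extractall(pattern)[0]

# Group by the original row label and join the matches; assigning the resulting
# Series aligns on the index, so rows without a match are left as NaN.
dfa["country"] = matches.groupby(level=0).agg(", ".join)
```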




Edit2

To duplicate the original rows instead, we join the dataframe back to our extraction:

extraction = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .rename(columns={0: 'country'})
)

dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)

  country                    sentenceCol  other column
0  france  this is from france and spain            15
1   spain  this is from france and spain            15
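
An alternative way to get one row per match, sketched here under the assumption of pandas 0.25 or later (for DataFrame.explode) and with hypothetical sample data, is to collect all matches per sentence with Series.str.findall and then explode the resulting lists. Unlike the join above, this keeps sentences that have no match, with NaN in the country column:

```python
import pandas as pd

# Hypothetical sample data: the second sentence matches none of the countries in dfb.
dfa = pd.DataFrame({"sentenceCol": ["this is from france and spain", "this is from germany"],
                    "other column": [15, 33]})
dfb = pd.DataFrame({"country": ["france", "spain"]})

pattern = "|".join(dfb["country"])

result = (
    dfa.assign(country=dfa["sentenceCol"].str.findall(pattern))  # list of matches per row
       .explode("country")                                       # one row per list element
       .reset_index(drop=True)
)
```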




Dataframes used:

import pandas as pd

dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain']*2,
                    'other column': [15]*2})

dfb = pd.DataFrame({'country': ['france', 'spain']})

                                                                        
                                                        
            
            
              
                