Select only the first row when merging data frames with multiple matches

后端未结

关注

 4  785

I have two data frames, \"data\" and \"scores\", and want to merge them on the \"id\" column:

data = data.frame(id = c(1,2,3,4,5),
                  state =


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  走了就别回头了        
                
              
                            
                2020-12-03 08:30
              
            
            
                                                                       
Here is another method using dplyr::distinct. It is useful if you want to keep all rows from 'data' even if there is no match.

data = data.frame(id=c(1,2,3,4,5),
                  state=c("KS","MN","AL","FL","CA"))
scores = data.frame(id=c(1,1,1,2,2,3,3,3),
                    score=c(66,75,78,86,85,76,75,90))
data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T))
# Joining, by = "id"
# id state score
# 1  1    KS    66
# 2  2    MN    86
# 3  3    AL    76
# 4  4    FL    NA
# 5  5    CA    NA


Moreover, if you want to replace the NAs in the new data.frame, try the tidyr::replace_na() function. Example:

data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T)) %>% tidyr::replace_na(replace = list("score"=0L))
# Joining, by = "id"
# id state score
# 1  1    KS    66
# 2  2    MN    86
# 3  3    AL    76
# 4  4    FL     0
# 5  5    CA     0

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  心在旅途        
                
              
                            
                2020-12-03 08:41
              
            
            
                                                                       
In base you can use match to Select only the first row when merging data frames with multiple matches.

#Return also those which found no match
(tt <- cbind(data, score=scores[match(data$id, scores$id),"score"]))
#  id state score
#1  1    KS    66
#2  2    MN    86
#3  3    AL    76
#4  4    FL    NA
#5  5    CA    NA

#Return only those which found a match
tt[!is.na(tt$score),]
#  id state score
#1  1    KS    66
#2  2    MN    86
#3  3    AL    76

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  执笔经年        
                
              
                            
                2020-12-03 08:51
              
            
            
                                                                       
Using data.table along with mult = "first" and nomatch = 0L:

require(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference

scores[data, mult = "first", on = "id", nomatch=0L]
#    id score state
# 1:  1    66    KS
# 2:  2    86    MN
# 3:  3    76    AL


For each row on data's id column, the matching rows in scores' id column are found, and the first one alone is retained (because mult = "first"). If there are no matches, they're removed (because of nomatch = 0L). 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  眼角桃花        
                
              
                            
                2020-12-03 08:53
              
            
            
                                                                       
Here is a base R method using aggregate and head:

merge(data, aggregate(score ~ id, data=scores, head, 1), by="id") 


The aggregate function breaks up the scores dataframe by id, then head is applied to get the first observation from each id. Since aggregate returns a data.frame, this is directly merged onto the data.frame data.



Probably more efficient is to subset the scores data.frame using duplicated which will achieve the same result as aggregate, but will reduce the computational overhead.

merge(data, scores[!duplicated(scores$id),], by="id")

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复