Merging rows with shared information


I have a data.frame in which several rows come from a merge but were not completely merged:

b <- read.table(text = "
      ID   Age    Steatosis


                      
4 answers

Answer by 一生所求, 2021-01-07 04:26

Here is a base R method that should work, for a version of the data that you provided:

aggregate(b[-grep("^(ID|Age)$", names(b))], b[c("ID", "Age")], 
          FUN=function(x) if(all(is.na(x))) NA else x[!is.na(x)][1])

     ID Age Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
1 HA-09  16      <33% no/occasional         NA       5             1


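It uses aggregate together with an if/else check. Within each group, the function returns NA if every value is missing; otherwise it returns the first non-missing element, which is guaranteed to exist in that branch. To select the last non-missing element instead, use tail(x[!is.na(x)], 1). Note that simply replacing the 1 in the code with length(x) would not work, because length(x) also counts the missing values.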
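
To make the selection rule concrete, here is a small sketch that writes the anonymous function as a named helper, next to a last-non-missing variant. The names first_non_na and last_non_na are illustrative only and do not appear in the original answer.

```r
# Hypothetical helper names, introduced only for illustration.
first_non_na <- function(x) if (all(is.na(x))) NA else x[!is.na(x)][1]
last_non_na  <- function(x) if (all(is.na(x))) NA else tail(x[!is.na(x)], 1)

x <- c(NA, 5, NA, 7, NA)
first_non_na(x)          # first non-missing value
last_non_na(x)           # last non-missing value
first_non_na(c(NA, NA))  # everything is missing, so NA is returned
```

Either helper can be passed as the FUN argument of aggregate in place of the anonymous function.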

As suggested by @jdobres in a comment to another answer, it would be possible to use paste with the collapse argument to combine multiple non-missing elements. This would, of course, convert the vector to character, which may not be desirable if the variable is numeric.
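
A minimal sketch of that paste/collapse idea with aggregate might look like the following. The toy data frame d and the helper name collapse_non_na are made up for illustration, and the "-" separator is an arbitrary choice.

```r
# Toy data (invented for illustration): ID "A" has two distinct
# non-missing values in `grade`, ID "B" has only missing values.
d <- data.frame(ID    = c("A", "A", "B"),
                grade = c("low", "high", NA),
                score = c(NA, 3, NA),
                stringsAsFactors = FALSE)

# Hypothetical helper: paste the distinct non-missing values together,
# or return NA when the whole group is missing.
collapse_non_na <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else paste(x, collapse = "-")
}

res <- aggregate(d[c("grade", "score")], d["ID"], FUN = collapse_non_na)
res
```

As the paragraph above warns, the numeric column score comes back as character here.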

Note: I edited my original answer to include "Age" in the key, thanks to @sebastian-c for pointing this out.



If "Age" is not part of the key, then

aggregate(b[-grep("^(ID)$", names(b))], b["ID"], 
          FUN=function(x) if(all(is.na(x))) NA else x[!is.na(x)][1])


will work.

data

b <- read.table(text = "
      ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09   16   NA          NA       NA       5             NA
69 HA-09   16   <33% no/occasional     NA      NA             1")

                                                                        
                                                        
            
            
              
                
Answer by 灰色年华, 2021-01-07 04:27

A dplyr approach using summarise_all:

## using `na.strings` to identify NA entries in posted data
b <- read.table(text = "
      ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09   16   <NA>          <NA>       <NA>       5             NA
69 HA-09   16   <33% no/occasional       <NA>      NA             1", na.strings = c("NA", "<NA>"))

library(dplyr)
f <- function(x) {
  x <- na.omit(x)
  if (length(x) > 0) first(x) else NA
}
# funs() is deprecated in current dplyr; pass the function directly
res <- b %>% group_by(ID, Age) %>% summarise_all(f)
##Source: local data frame [1 x 7]
##Groups: ID [?]
##
##      ID   Age Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
##  <fctr> <int>    <fctr>        <fctr>      <lgl>   <int>         <int>
##1  HA-09    16      <33% no/occasional         NA       5             1


The function is defined this way to handle the case where all values in a group are NA: na.omit then leaves an empty vector, and the function returns NA instead.
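
In dplyr 1.0.0 and later, summarise_all is superseded by across. The following is a sketch of the same idea under that assumption; it is a rewrite for illustration, not the code from the original answer, and it uses plain subsetting in place of na.omit and first.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across() and .groups

b <- read.table(text = "
      ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09   16   <NA>          <NA>       <NA>       5             NA
69 HA-09   16   <33% no/occasional       <NA>      NA             1",
  na.strings = c("NA", "<NA>"), stringsAsFactors = FALSE)

# First non-missing value of x, or NA when the whole group is missing
f <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}

res <- b %>%
  group_by(ID, Age) %>%
  summarise(across(everything(), f), .groups = "drop")
res
```

Inside summarise, everything() selects only the non-grouping columns, so ID and Age are left untouched as keys.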



As @jdobres suggests, if there is more than one non-NA value per column that you want to merge, you may want to flatten them into a single string using:

library(dplyr)
f <- function(x) {
  x <- na.omit(x)
  if (length(x) > 0) paste(x,collapse='-') else NA
}
# funs() is deprecated in current dplyr; pass the function directly
res <- b %>% group_by(ID, Age) %>% summarise_all(f)


In your posted data, the result would be the same as above because each summarized column has at most one non-NA value.
                                                                        
                                                        
            
            
              
                
Answer by 栀梦, 2021-01-07 04:37

While I'm sure that it's possible with dplyr or tidyr, here's a data.table solution:

b <- read.table(text = "
      ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
                68 HA-09   16   <NA>          <NA>       <NA>       5             NA
                69 HA-09   16   <33% no/occasional       <NA>      NA             1",
                na.strings = c("NA", "<NA>"))

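keycols <- c("ID", "Age")
library(data.table)
b_dt <- as.data.table(b)

# Return the distinct non-NA values of x, or a single NA if x is all NA
filter_nas <- function(x) {
  if (all(is.na(x))) {
    return(unique(x))
  }
  unique(x[!is.na(x)])
}

# Apply filter_nas to every non-key column within each ID/Age group
b_dt[, lapply(.SD, filter_nas), by = keycols]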


      ID Age Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
1: HA-09  16      <33% no/occasional         NA       5             1


Note that this only yields one row per key when each column has at most one distinct non-NA value within a key group.
                                                                        
                                                        
            
            
              
                
Answer by 梦谈多话, 2021-01-07 04:46

Llopis's request to keep both rows if a given ID has different information for a column complicates matters. First let's create some example data that illustrates the situation:

b <- read.table(text = "ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
                HA-09   16   <NA>          <NA>       <NA>       5             NA
                HA-09   16   <33% no/occasional       <NA>      NA             1
                HA-10   20   no <NA> <NA> 2 NA
                HA-10   20   yes <NA> 0 NA NA",
                na.strings = c("NA", "<NA>"), header = TRUE)

     ID Age Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
1 HA-09  16      <NA>          <NA>         NA       5            NA
2 HA-09  16      <33% no/occasional         NA      NA             1
3 HA-10  20        no          <NA>         NA       2            NA
4 HA-10  20       yes          <NA>          0      NA            NA


This can still be accomplished, but the custom function for summarization (let's call it f) gets a little more complicated:

f <- function(x) {
    x <- x[!is.na(x$value),]
    if (nrow(x) > 0) {
        y <- unique(x[colnames(x) != 'row.ID'])
        y$row.ID <- 1:nrow(y)
        return(y)
    } else {
        return(data.frame())
    }
}


Notice that this function references a column called "row.ID", which we will create before applying the function:

library(tidyverse) # gives access to dplyr and tidyr packages

b2 <- gather(b, variable, value, -ID, -Age) %>%  # one row per ID, variable and value
    group_by(ID, variable) %>%                   # operate per ID and variable
    mutate(row.ID = 1:n()) %>%                   # add a row identifier
    do(f(.)) %>%                                 # apply our custom function
    spread(variable, value, convert = TRUE) %>%  # un-gather the variable/value columns
    ungroup()                                    # remove grouping metadata

      ID   Age row.ID Bili.AHHS2cat Lille_3 Lille_dico       Mallory Steatosis
* <fctr> <int>  <int>         <int>   <int>      <int>         <chr>     <chr>
1  HA-09    16      1             1       5         NA no/occasional      <33%
2  HA-10    20      1            NA       2          0          <NA>        no
3  HA-10    20      2            NA      NA         NA          <NA>       yes
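
In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The sketch below attempts the same reshaping with those functions, assuming a reasonably recent tidyr (one that supports values_transform) and dplyr. It is an illustrative rewrite rather than the original author's code, and unlike spread(convert = TRUE) it leaves all value columns as character.

```r
library(dplyr)
library(tidyr)

b <- read.table(text = "ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
                HA-09   16   <NA>          <NA>       <NA>       5             NA
                HA-09   16   <33% no/occasional       <NA>      NA             1
                HA-10   20   no <NA> <NA> 2 NA
                HA-10   20   yes <NA> 0 NA NA",
                na.strings = c("NA", "<NA>"), header = TRUE,
                stringsAsFactors = FALSE)

b2 <- b %>%
  # long format; values are cast to character so mixed column types can be stacked
  pivot_longer(-c(ID, Age), names_to = "variable", values_to = "value",
               values_transform = list(value = as.character),
               values_drop_na = TRUE) %>%
  distinct() %>%                           # drop repeated ID/variable/value rows
  group_by(ID, variable) %>%
  mutate(row.ID = row_number()) %>%        # number the distinct values per variable
  ungroup() %>%
  pivot_wider(names_from = variable, values_from = value)
b2
```

HA-09 collapses to a single row, while HA-10 keeps two rows because Steatosis has two different values for that ID.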

                                                                        
                                                        
            
            
              
                