Merging rows with shared information

故里飘歌 2021-01-07 04:18

I have a data.frame in which several rows, produced by a merge, were not completely merged:

b <- read.table(text = "
      ID   Age    Steatosis

4 Answers
  • 2021-01-07 04:26

    Here is a base R method that should work, for a version of the data that you provided:

    aggregate(b[-grep("^(ID|Age)$", names(b))], b[c("ID", "Age")], 
              FUN=function(x) if(all(is.na(x))) NA else x[!is.na(x)][1])
    
         ID Age Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
    1 HA-09  16      <33% no/occasional         NA       5             1
    

    It uses aggregate together with an if/else check, which returns the first non-missing element of each column within a group, or NA if every value is missing. I take the first element because, once the all-NA case is excluded, there is at least one non-missing observation. To select the last non-missing element instead, replace the [1] index with sum(!is.na(x)), the number of non-missing values.
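
    To take the last non-missing value rather than the first, a small helper can make the intent explicit. The following is only a sketch: `last_non_na` is a hypothetical name that does not appear in the answer, and the toy data frame `d` is made up for illustration.

```r
# Hypothetical helper: last non-missing element of x, or NA if none exists
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0) NA else keep[length(keep)]
}

# Toy data (illustrative only, not the data from the question)
d <- data.frame(ID = c("a", "a", "b"),
                v1 = c(1, 2, NA),
                v2 = c(NA, "x", NA),
                stringsAsFactors = FALSE)

# Collapse to one row per ID, keeping the last non-missing value per column
res <- aggregate(d[-1], d["ID"], FUN = last_non_na)
```

    Indexing the filtered vector by its own length avoids counting the missing values, which is why the helper does not use length(x) on the unfiltered input.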

    As suggested by @jdobres in a comment to another answer, it would be possible to use paste with the collapse argument to combine multiple non-missing elements. This would, of course, convert the vector to character, which may not be desirable if the variable is numeric.
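
    A minimal base R sketch of that paste/collapse idea might look as follows. The helper name `collapse_non_na`, the "-" separator, the use of unique() to drop repeated values, and the toy data are all assumptions made for illustration, not part of the original suggestion.

```r
# Hypothetical helper: join all distinct non-missing values into one string
collapse_non_na <- function(x) {
  keep <- unique(x[!is.na(x)])
  if (length(keep) == 0) NA_character_ else paste(keep, collapse = "-")
}

# Toy data (illustrative only): ID "a" has two conflicting values
d <- data.frame(ID = c("a", "a", "b"),
                grade = c("low", "high", NA),
                stringsAsFactors = FALSE)

# One row per ID; conflicting values are kept side by side in a string
res <- aggregate(d["grade"], d["ID"], FUN = collapse_non_na)
```

    Here `res$grade` is a character column, which illustrates the type conversion mentioned above.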

    Note: I edited my original answer to include "Age" in the key, thanks to @sebastian-c for pointing this out.


    If "Age" is not part of the key, then

    aggregate(b[-grep("^(ID)$", names(b))], b["ID"], 
              FUN=function(x) if(all(is.na(x))) NA else x[!is.na(x)][1])
    

    will work.

    data

    b <- read.table(text = "
          ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
    68 HA-09   16   NA          NA       NA       5             NA
    69 HA-09   16   <33% no/occasional     NA      NA             1")
    
  • 2021-01-07 04:27

    A dplyr approach using summarise_all:

    ## using `na.strings` to identify NA entries in posted data
    b <- read.table(text = "
          ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
    68 HA-09   16   <NA>          <NA>       <NA>       5             NA
    69 HA-09   16   <33% no/occasional       <NA>      NA             1", na.strings = c("NA", "<NA>"))
    
    library(dplyr)
    f <- function(x) {
      x <- na.omit(x)
      if (length(x) > 0) first(x) else NA
    }
    res <- b %>% group_by(ID, Age) %>% summarise_all(f)
    ##Source: local data frame [1 x 7]
    ##Groups: ID [?]
    ##
    ##      ID   Age Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
    ##  <fctr> <int>    <fctr>        <fctr>      <lgl>   <int>         <int>
    ##1  HA-09    16      <33% no/occasional         NA       5             1
    

    The function f is defined this way to handle the case where all values are NA, in which case na.omit leaves an empty vector.


    As @jdobres suggests, if a column has more than one non-NA value per group and you want to merge them, you can flatten them into a single string:

    library(dplyr)
    f <- function(x) {
      x <- na.omit(x)
      if (length(x) > 0) paste(x,collapse='-') else NA
    }
    res <- b %>% group_by(ID, Age) %>% summarise_all(f)
    

    In your posted data, the result would be the same as above because every summarized column has at most one non-NA value.
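
    With dplyr 1.0.0 or later, the same grouping idea can also be written with across() instead of the scoped summarise_all verb. This is a sketch under that version assumption, not the answer's original code: `first_non_na` is a hypothetical helper name, and the data frame uses only a subset of the posted columns.

```r
library(dplyr)

# Hypothetical helper: first non-missing element of x, or NA if none exists
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}

# Subset of the posted data, built directly as a data frame
b <- data.frame(ID = c("HA-09", "HA-09"),
                Age = c(16L, 16L),
                Steatosis = c(NA, "<33%"),
                Lille_3 = c(5L, NA),
                stringsAsFactors = FALSE)

# across(everything(), ...) applies the helper to every non-grouping column
res <- b %>%
  group_by(ID, Age) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")
```

    Swapping `first_non_na` for a paste-based function gives the collapsed-string variant in the same way.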

  • 2021-01-07 04:37

    While I'm sure that it's possible with dplyr or tidyr, here's a data.table solution:

    b <- read.table(text = "
          ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
                    68 HA-09   16   <NA>          <NA>       <NA>       5             NA
                    69 HA-09   16   <33% no/occasional       <NA>      NA             1",
                    na.strings = c("NA", "<NA>"))
    
    keycols <- c("ID", "Age")
    library(data.table)
    b_dt <- data.table(b)
    
    filter_nas <- function(x){
      if(all(is.na(x))){
        return(unique(x))
      }
      return(unique(x[!is.na(x)]))
    }
    
    b_dt[, lapply(.SD, filter_nas), by = keycols]
    
    
          ID Age Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
    1: HA-09  16      <33% no/occasional         NA       5             1
    

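    Note that this yields one row per key only if each column holds at most one distinct non-NA value within each ID/Age group, because filter_nas returns every distinct non-NA value.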

  • 2021-01-07 04:46

    Llopis's request to keep both rows if a given ID has different information for a column complicates matters. First let's create some example data that illustrates the situation:

    b <- read.table(text = "ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
                    HA-09   16   <NA>          <NA>       <NA>       5             NA
                    HA-09   16   <33% no/occasional       <NA>      NA             1
                    HA-10   20   no <NA> <NA> 2 NA
                    HA-10   20   yes <NA> 0 NA NA",
                    na.strings = c("NA", "<NA>"), header = TRUE)
    
         ID Age Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
    1 HA-09  16      <NA>          <NA>         NA       5            NA
    2 HA-09  16      <33% no/occasional         NA      NA             1
    3 HA-10  20        no          <NA>         NA       2            NA
    4 HA-10  20       yes          <NA>          0      NA            NA
    

    This can still be accomplished, but the custom function for summarization (let's call it f) gets a little more complicated:

    f <- function(x) {
        x <- x[!is.na(x$value),]
        if (nrow(x) > 0) {
            y <- unique(x[colnames(x) != 'row.ID'])
            y$row.ID <- 1:nrow(y)
            return(y)
        } else {
            return(data.frame())
        }
    }
    

    Notice that this function references a column called "row.ID", which we will create before applying the function:

    library(tidyverse) # gives access to dplyr and tidyr packages
    
    b2 <- gather(b, variable, value, -ID, -Age) %>% # gather the many columns into a simplified key/value pair of columns (one called 'variable', the other, 'value') for each ID
        group_by(ID, variable) %>% # perform subsequent operations per ID and variable
        mutate(row.ID = 1:n()) %>% # add a row identifier
        do(f(.)) %>% # apply our custom function
        spread(variable, value, convert = TRUE) %>% # un-gather the variable/value columns
        ungroup() # remove grouping metadata
    
          ID   Age row.ID Bili.AHHS2cat Lille_3 Lille_dico       Mallory Steatosis
    * <fctr> <int>  <int>         <int>   <int>      <int>         <chr>     <chr>
    1  HA-09    16      1             1       5         NA no/occasional      <33%
    2  HA-10    20      1            NA       2          0          <NA>        no
    3  HA-10    20      2            NA      NA         NA          <NA>       yes
    