Combine rows in a data frame containing NA to make a complete row

不思量自难忘° 2020-11-29 10:17

I know this is a duplicate question, but I can't seem to find the post again.

Using the following data

df <- data.frame(A=c(1,1,2,2),B=c(NA,2,NA,4),C=c(3,         


        
6 answers
  • 2020-11-29 10:42

    We can use fill from tidyr to fill the missing values within each group, first downward and then upward, so that every row in a group becomes complete. Then we keep just one row per group.

    library(dplyr)
    library(tidyr)
    
    df2 <- df %>%
      group_by(A) %>%
      fill(everything(), .direction = "down") %>%
      fill(everything(), .direction = "up") %>%
      slice(1)
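
On tidyr 1.0.0 or later, the two fill calls can probably be collapsed into one with .direction = "downup". A minimal sketch, assuming sample data that matches the outputs shown in this thread (the question's own data is truncated above, so this df is an assumption):

```r
library(dplyr)
library(tidyr)

# Assumed sample data, chosen to be consistent with the outputs in this thread.
df <- data.frame(A = c(1, 1, 2, 2),
                 B = c(NA, 2, NA, 4),
                 C = c(3, NA, NA, 5),
                 D = c(NA, 2, 3, NA),
                 E = c(5, NA, NA, 4))

df2 <- df %>%
  group_by(A) %>%
  # "downup" fills downward first, then upward, within each group.
  fill(everything(), .direction = "downup") %>%
  slice(1) %>%
  ungroup()

df2
```

Note that this keeps the first row of each group after filling, so if a group holds conflicting non-missing values in a column, the one that survives depends on row order.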
    
  • 2020-11-29 10:53

    I haven't figured out how to put the coalesce_by_column function inside the dplyr pipeline, but this works:

    coalesce_by_column <- function(df) {
      return(coalesce(df[1], df[2]))
    }
    
    df %>%
      group_by(A) %>%
      summarise_all(coalesce_by_column)
    
    ##       A     B     C     D     E
    ##   <dbl> <dbl> <dbl> <dbl> <dbl>
    ## 1     1     2     3     2     5
    ## 2     2     4     5     3     4
    

    Edit: include @Jon Harmon's solution for more than 2 members of a group

    # Supply lists by splicing them into dots:
    coalesce_by_column <- function(df) {
      return(dplyr::coalesce(!!! as.list(df)))
    }
    
    df %>%
      group_by(A) %>%
      summarise_all(coalesce_by_column)
    
    #> # A tibble: 2 x 5
    #>       A     B     C     D     E
    #>   <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1     1     2     3     2     5
    #> 2     2     4     5     3     4
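
To see what the splice does on its own, here is a minimal sketch on a bare vector. as.list turns the column into one argument per element, so coalesce returns the first non-missing one. The vector x is made up for illustration and is not part of the thread's data:

```r
library(dplyr)

# Hypothetical column values for one group (an assumption for illustration).
x <- c(NA, 2, 4)

# !!! splices the list into separate arguments, i.e. coalesce(NA, 2, 4).
first_non_missing <- coalesce(!!!as.list(x))
first_non_missing
```

Because each element becomes its own argument, this works for any number of rows in a group, which is why it generalizes the two-row version above.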
    
  • 2020-11-29 10:56

    Not tidyverse, but here's a base R solution:

    df <- data.frame(A=c(1,1),B=c(NA,2),C=c(3,NA),D=c(NA,2),E=c(5,NA))
    sapply(df, function(x) x[!is.na(x)][1])
    #A B C D E 
    #1 2 3 2 5 
    

    With updated data

    do.call(rbind, lapply(split(df, df$A), function(a) sapply(a, function(x) x[!is.na(x)][1])))
    #  A B C D E
    #1 1 2 3 2 5
    #2 2 4 5 3 4
    
  • 2020-11-29 10:57

    A different tidyverse possibility could be:

    df %>%
     gather(var, val, -A, na.rm = TRUE) %>%
     group_by(A, var) %>%
     distinct(val) %>%
     spread(var, val)
    
          A     B     C     D     E
      <dbl> <dbl> <dbl> <dbl> <dbl>
    1     1     2     3     2     5
    2     2     4     5     3     4
    

    This first reshapes the data from wide to long, excluding column A and dropping the missing values. Second, it groups by A and the variable names. Third, it removes duplicate values. Finally, it spreads the data back to its original wide format.
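
In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The same four steps could be sketched as follows, again assuming sample data consistent with the outputs in this thread (the df below is an assumption, since the question's data is truncated):

```r
library(dplyr)
library(tidyr)

# Assumed sample data, chosen to be consistent with the outputs in this thread.
df <- data.frame(A = c(1, 1, 2, 2),
                 B = c(NA, 2, NA, 4),
                 C = c(3, NA, NA, 5),
                 D = c(NA, 2, 3, NA),
                 E = c(5, NA, NA, 4))

res <- df %>%
  # Wide to long, keeping A as the key and dropping missing values.
  pivot_longer(-A, names_to = "var", values_to = "val",
               values_drop_na = TRUE) %>%
  # Remove duplicate values within each A/var combination.
  distinct(A, var, val) %>%
  # Back to wide format.
  pivot_wider(names_from = var, values_from = val) %>%
  # Restore the original column order.
  select(all_of(names(df)))

res
```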

  • 2020-11-29 11:03

    Here is a more general solution that uses unique and na.omit as a makeshift coalesce, so it can handle more than two rows with overlapping information. It is simple and straightforward.

    > df <- data.frame(A=c(1,1,2,2,2),B=c(NA,2,NA,4,4),C=c(3,NA,NA,5,NA),D=c(NA,2,3,NA,NA),E=c(5,NA,NA,4,4))
    
    > df
      A  B  C  D  E
    1 1 NA  3 NA  5
    2 1  2 NA  2 NA
    3 2 NA NA  3 NA
    4 2  4  5 NA  4
    5 2  4 NA NA  4
    
    > df %>% group_by(A) %>% summarise_all(~ na.omit(unique(.)))
    # A tibble: 2 x 5
          A     B     C     D     E
      <dbl> <dbl> <dbl> <dbl> <dbl>
    1     1     2     3     2     5
    2     2     4     5     3     4
    
  • 2020-11-29 11:05

    This is functionally identical to @Oriol Mirosa's answer without requiring a custom function:

    EDIT: NAs must be omitted as per @thelatemail's comment. This answer was also given by @MrFlick in the duplicate thread linked above.

    df %>% group_by(A) %>% summarise_all(~first(na.omit(.)))
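
Since dplyr 1.0.0, summarise_all is superseded by across, so the same one-liner could also be written as below. This is a sketch on assumed sample data consistent with the outputs in this thread, not the question's exact (truncated) data:

```r
library(dplyr)

# Assumed sample data, chosen to be consistent with the outputs in this thread.
df <- data.frame(A = c(1, 1, 2, 2),
                 B = c(NA, 2, NA, 4),
                 C = c(3, NA, NA, 5),
                 D = c(NA, 2, 3, NA),
                 E = c(5, NA, NA, 4))

res <- df %>%
  group_by(A) %>%
  # For every non-grouping column, take the first non-missing value per group.
  summarise(across(everything(), ~ first(na.omit(.x))))

res
```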
    

    I wanted to add to this because it comes up regularly for me and I've revisited this thread many times. @Oriol Mirosa's answer works; however, I'm resistant to it because it's just complex enough to be difficult to remember (hence my return to this thread).

    Personally, I also don't like writing small custom functions if I don't need to. Attempting to substitute coalesce_by_column with the actual coalesce call results in type errors (which I find strange, as the rows aren't interacting with each other, but whatever). This can be resolved by first doing mutate_all(as.character); however, my goal here is to minimize syntax so it's easily remembered on the fly.

    Furthermore, this substitution changes the behavior such that non-identical values within a column throw an error (why things sometimes behave slightly differently within a function is beyond me). This behavior may be preferred in some situations; in that case I would recommend @Jerry T's solution, as there is no custom function, the functions used are familiar and readable, and their ordering (na.omit and unique) isn't relevant.
