I know this is a duplicate Q but I can\'t seem to find the post again
Using the following data
df <- data.frame(A=c(1,1,2,2),B=c(NA,2,NA,4),C=c(3,
We can use fill
to fill all the missing values. And then filter just one row for each group.
library(dplyr)
library(tidyr)
df2 <- df %>%
group_by(A) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1)
I haven't figured out how to put the coalesce_by_column
function inside the dplyr
pipeline, but this works:
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
df %>%
group_by(A) %>%
summarise_all(coalesce_by_column)
## A B C D E
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 3 2 5
## 2 2 4 5 3 4
Edit: include @Jon Harmon's solution for more than 2 members of a group
# Supply lists by splicing them into dots:
coalesce_by_column <- function(df) {
return(dplyr::coalesce(!!! as.list(df)))
}
df %>%
group_by(A) %>%
summarise_all(coalesce_by_column)
#> # A tibble: 2 x 5
#> A B C D E
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 2 5
#> 2 2 4 5 3 4
Not tidyverse
but here's one base R solution
df <- data.frame(A=c(1,1),B=c(NA,2),C=c(3,NA),D=c(NA,2),E=c(5,NA))
sapply(df, function(x) x[!is.na(x)][1])
#A B C D E
#1 2 3 2 5
With updated data
do.call(rbind, lapply(split(df, df$A), function(a) sapply(a, function(x) x[!is.na(x)][1])))
# A B C D E
#1 1 2 3 2 5
#2 2 4 5 3 4
A different tidyverse
possibility could be:
df %>%
gather(var, val, -A, na.rm = TRUE) %>%
group_by(A, var) %>%
distinct(val) %>%
spread(var, val)
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 2 5
2 2 4 5 3 4
Here it, first, performs a wide-to-long data-transformation, excluding the "A" column and removing the missing values. Second, it groups by "A" column and the variable names. Third, it removes the duplicate values. Finally, it returns the data to its original wide format.
Here is an even more general solution (using unique
, na.omit
to sort of create coalesce
), which can handle more than two rows with overlapping information. Super simply and forward.
> df <- data.frame(A=c(1,1,2,2,2),B=c(NA,2,NA,4,4),C=c(3,NA,NA,5,NA),D=c(NA,2,3,NA,NA),E=c(5,NA,NA,4,4))
> df
A B C D E
1 1 NA 3 NA 5
2 1 2 NA 2 NA
3 2 NA NA 3 NA
4 2 4 5 NA 4
5 2 4 NA NA 4
> df %>% group_by(A) %>% summarise_all(funs( na.omit(unique(.)) ))
# A tibble: 2 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 2 5
2 2 4 5 3 4
This is functionally identical to @Oriol Mirosa's answer without requiring a custom function:
EDIT: NAs must be omitted as per @thelatemail's comment. This answer was also given by @MrFlick in the duplicate thread linked above.
df %>% group_by(A) %>% summarise_all(~first(na.omit(.)))
I wanted to add to this as it seems to come up regularly for me and I've revisited this thread many times. @Oriol Mirosa's answer works, however I'm resistant to it because it's just complex enough to be difficult to remember (hence my return to this thread).
Personally, I also don't like writing small custom functions like if I don't need to. Attempting to substitute coalesce_by_column
with the actual coalesce
call results in type errors (which I find strange as the rows aren't interacting with each other but whatever). This can be resolved by first doing mutate_all(as.character)
, however my goal here is to minimize syntax so it's easily remembered on the fly.
Furthermore, this substitution changes the behavior such that non-identical values within a column throws an error (why things sometimes behave slightly differently within a function is beyond me). This behavior may be preferred in some situations, however in that case I would recommend @Jerry T's solution as there is no custom function and the ones used are familiar, readable, and the ordering of them (na.omit
and unique
) isn't relevant.