I have a data.frame in which several rows come from a merge and were not completely merged:
b <- read.table(text = "
ID Age Steatosis
Here is a base R method that should work, for a version of the data that you provided:
aggregate(b[-grep("^(ID|Age)$", names(b))], b[c("ID", "Age")],
          FUN = function(x) if (all(is.na(x))) NA else x[!is.na(x)][1])
ID Age Steatosis Mallory Lille_dico Lille_3 Bili.AHHS2cat
1 HA-09 16 <33% no/occasional NA 5 1
It uses aggregate together with an if/else check. This returns the first non-missing element if one exists, and NA otherwise. Taking the first element is safe because, once the all-NA case is ruled out, there is at least one non-missing observation. To select the last non-missing element instead, replace x[!is.na(x)][1] with tail(x[!is.na(x)], 1).
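The choice between the first and the last non-missing value can be sketched with two small helpers. The names first_non_na and last_non_na and the toy data frame d are invented for illustration and do not appear in the answer above. The second helper takes tail() of the filtered vector, because indexing the filtered vector with the length of the unfiltered one would run past its end whenever NAs are present.

```r
# Hypothetical helpers; the names are illustrative only
first_non_na <- function(x) if (all(is.na(x))) NA else x[!is.na(x)][1]
last_non_na  <- function(x) if (all(is.na(x))) NA else tail(x[!is.na(x)], 1)

# Toy data invented for this sketch
d <- data.frame(ID = c("a", "a", "a", "b"),
                v  = c(NA, 1, 2, NA))

# One row per ID; group "b" has no non-missing value and stays NA
first_res <- aggregate(d["v"], d["ID"], FUN = first_non_na)
last_res  <- aggregate(d["v"], d["ID"], FUN = last_non_na)
```

Here first_res holds 1 for ID "a" and last_res holds 2, while ID "b" is NA in both.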
As suggested by @jdobres in a comment on another answer, it would be possible to use paste with the collapse argument to combine multiple non-missing elements. This would, of course, convert the vector to character, which may not be desirable if the variable is numeric.
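A minimal sketch of that paste variant, again with aggregate. The helper name collapse_non_na, the "-" separator, and the toy data are assumptions made for illustration.

```r
# Hypothetical helper: join all non-missing values of a group into one string
collapse_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else paste(x, collapse = "-")
}

# Toy data invented for this sketch
d <- data.frame(ID = c("a", "a", "b", "b"),
                score = c(5, 7, NA, 3),
                stringsAsFactors = FALSE)

res <- aggregate(d["score"], d["ID"], FUN = collapse_non_na)
# res$score is now a character vector: "5-7" for ID "a", "3" for ID "b"
```

The numeric column comes back as character, which is the type change mentioned above.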
Note: I edited my original answer to include "Age" in the key, thanks to @sebastian-c for pointing this out.
If "Age" is not part of the key, then
aggregate(b[-grep("^(ID)$", names(b))], b["ID"],
          FUN = function(x) if (all(is.na(x))) NA else x[!is.na(x)][1])
will work.
data
b <- read.table(text = "
ID Age Steatosis Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09 16 NA NA NA 5 NA
69 HA-09 16 <33% no/occasional NA NA 1")
A dplyr approach using summarise_all:
## using `na.strings` to identify NA entries in posted data
b <- read.table(text = "
ID Age Steatosis Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09 16 <NA> <NA> <NA> 5 NA
69 HA-09 16 <33% no/occasional <NA> NA 1", na.strings = c("NA", "<NA>"))
library(dplyr)
f <- function(x) {
  x <- na.omit(x)                          # drop missing values
  if (length(x) > 0) first(x) else NA      # first remaining value, or NA if none
}
res <- b %>% group_by(ID, Age) %>% summarise_all(f)
##Source: local data frame [1 x 7]
##Groups: ID [?]
##
## ID Age Steatosis Mallory Lille_dico Lille_3 Bili.AHHS2cat
## <fctr> <int> <fctr> <fctr> <lgl> <int> <int>
##1 HA-09 16 <33% no/occasional NA 5 1
The function f is defined this way to handle the case where all values are NA.
As @jdobres suggests, if there is more than one non-NA value per column that you want to merge, you may want to flatten them into a string representation using:
library(dplyr)
f <- function(x) {
  x <- na.omit(x)                                      # drop missing values
  if (length(x) > 0) paste(x, collapse = '-') else NA  # join the rest into one string
}
res <- b %>% group_by(ID, Age) %>% summarise_all(f)
In your posted data, the result would be the same as above because every summarized column has at most one non-NA value.
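For reference, the same per-group summary can be sketched with across(), assuming dplyr 1.0.0 or later is installed. This is an alternative spelling, not the code from the answer above; first_non_na and the toy data frame d are hypothetical.

```r
library(dplyr)

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}

# Toy data invented for this sketch
d <- data.frame(ID = c("a", "a"), Age = c(16L, 16L),
                s = c(NA, "<33%"), n = c(5L, NA),
                stringsAsFactors = FALSE)

# across(everything(), ...) applies the helper to every non-grouping column
res <- d %>%
  group_by(ID, Age) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")
```

The result has one row per ID/Age combination, with each column reduced to its first non-missing value.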
While I'm sure that it's possible with dplyr or tidyr, here's a data.table solution:
b <- read.table(text = "
ID Age Steatosis Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09 16 <NA> <NA> <NA> 5 NA
69 HA-09 16 <33% no/occasional <NA> NA 1",
na.strings = c("NA", "<NA>"))
keycols <- c("ID", "Age")
library(data.table)
b_dt <- data.table(b)
filter_nas <- function(x) {
  # all values missing: collapse to a single NA
  if (all(is.na(x))) {
    return(unique(x))
  }
  # otherwise keep the distinct non-missing values
  return(unique(x[!is.na(x)]))
}
b_dt[, lapply(.SD, filter_nas), by = mget(keycols)]
ID Age Steatosis Mallory Lille_dico Lille_3 Bili.AHHS2cat
1: HA-09 16 <33% no/occasional NA 5 1
Note that this only works if the keys are unique, that is, if each column holds at most one distinct non-missing value per ID/Age combination.
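Whether the data meet this requirement can be checked beforehand by counting the distinct non-missing values of each column within each key group. The sketch below uses base R on invented toy data; n_distinct_non_na, d, and the column names are hypothetical.

```r
# Hypothetical helper: number of distinct non-missing values in a vector
n_distinct_non_na <- function(x) length(unique(x[!is.na(x)]))

# Toy data invented for this sketch: ID "b" has two different values in v1
d <- data.frame(ID = c("a", "a", "b", "b"),
                v1 = c(NA, 1, 2, 3),
                v2 = c("x", NA, NA, NA),
                stringsAsFactors = FALSE)

# Per-ID counts for every value column
counts <- aggregate(d[c("v1", "v2")], d["ID"], FUN = n_distinct_non_na)

# IDs where at least one column has more than one distinct value
conflicts <- counts[rowSums(counts[c("v1", "v2")] > 1) > 0, "ID"]
```

In this toy data only ID "b" is flagged, so it would need separate handling before collapsing to one row per key.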
Llopis's request to keep both rows if a given ID has different information for a column complicates matters. First let's create some example data that illustrates the situation:
b <- read.table(text = "ID Age Steatosis Mallory Lille_dico Lille_3 Bili.AHHS2cat
HA-09 16 <NA> <NA> <NA> 5 NA
HA-09 16 <33% no/occasional <NA> NA 1
HA-10 20 no <NA> <NA> 2 NA
HA-10 20 yes <NA> 0 NA NA",
na.strings = c("NA", "<NA>"), header = TRUE)
ID Age Steatosis Mallory Lille_dico Lille_3 Bili.AHHS2cat
1 HA-09 16 <NA> <NA> NA 5 NA
2 HA-09 16 <33% no/occasional NA NA 1
3 HA-10 20 no <NA> NA 2 NA
4 HA-10 20 yes <NA> 0 NA NA
This can still be accomplished, but the custom function for summarization (let's call it f) gets a little more complicated:
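f <- function(x) {
  x <- x[!is.na(x$value), ]                    # drop rows with a missing value
  if (nrow(x) > 0) {
    y <- unique(x[colnames(x) != 'row.ID'])    # distinct rows, ignoring the old row.ID
    y$row.ID <- seq_len(nrow(y))               # renumber the remaining rows
    return(y)
  } else {
    return(data.frame())                       # nothing left for this group
  }
}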
Notice that this function references a column called "row.ID", which we will create before applying the function:
library(tidyverse) # gives access to dplyr and tidyr packages
b2 <- gather(b, variable, value, -ID, -Age) %>% # gather the many columns into a simplified key/value pair of columns (one called 'variable', the other, 'value') for each ID
group_by(ID, variable) %>% # perform subsequent operations per ID and variable
mutate(row.ID = 1:n()) %>% # add a row identifier
do(f(.)) %>% # apply our custom function
spread(variable, value, convert = TRUE) %>% # un-gather the variable/value columns
ungroup() # remove grouping metadata
ID Age row.ID Bili.AHHS2cat Lille_3 Lille_dico Mallory Steatosis
* <fctr> <int> <int> <int> <int> <int> <chr> <chr>
1 HA-09 16 1 1 5 NA no/occasional <33%
2 HA-10 20 1 NA 2 0 <NA> no
3 HA-10 20 2 NA NA NA <NA> yes