I\'m trying to filter a patient database based on specific ICD9 (diagnosis) codes. I would like to use a vector indicating the first 3 strings of the ICD9 codes.
The exa
This looks close to what you're looking for, but requires a bit more manipulation:
library(dplyr)
library(stringr)
library(tidyr)
obs2 <- observations %>%
gather(vars, value, -patient) %>%
filter(str_sub(value, 1, 3) %in% dx)
# A tibble: 2 × 3
patient vars value
<chr> <chr> <chr>
1 a var1 8661
2 b var2 8674
You can make a regex pattern from the interest vector and apply it to each column of your data frame except for the patient
id, use rowSums
to check if there is any var in a row match the pattern:
library(dplyr)
pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
pattern
# [1] "^(866|867)"
filter(observations, rowSums(sapply(observations[-1], grepl, pattern = pattern)) != 0)
# A tibble: 2 × 4
# patient var1 var2 var3
# <chr> <chr> <chr> <chr>
#1 a 8661 8651 2430
#2 b 865 8674 3456
Another option is to use Reduce
with lapply
:
filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
# A tibble: 2 × 4
# patient var1 var2 var3
# <chr> <chr> <chr> <chr>
#1 a 8661 8651 2430
#2 b 865 8674 3456
This approach works when you have more then two patterns and different patterns have different character length, for instance, if you have dx
as dx<-c("866","867", "9089")
:
dx<-c("866","867", "9089")
pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
pattern
# [1] "^(866|867|9089)"
filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
# A tibble: 3 × 4
# patient var1 var2 var3
# <chr> <chr> <chr> <chr>
#1 a 8661 8651 2430
#2 b 865 8674 3456
#3 c 8651 2866 9089
Check this and this stack answer for more about multiple or conditions in regex.
You can use apply and ldply
library(plyr)
filtered_obs <- apply(observations, 1, function(x) if(sum(substr(x,1,3) %in% dx)>0){x})
filtered_obs <- plyr::ldply(filtered_obs,rbind)
If you have variable number of characters then this should work-
filtered_obs <- lapply(dx, function(y)
{
plyr::ldply(apply(observations, 1, function(x)
{
if(sum(substr(x,1,nchar(y)) %in% y)>0){x}
}), rbind)
})
filtered_obs <- unique(plyr::ldply(filtered_obs,rbind))