Filter rows based on variables “beginning with” strings specified by vector

前端 未结 3 508
梦如初夏
梦如初夏 2021-01-22 08:00

I\'m trying to filter a patient database based on specific ICD9 (diagnosis) codes. I would like to use a vector indicating the first 3 strings of the ICD9 codes.

The exa

相关标签:
3条回答
  • 2021-01-22 08:28

    This looks close to what you're looking for, but requires a bit more manipulation:

    library(dplyr)
    library(stringr)
    library(tidyr)
    
    obs2 <- observations %>%
      gather(vars, value, -patient) %>%
      filter(str_sub(value, 1, 3) %in% dx)
    
    # A tibble: 2 × 3
      patient  vars value
        <chr> <chr> <chr>
    1       a  var1  8661
    2       b  var2  8674
    
    0 讨论(0)
  • 2021-01-22 08:30

    You can make a regex pattern from the interest vector and apply it to each column of your data frame except for the patient id, use rowSums to check if there is any var in a row match the pattern:

    library(dplyr)
    pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
    
    pattern
    # [1] "^(866|867)"
    
    filter(observations, rowSums(sapply(observations[-1], grepl, pattern = pattern)) != 0)
    
    # A tibble: 2 × 4
    #  patient  var1  var2  var3
    #    <chr> <chr> <chr> <chr>
    #1       a  8661  8651  2430
    #2       b   865  8674  3456
    

    Another option is to use Reduce with lapply:

    filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
    
    # A tibble: 2 × 4
    #  patient  var1  var2  var3
    #    <chr> <chr> <chr> <chr>
    #1       a  8661  8651  2430
    #2       b   865  8674  3456
    

    This approach works when you have more then two patterns and different patterns have different character length, for instance, if you have dx as dx<-c("866","867", "9089"):

    dx<-c("866","867", "9089")
    pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
    
    pattern
    # [1] "^(866|867|9089)"
    
    filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
    
    # A tibble: 3 × 4
    #  patient  var1  var2  var3
    #    <chr> <chr> <chr> <chr>
    #1       a  8661  8651  2430
    #2       b   865  8674  3456
    #3       c  8651  2866  9089
    

    Check this and this stack answer for more about multiple or conditions in regex.

    0 讨论(0)
  • 2021-01-22 08:41

    You can use apply and ldply

    library(plyr)
    filtered_obs <- apply(observations, 1, function(x) if(sum(substr(x,1,3) %in% dx)>0){x})
    filtered_obs <- plyr::ldply(filtered_obs,rbind)
    

    If you have variable number of characters then this should work-

    filtered_obs <- lapply(dx, function(y)
                     {
                      plyr::ldply(apply(observations, 1, function(x) 
                       {
                        if(sum(substr(x,1,nchar(y)) %in% y)>0){x}
                       }), rbind)
                     })
    
    filtered_obs <- unique(plyr::ldply(filtered_obs,rbind))
    
    0 讨论(0)
提交回复
热议问题