R grep regular expression using elements in a vector (FOLLOW UP)

被撕碎了的回忆 2021-01-26 23:58

Following up on this question, I have another example where I cannot use the accepted answer.

Again, I want to find each of the exact group elements in the

3 answers
  • 2021-01-27 00:11

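    Try this from the stringr package. Wrapping the pattern in coll() makes str_detect compare it as a literal string (using locale-aware collation rules) instead of as a regular expression, so "+" is no longer special. Note that str_detect is vectorised over both arguments, so groups is recycled along labs and each label is tested only against the group in the same recycled position: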

    > library(stringr)
    > str_detect(labs, coll(groups))
     [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
    [16]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
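
    If the goal is one result per group rather than an element-wise comparison, a regex-free sketch in base R is possible with endsWith() (available since R 3.3.0). The labs and groups below are a small hypothetical sample modelled on the labels in this thread, not the asker's full data:

    ```r
    # Hypothetical sample data modelled on the labels shown in this thread
    groups <- c("BC-89 + CN",
                "BC-89 + CN with 2% DD + 1.6% ZC")
    labs <- c("Beijing -- T0 -- BC-89 + CN",
              "Beijing -- T24 -- BC-89 + CN with 2% DD + 1.6% ZC",
              "Zhangjiakou -- T0 -- BC-89 + CN")

    # One logical vector per group: does each label end with that exact group?
    hits <- lapply(groups, function(g) endsWith(labs, g))

    # Turn each logical vector into the labels it selects
    matched <- lapply(hits, function(h) labs[h])
    names(matched) <- groups
    ```

    Because endsWith() compares plain strings, no character in the group names needs escaping.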
    
  • 2021-01-27 00:19

    "+" is a special character in regular expressions (a quantifier), so it must be escaped with a backslash. In an R string literal the backslash itself has to be doubled, i.e. "\\+":

    new_group <- gsub("\\+", replacement = "\\\\+", x = groups)
    

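    Also, "|" in a regular expression means "or" (alternation), so the escaped group names can be combined into a single pattern. Note that this returns every label that contains any of the groups, not a separate result per group:

    new_group1 <- paste0(new_group, collapse = "|")
    
    grep(pattern = new_group1, x = labs, value = TRUE)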
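
    Group names may contain other metacharacters too (for example the "." in "1.6%"). A minimal sketch of a more general escape, assuming a hypothetical helper named escape_regex (not part of base R) and a small sample of labs invented for illustration:

    ```r
    # escape_regex() is a hypothetical helper, not a base R function.
    # It puts a backslash in front of each regex metacharacter.
    escape_regex <- function(x) {
      gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
    }

    # Hypothetical sample data modelled on the labels shown in this thread
    groups <- c("BC-89 + CN",
                "BC-89 + CN with 2% DD + 1.6% ZC")
    labs <- c("Beijing -- T0 -- BC-89 + CN",
              "Beijing -- T24 -- BC-89 + CN with 2% DD + 1.6% ZC",
              "Zhangjiakou -- T0 -- BC-89 + CN")

    # Anchor each escaped group at the end of the label and match per group
    patterns <- paste0(escape_regex(groups), "$")
    by_group <- lapply(patterns, function(p) grep(p, labs, value = TRUE))
    ```

    Keeping one pattern per group, instead of collapsing them with "|", preserves which label belongs to which group.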
    
  • 2021-01-27 00:25

    Try

    lapply(groups, function(g)
      grep(gsub("\\+", "\\\\+", paste0(g, "$")), labs, value = TRUE))
    # [[1]]
    # [1] "Beijing -- T0 -- BC-89 + CN"     
    # [2] "Beijing -- T24 -- BC-89 + CN"    
    # [3] "Beijing -- T0 -- BC-89 + CN"     
    # [4] "Zhangjiakou -- T0 -- BC-89 + CN" 
    # [5] "Beijing -- T0 -- BC-89 + CN"     
    # [6] "Beijing -- T0 -- BC-89 + CN"     
    # [7] "Beijing -- T24 -- BC-89 + CN"    
    # [8] "Beijing -- T24 -- BC-89 + CN"    
    # [9] "Zhangjiakou -- T0 -- BC-89 + CN" 
    # [10] "Zhangjiakou -- T0 -- BC-89 + CN" 
    # [11] "Zhangjiakou -- T24 -- BC-89 + CN"
    # [12] "Zhangjiakou -- T24 -- BC-89 + CN"
    # 
    # [[2]]
    # [1] "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"     
    # [2] "Beijing -- T24 -- BC-89 + CN with 2% DD + 1.6% ZC"    
    # [3] "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"     
    # [4] "Zhangjiakou -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC" 
    # [5] "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"     
    # [6] "Beijing -- T24 -- BC-89 + CN with 2% DD + 1.6% ZC"    
    # [7] "Zhangjiakou -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC" 
    # [8] "Zhangjiakou -- T24 -- BC-89 + CN with 2% DD + 1.6% ZC"
    # 
    # [[3]]
    # [1] "Beijing -- T0 -- BC-89 with 2% Puricare + 5% Merquat + CN"    
    # [2] "Beijing -- T24 -- BC-89 with 2% Puricare + 5% Merquat + CN"   
    # [3] "Beijing -- T0 -- BC-89 with 2% Puricare + 5% Merquat + CN"    
    # [4] "Zhangjiakou -- T0 -- BC-89 with 2% Puricare + 5% Merquat + CN"
    

    The problem with your approach is that, e.g., groups[1] is "BC-89 + CN", which contains +, a quantifier in regular expressions ("one or more of the preceding item"), so the pattern no longer matches a literal plus sign. Adding fixed = TRUE to grep would fix that on its own, but then $ would also be treated literally and lose its end-of-string meaning. So I escaped + in the group names first and kept $ as an anchor.

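    Alternatively, following the approach in your linked answer, you could append "$" to both the pattern and the labels and match literally. Because grep would then see the modified labels, index labs with the matching positions instead of using value = TRUE, which would return the labels with the trailing "$" attached:

    lapply(groups, function(g)
      labs[grep(paste0(g, "$"), paste0(labs, "$"), fixed = TRUE)])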
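
    To see why the "$" has to be appended to both sides when fixed = TRUE is used, here is a small sketch on two labels taken from the output above (the variable names are chosen only for illustration):

    ```r
    g      <- "BC-89 + CN"
    lab    <- "Beijing -- T0 -- BC-89 + CN"
    longer <- "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"

    # With fixed = TRUE, "$" is an ordinary character, so nothing matches
    no_anchor <- grepl(paste0(g, "$"), lab, fixed = TRUE)

    # Appending the same "$" to the label turns it into an end-of-string sentinel
    with_sentinel <- grepl(paste0(g, "$"), paste0(lab, "$"), fixed = TRUE)

    # A label that only contains the group in the middle is rejected
    rejected <- grepl(paste0(g, "$"), paste0(longer, "$"), fixed = TRUE)
    ```

    This relies on the labels not containing a literal "$" themselves; if they might, a different sentinel character would be needed.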
    