R grep regular expression using elements in a vector (FOLLOW UP)


被撕碎了的回忆

Following up on this question, I have another example where I cannot use the accepted answer.

Again, I want to find each of the exact group elements in the


3 Answers

忘了有多久

2021-01-27 00:11
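Try this from the stringr package. The coll() modifier compares strings using locale-aware collation rules instead of treating the pattern as a regular expression, so characters such as + in the group names are matched literally. Note that str_detect() pairs labs with groups element by element, recycling groups along labs: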
```
> library(stringr)
> str_detect(labs,coll(groups))
 [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
[16]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
```
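To see why a literal comparison helps here, the following base R sketch contrasts regex matching with literal matching on a single label. It uses grepl() with fixed = TRUE as a stand-in for literal matching rather than stringr's coll(), and the label and group are illustrative values, since the real labs and groups vectors are not shown in this thread.

```r
# Hypothetical label and group; the real `labs` and `groups` are not shown here.
lab <- "Beijing -- T0 -- BC-89 + CN"
group <- "BC-89 + CN"

# As a regex, " +" means "one or more spaces", so the literal "+" in the
# label is never matched.
regex_hit <- grepl(group, lab)

# With fixed = TRUE the pattern is compared character by character.
literal_hit <- grepl(group, lab, fixed = TRUE)

print(c(regex = regex_hit, literal = literal_hit))
```

Literal matching solves the + problem but says nothing about where in the label the group occurs, so a short group can still be found inside a longer label.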
余生分开走

2021-01-27 00:19
+ is a special character in regex (it means "one or more of the preceding item"), so it has to be escaped as "\+", which is written "\\+" inside an R string.
```
new_group <- gsub("\\+", replacement = "\\\\+", x = groups)
```
Also, "|" acts as "or" in a regular expression, so the escaped groups can be combined into a single pattern.
```
new_group1 <- paste0(new_group,collapse = "|")

grep(pattern = new_group1, x = labs, value = TRUE)
```
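As a small worked example of the escaping and alternation steps, the sketch below runs them on two illustrative labels. The labs and groups values are assumptions modelled on the labels that appear elsewhere in this thread, not the asker's actual data.

```r
# Illustrative data (assumed, not the asker's real vectors).
labs <- c(
  "Beijing -- T0 -- BC-89 + CN",
  "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"
)
groups <- c("BC-89 + CN", "BC-89 + CN with 2% DD + 1.6% ZC")

# Escape every "+" so the regex engine treats it as a literal plus sign.
escaped <- gsub("\\+", "\\\\+", groups)

# Join the escaped groups with "|" to build one "match any group" pattern.
pattern <- paste0(escaped, collapse = "|")

hits <- grep(pattern, labs, value = TRUE)
print(hits)
```

Both labels are returned, because the shorter group also occurs inside the longer label. The combined pattern therefore answers "does this label contain any group?" rather than "which group does this label end with?".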

盖世英雄少女心

2021-01-27 00:25

Try

```
lapply(groups, function(g)
  grep(gsub("\\+", "\\\\+", paste0(g, "$")), labs, value = TRUE))
# [[1]]
# [1] "Beijing -- T0 -- BC-89 + CN"
# [2] "Beijing -- T24 -- BC-89 + CN"
# [3] "Beijing -- T0 -- BC-89 + CN"
# [4] "Zhangjiakou -- T0 -- BC-89 + CN"
# [5] "Beijing -- T0 -- BC-89 + CN"
# [6] "Beijing -- T0 -- BC-89 + CN"
# [7] "Beijing -- T24 -- BC-89 + CN"
# [8] "Beijing -- T24 -- BC-89 + CN"
# [9] "Zhangjiakou -- T0 -- BC-89 + CN"
# [10] "Zhangjiakou -- T0 -- BC-89 + CN"
# [11] "Zhangjiakou -- T24 -- BC-89 + CN"
# [12] "Zhangjiakou -- T24 -- BC-89 + CN"
#
# [[2]]
# [1] "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"
# [2] "Beijing -- T24 -- BC-89 + CN with 2% DD + 1.6% ZC"
# [3] "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"
# [4] "Zhangjiakou -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"
# [5] "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"
# [6] "Beijing -- T24 -- BC-89 + CN with 2% DD + 1.6% ZC"
# [7] "Zhangjiakou -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"
# [8] "Zhangjiakou -- T24 -- BC-89 + CN with 2% DD + 1.6% ZC"
#
# [[3]]
# [1] "Beijing -- T0 -- BC-89 with 2% Puricare + 5% Merquat + CN"
# [2] "Beijing -- T24 -- BC-89 with 2% Puricare + 5% Merquat + CN"
# [3] "Beijing -- T0 -- BC-89 with 2% Puricare + 5% Merquat + CN"
# [4] "Zhangjiakou -- T0 -- BC-89 with 2% Puricare + 5% Merquat + CN"
```

The problem with your approach is that, e.g., groups[1] is "BC-89 + CN", which contains +, a character with a special meaning in regular expressions. On its own, adding fixed = TRUE to grep would fix that, but then $ would be treated as a literal character and lose its anchoring effect. So I escaped the + in the group names first and kept the match in regex mode.
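The trade-off between fixed = TRUE and the $ anchor can be checked on a small case. This is a minimal sketch with two illustrative labels (assumed values modelled on the output above, not the full labs vector):

```r
# Illustrative labels (assumed); one ends with the group, one only contains it.
labs <- c(
  "Beijing -- T0 -- BC-89 + CN",
  "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"
)
g <- "BC-89 + CN"

# fixed = TRUE alone: "+" is literal, but nothing anchors the match to the
# end of the string, so both labels are returned.
unanchored <- grep(g, labs, value = TRUE, fixed = TRUE)

# fixed = TRUE plus "$": the dollar sign is now a literal character that
# neither label contains, so nothing is returned.
literal_dollar <- grep(paste0(g, "$"), labs, value = TRUE, fixed = TRUE)

# Escaping "+" keeps regex mode, so "$" still anchors at the end.
anchored <- grep(paste0(gsub("\\+", "\\\\+", g), "$"), labs, value = TRUE)

print(list(unanchored = unanchored,
           literal_dollar = literal_dollar,
           anchored = anchored))
```

Only the escaped, anchored version returns the single label that actually ends with the group.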

Alternatively, following the approach in your linked answer, you could append a literal $ to both the pattern and the labels and then match with fixed = TRUE:

```
lapply(groups, function(g)
  grep(paste0(g, "$"), paste0(labs, "$"), value = TRUE, fixed = TRUE))
```
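One detail of this variant is worth checking: with value = TRUE, grep returns elements of the modified vector, so the results carry the appended "$". The sketch below, again on assumed illustrative labels, shows that effect and two ways to get the original labels back. The endsWith() call is a swapped-in base R alternative (available from R 3.3.0), not something proposed in the thread.

```r
# Illustrative labels (assumed, not the asker's real vector).
labs <- c(
  "Beijing -- T0 -- BC-89 + CN",
  "Beijing -- T0 -- BC-89 + CN with 2% DD + 1.6% ZC"
)
g <- "BC-89 + CN"

# value = TRUE returns the strings that were searched, trailing "$" included.
with_dollar <- grep(paste0(g, "$"), paste0(labs, "$"),
                    value = TRUE, fixed = TRUE)

# Asking for indices and subsetting `labs` recovers the unmodified labels.
idx <- grep(paste0(g, "$"), paste0(labs, "$"), fixed = TRUE)
original <- labs[idx]

# endsWith() tests for a literal suffix directly, with no escaping needed.
via_endswith <- labs[endsWith(labs, g)]

print(with_dollar)
print(original)
print(via_endswith)
```

Wrapping either of the last two forms in lapply(groups, function(g) ...) would give one result per group, in the same shape as the answer's output.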
