R grep by regex - finding a string that contains a substring exactly once


暗喜

I am using R on Ubuntu and trying to go over a list of files. Some of them I need and some of them I don't.

I try to pick out the ones I need by finding a substring.


6 Answers

无人共我

2021-01-25 22:28
Detecting strings with `a` but not `aa`

You can use the following TRE regex:
```
^[^a]*a[^a]*$
```
It matches the start of the string (`^`), zero or more characters other than `a` (`[^a]*`), an `a`, again zero or more non-`a` characters, and the end of the string (`$`). For example:
```
a <- c("aca","cac","a", "abab", "ab-ab", "ab-cc-ab")
grep("^[^a]*a[^a]*$", a, value=TRUE)
## => [1] "cac" "a"
```
Finding Whole Word Containing `a` but not `aa`

To match whole words that contain exactly one `a` (and not two or more anywhere in the word), use this PCRE regex:
```
\b(?!\w*a\w*a)\w*a\w*\b
```

Explanation:
- `\b` - word boundary
- `(?!\w*a\w*a)` - a negative lookahead that fails the match if, right after the word boundary, there are 0+ word chars, an `a`, 0+ word chars and another `a`
- `\w*` - 0+ word chars
- `a` - an `a`
- `\w*` - 0+ word chars
- `\b` - trailing word boundary.

NOTE: Since `\w` matches letters, digits and underscores, you might want to change it to `\p{L}` or `[^\W\d_]` (which only match letters).

See this demo:
```
a <- c("aca","cac","a")
grep("\\b(?!\\w*a\\w*a)\\w*a\\w*\\b", a, perl=TRUE, value=TRUE)
## => [1] "cac" "a"  
```
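The same pattern can also pull the matching words out of running text instead of filtering whole strings. A minimal sketch, assuming a made-up sentence `txt` (not from the question) and using base R's `gregexpr` with `regmatches`:

```r
# Illustrative input; the sentence is an assumption, not data from the question.
txt <- "a cat sat on banana mat"

# Same PCRE pattern as above: whole words containing exactly one "a".
pat <- "\\b(?!\\w*a\\w*a)\\w*a\\w*\\b"

# gregexpr() locates every match; regmatches() extracts the matched text.
words <- regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]]
words
## => [1] "a"   "cat" "sat" "mat"
```

"on" has no `a` and "banana" has three, so both are skipped.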

别跟我提以往

2021-01-25 22:36

In base R you can find strings that contain a substring exactly once by removing the substring with gsub and testing whether the string became shorter by exactly the length of the substring:

```
s <- c("a", "aa", "aca", "", "b", "ba", "ab", "cac", "abab", "ab-ab", NA)
ss  <- "a" #Substring to find exactly once

nchar(s) - nchar(gsub(ss, "", s)) == nchar(ss)
#[1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE    NA
```

or you can count the hits of gregexpr:

```
sapply(gregexpr(ss, s), function(x) sum(x>0)) == 1
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE    NA
```
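If the substring is meant literally, as is often the case with pieces of file names, it may contain regex metacharacters such as `.`. A minimal sketch of the same counting idea with `fixed = TRUE`; the helper name `count_fixed` and the `files` vector are made up for illustration:

```r
# Hypothetical helper: count literal (non-regex) occurrences of `sub` in each string.
count_fixed <- function(x, sub) {
  vapply(gregexpr(sub, x, fixed = TRUE),
         function(m) sum(m > 0),   # gregexpr() returns -1 when there is no match
         integer(1))
}

files <- c("ab.t", "ab-ab.t", "abt", "a.b.t")  # illustrative names
count_fixed(files, ".")
# [1] 1 1 0 2
count_fixed(files, ".") == 1
# [1]  TRUE  TRUE FALSE FALSE
```

Without `fixed = TRUE` the `.` would match any character and every count would simply be the string length.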

or, as @sebastian-c already mentioned:

```
lengths(regmatches(s, gregexpr(ss, s))) == 1
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

or with two grepl calls, one asking whether the substring is present at least once and the other whether it is present at least twice:

```
!grepl("(.*a){2}", s) & grepl("a", s)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

or the same expressed in one regex, where (?!(.*a){2}) is a non-consuming (zero-width) negative lookahead:

```
grepl("^(?!(.*a){2}).*a.*$", s, perl=TRUE)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

or, more generally, in case you want to change it to find the substring exactly n times (the substring must occur at least n times but not n + 1 times; here n = 1):

```
!grepl("(.*a){2}", s) & grepl("(.*a){1}", s)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

grepl("^(?!(.*a){2})(.*a){1}.*$", s, perl=TRUE)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
```
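The two-grepl idea can be wrapped in a small function that takes n as a parameter. This is only a sketch: `occurs_exactly` is a hypothetical name, `pattern` is assumed to be a valid regex, and n is assumed to be at least 1.

```r
# Hypothetical helper: TRUE where `pattern` occurs exactly `n` times (n >= 1).
occurs_exactly <- function(x, pattern, n = 1) {
  # TRUE where `pattern` can be found at least k times
  at_least <- function(k) {
    grepl(sprintf("(.*(%s)){%d}", pattern, k), x, perl = TRUE)
  }
  at_least(n) & !at_least(n + 1)
}

s <- c("a", "aa", "aca", "", "b", "ba", "ab", "cac", "abab", "ab-ab", NA)
occurs_exactly(s, "a", 1)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
occurs_exactly(s, "ab", 2)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
```

The pattern is wrapped in its own group so that multi-character substrings such as "ab" are repeated as a unit.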

If you are looking for only a single character, you can use the solution from @wiktor-stribiżew:

```
grepl("^[^a]*a[^a]*$", s)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
```


感情败类

2021-01-25 22:39
As I said in the comments, grep looks for a pattern anywhere inside your string, and there is indeed an "a" (or "a{1}", which is the same for grep) in "aa". You need to add to the pattern that the "a" is followed by something other than "a": "a[^a]":
```
grep("a[^a]", c("aa", "ab"), value=TRUE)
#[1] "ab"
```
EDIT

Considering your specific problem, it seems you can try the "opposite": filter out the strings that contain more than one occurrence of the pattern, using a "capture" of the pattern:
```
!grepl("(ab).+\\1", c("ab.t", "ab-ab.t"))
#[1]  TRUE FALSE

!grepl("(ab).*\\1", c("ab", "ab-ab","ab-cc-ab", "abab"))
#[1]  TRUE FALSE FALSE FALSE
```
The parentheses capture the pattern (here ab, but it can be any regex), the .* stands for "anything" zero or more times, and the \\1 asks for a repeat of the text that was captured.
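The negated test on its own is also TRUE for strings that do not contain the pattern at all, so to keep only the strings with exactly one occurrence it can be combined with a plain presence check. A minimal sketch, with an illustrative `files` vector that is not from the question:

```r
files <- c("ab.t", "ab-ab.t", "cd.t", "ab-cc-ab.t")  # illustrative names

present  <- grepl("ab", files)            # at least one occurrence
repeated <- grepl("(ab).*\\1", files)     # two or more occurrences

!repeated
# [1]  TRUE FALSE  TRUE FALSE
files[present & !repeated]
# [1] "ab.t"
```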
粉色の甜心

2021-01-25 22:40
It looks like you're after strings with one a and no more, regardless of where it is in the string. While stringi can accomplish the task, a base solution would be:
```
s <- c("a", "aa", "aca", "", "b", "ba", "ab")

m <- gregexpr("a", s)
s[lengths(regmatches(s, m)) == 1]

#[1] "a"  "ba" "ab"
```
Alternatively, a regex-lite approach:
```
s[vapply(strsplit(s, ""), function(x) sum(x == "a") == 1, logical(1))]
#[1] "a"  "ba" "ab"
```
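Applied to the use case in the question, the counting approach can filter the result of `list.files`. The sketch below builds a throwaway directory so it can run anywhere; the file names and the substring `ab` are assumptions for illustration:

```r
# Create a temporary directory with a few empty files (illustrative names).
d <- tempfile("grep-demo-")
dir.create(d)
invisible(file.create(file.path(d, c("ab.t", "ab-ab.t", "cd.t"))))

f <- list.files(d)

# Keep the names in which the literal substring "ab" occurs exactly once.
keep <- f[lengths(regmatches(f, gregexpr("ab", f, fixed = TRUE))) == 1]
keep
# [1] "ab.t"

unlink(d, recursive = TRUE)  # clean up
```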

面向向阳花

2021-01-25 22:44

We can use stringi::stri_count:

```
library(stringi)
library(purrr)

# simulate some data
set.seed(1492)
(map_chr(1:10, function(i) {
  paste0(sample(letters, sample(10:30), replace=TRUE), collapse="")
}) -> strings)

## [1] "jdpcypoizdzvfzs"               "gyvcljnfmrzmdmkufq"           
## [3] "xqwrmnklbixnccwyaiadrsxn"      "bwbenawcwvdevmjfvs"           
## [5] "ytzwnpkuromfbklfsdnbwwnlrw"    "wclxpzftqgwxyetpsuslgohcdenuj"
## [7] "czkhanefss"                    "mxsrqrackxvimcxqcqsditrou"    
## [9] "ysqshvzjjmwes"                 "yzawyoqxqxiasensorlenafcbk" 

# How many "w"s in each string?
stri_count_regex(strings, "w{1}")

## [1] 0 0 2 3 4 2 0 0 1 1
```
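To go from counts to the strings that contain the character exactly once, the counts can be compared with 1 and used as an index. A minimal sketch using three of the strings above; the base R fallback is an addition for machines where stringi is not installed:

```r
strings <- c("ysqshvzjjmwes", "bwbenawcwvdevmjfvs", "czkhanefss")

if (requireNamespace("stringi", quietly = TRUE)) {
  # stri_count_fixed() counts literal occurrences of the pattern
  counts <- stringi::stri_count_fixed(strings, "w")
} else {
  # base R fallback: count the matches found by gregexpr()
  counts <- lengths(regmatches(strings, gregexpr("w", strings, fixed = TRUE)))
}

strings[counts == 1]
## [1] "ysqshvzjjmwes"
```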


别那么骄傲

2021-01-25 22:53
We can anchor the pattern with ^ and $ to make sure that the string consists of a single 'a' and nothing else
```
grep("^a$", a)
#[1] 1
```
It is not clear what the OP wanted.