I am using R on Ubuntu and trying to go over a list of files. Some of them I need and some I don't.
I am trying to pick out the ones I need by finding a substring: a but not aa.
You can use the following TRE regex:
^[^a]*a[^a]*$
It matches the start of the string (^), 0+ chars other than a ([^a]*), an a, again 0+ non-'a's, and the end of the string ($). See this IDEONE demo:
a <- c("aca","cac","a", "abab", "ab-ab", "ab-cc-ab")
grep("^[^a]*a[^a]*$", a, value=TRUE)
## => [1] "cac" "a"
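The same pattern can be built for any single character. A minimal sketch, assuming the character is an ordinary one (not a regex metacharacter such as ^ or ]); keep_single is a hypothetical helper name, not part of base R:

```r
# Hypothetical helper: keep the strings that contain the single character
# `ch` exactly once. Assumes `ch` is one ordinary character.
keep_single <- function(x, ch) {
  # Build "^[^c]*c[^c]*$" for the requested character
  pattern <- sprintf("^[^%s]*%s[^%s]*$", ch, ch, ch)
  grep(pattern, x, value = TRUE)
}

keep_single(c("aca", "cac", "a", "abab"), "a")
keep_single(c("x-y", "x-y-z", "xyz"), "-")
```

This only works for a single character, because a negated character class cannot exclude a multi-character substring.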
If you need to match words that contain exactly one a (that is, a but not aa, or any other two or more a's in any position), use this PCRE regex:
\b(?!\w*a\w*a)\w*a\w*\b
See this regex demo.
Explanation:
\b - word boundary
(?!\w*a\w*a) - a negative lookahead failing the match if there are 0+ word chars, a, 0+ word chars and a again right after the word boundary
\w* - 0+ word chars
a - an a
\w* - 0+ word chars
\b - trailing word boundary.
NOTE: Since \w matches letters, digits and underscores, you might want to change it to \p{L} or [^\W\d_] (only matches letters).
See this demo:
a <- c("aca","cac","a")
grep("\\b(?!\\w*a\\w*a)\\w*a\\w*\\b", a, perl=TRUE, value=TRUE)
## => [1] "cac" "a"
In base R you can test whether a string contains a substring exactly once by removing the substring with gsub
and checking whether the string became shorter by exactly the length of the searched substring:
s <- c("a", "aa", "aca", "", "b", "ba", "ab", "cac", "abab", "ab-ab", NA)
ss <- "a" #Substring to find exactly once
nchar(s) - nchar(gsub(ss, "", s)) == nchar(ss)
#[1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE NA
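Note that gsub treats ss as a regular expression by default. If the substring should be taken literally (a file extension such as .t, say), fixed = TRUE avoids surprises. A minimal sketch; contains_once and the files vector are made up for illustration:

```r
# Hypothetical helper: TRUE where the literal substring `sub` occurs
# exactly once in each element of `x`.
contains_once <- function(x, sub) {
  # fixed = TRUE: `sub` is matched literally, not as a regex
  removed <- gsub(sub, "", x, fixed = TRUE)
  nchar(x) - nchar(removed) == nchar(sub)
}

files <- c("ab.t", "ab-ab.t", "abxt")   # made-up file names
contains_once(files, "ab")   # "ab" occurs twice in "ab-ab.t"
contains_once(files, ".t")   # "." is literal here, so "abxt" does not count
```

Without fixed = TRUE, the pattern ".t" would also match the "xt" in "abxt".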
or you count the hits of gregexpr
sapply(gregexpr(ss, s), function(x) sum(x>0)) == 1
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE NA
or as @sebastian-c already mentioned
lengths(regmatches(s, gregexpr(ss, s))) == 1
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
or with two grepl calls,
one asking whether the substring is present at least once, the other whether it is there at least twice:
!grepl("(.*a){2}", s) & grepl("a", s)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
or the same expressed in one regex, where (?!(.*a){2}) is a non-consuming (zero-width) negative lookahead:
grepl("^(?!(.*a){2}).*a.*$", s, perl=TRUE)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
or, more generally, in case you want to find the substring exactly n times (use {n} where {1} appears below and {n+1} where {2} appears):
!grepl("(.*a){2}", s) & grepl("(.*a){1}", s)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
grepl("^(?!(.*a){2})(.*a){1}.*$", s, perl=TRUE)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
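The two-grepl version can be wrapped in a function that takes n as an argument. This is only a sketch: matches_exactly_n is a hypothetical name, and it assumes n >= 1 and that pattern is a simple regex without top-level alternation (otherwise it would need its own group).

```r
# Hypothetical helper: TRUE where `pattern` occurs exactly `n` times.
matches_exactly_n <- function(x, pattern, n = 1) {
  # "(.*a){n}" succeeds when the pattern can be found at least n times
  at_least_n      <- grepl(sprintf("(.*%s){%d}", pattern, n), x)
  at_least_n_plus <- grepl(sprintf("(.*%s){%d}", pattern, n + 1), x)
  at_least_n & !at_least_n_plus
}

s <- c("a", "aa", "aca", "", "b", "ba", "ab", "cac", "abab", "ab-ab", NA)
matches_exactly_n(s, "a", 1)
matches_exactly_n(s, "a", 2)
```

As with the plain grepl calls, NA inputs come out as FALSE rather than NA.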
In case you are looking for only one character, you can use the solution from @wiktor-stribiżew:
grepl("^[^a]*a[^a]*$", s)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
As I said in the comments, grep looks for a pattern inside your string, and there is indeed an "a" (or "a{1}", which is the same for grep) in "aa". You need to add to the pattern that the "a" is followed by something other than an "a": "a[^a]":
grep("a[^a]", c("aa", "ab"), value=TRUE)
#[1] "ab"
EDIT
Considering your specific problem, it seems you can try the "opposite": filter out the strings that contain more than one occurrence of the pattern, using a "capture" of the pattern:
!grepl("(ab).+\\1", c("ab.t", "ab-ab.t"))
#[1] TRUE FALSE
!grepl("(ab).*\\1", c("ab", "ab-ab","ab-cc-ab", "abab"))
#[1] TRUE FALSE FALSE FALSE
The parentheses capture the pattern (here ab, but it can be any regex), the .* stands for "anything" zero or more times, and the \\1 asks for a repeat of the text that was captured.
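One detail worth illustrating: the backreference repeats the exact text the group matched, not the group's regex. A small sketch with a made-up vector x; perl = TRUE is an assumption made here to pin the engine to PCRE, whereas the calls above use the default engine.

```r
x <- c("ab-ac", "ab-ab")   # made-up examples

# The regex "a." matches twice in both strings ...
lengths(regmatches(x, gregexpr("a.", x)))

# ... but \\1 only matches a repeat of the exact captured text,
# so only "ab-ab" counts as containing the capture twice
grepl("(a.).*\\1", x, perl = TRUE)
```

So for a pattern that can match different texts, counting matches and testing for a repeated capture are not the same check.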
It looks like you're after strings with one a and no more, regardless of where it is in the string. While stringi can accomplish the task, a base solution would be:
s <- c("a", "aa", "aca", "", "b", "ba", "ab")
m <- gregexpr("a", s)
s[lengths(regmatches(s, m)) == 1]
# [1] "a"  "ba" "ab"
Alternatively, a regex-lite approach:
s[vapply(strsplit(s, ""), function(x) sum(x == "a") == 1, logical(1))]
# [1] "a"  "ba" "ab"
We can use stringi::stri_count:
library(stringi)
library(purrr)
# simulate some data
set.seed(1492)
(map_chr(1:10, function(i) {
paste0(sample(letters, sample(10:30), replace=TRUE), collapse="")
}) -> strings)
## [1] "jdpcypoizdzvfzs" "gyvcljnfmrzmdmkufq"
## [3] "xqwrmnklbixnccwyaiadrsxn" "bwbenawcwvdevmjfvs"
## [5] "ytzwnpkuromfbklfsdnbwwnlrw" "wclxpzftqgwxyetpsuslgohcdenuj"
## [7] "czkhanefss" "mxsrqrackxvimcxqcqsditrou"
## [9] "ysqshvzjjmwes" "yzawyoqxqxiasensorlenafcbk"
# How many "w"s in each string?
stri_count_regex(strings, "w{1}")
## [1] 0 0 2 3 4 2 0 0 1 1
We can try anchoring with ^ and $ to make sure that the string consists of a single 'a' and nothing else:
grep("^a$", a)
#[1] 1
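For contrast, a small sketch with a made-up vector x, comparing this anchored pattern with the negated-character-class pattern given in another answer:

```r
x <- c("a", "ba", "aa")   # made-up examples

grep("^a$", x)             # only the element that is exactly "a"
grep("^[^a]*a[^a]*$", x)   # elements with exactly one "a" anywhere
```

Which of the two is appropriate depends on whether other characters are allowed around the single a.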
It is not clear what the OP wanted.