R grep by regex - finding a string that contains a substring exactly once

暗喜 2021-01-25 22:10

I am using R on Ubuntu and trying to go over a list of files. Some of them I need and some of them I don't.

I try to get the ones I need by finding a substring

6 Answers
  • 2021-01-25 22:28

    Detecting strings with a but not aa

    You can use the following TRE regex:

    ^[^a]*a[^a]*$
    

    It matches the start of the string (^), 0+ chars other than a ([^a]*), an a, again 0+ non-'a's and the end of string ($). See this IDEONE demo:

    a <- c("aca","cac","a", "abab", "ab-ab", "ab-cc-ab")
    grep("^[^a]*a[^a]*$", a, value=TRUE)
    ## => [1] "cac" "a"
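
The same pattern can be built for any single character. The following is a minimal sketch, not part of the original answer: `has_exactly_one()` is a hypothetical helper, and it assumes `ch` is one plain letter or digit (characters such as `]`, `^`, `-` or `\` would need escaping inside the bracket expression).

```r
# Hypothetical helper: TRUE where `x` contains the character `ch` exactly once.
# Assumption: `ch` is a single plain letter or digit.
has_exactly_one <- function(x, ch) {
  pattern <- sprintf("^[^%s]*%s[^%s]*$", ch, ch, ch)
  grepl(pattern, x)
}

a <- c("aca", "cac", "a", "abab", "ab-ab", "ab-cc-ab")
one_a <- a[has_exactly_one(a, "a")]
one_c <- a[has_exactly_one(a, "c")]
print(one_a)
print(one_c)
```

Here `one_a` holds the strings with a single a and `one_c` those with a single c.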
    

    Finding Whole Word Containing a but not aa

    Use this if you need to match whole words that contain exactly one a, and not two or more, at any position.

    Use this PCRE regex:

    \b(?!\w*a\w*a)\w*a\w*\b
    

    See this regex demo.

    Explanation:

    • \b - word boundary
    • (?!\w*a\w*a) - a negative lookahead failing the match if there are 0+ word chars, a, 0+ word chars and a again right after the word boundary
    • \w* - 0+ word chars
    • a - an a
    • \w* - 0+ word chars
    • \b - trailing word boundary.

    NOTE: Since \w matches letters, digits and underscores, you might want to change it to \p{L} or [^\W\d_] (only matches letters).

    See this demo:

    a <- c("aca","cac","a")
    grep("\\b(?!\\w*a\\w*a)\\w*a\\w*\\b", a, perl=TRUE, value=TRUE)
    ## => [1] "cac" "a"  
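
Because of the word boundaries, the PCRE pattern can also pull matching words out of a longer text. This is a sketch under the assumption that words are separated by spaces or punctuation; the sample sentence is made up for illustration.

```r
# Hypothetical sample text
txt <- "a banana and a cat on the mat"

# Same PCRE pattern as above: a word with one "a" and no second "a"
pat <- "\\b(?!\\w*a\\w*a)\\w*a\\w*\\b"

# gregexpr() finds every match, regmatches() extracts the matched words
words <- regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]]
print(words)
```

"banana" is skipped because the lookahead sees a second a, and "on" and "the" are skipped because they contain no a at all.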
    
  • 2021-01-25 22:36

    In base R you can find the strings that contain a substring exactly once by removing the substring with gsub and testing whether the string became shorter by exactly the length of the substring:

    s <- c("a", "aa", "aca", "", "b", "ba", "ab", "cac", "abab", "ab-ab", NA)
    ss  <- "a" #Substring to find exactly once
    
    nchar(s) - nchar(gsub(ss, "", s)) == nchar(ss)
    #[1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE    NA
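
One thing to watch with this approach is that gsub treats `ss` as a regular expression by default. If the substring contains regex metacharacters (a dot in a file name, for instance), passing `fixed = TRUE` keeps the comparison literal. A small sketch with made-up inputs:

```r
# Hypothetical inputs: the dot in ss should be a literal dot
s  <- c("a.b", "a.b-a.b", "axb", "ab")
ss <- "a.b"

# fixed = TRUE makes gsub() treat ss as plain text rather than as a regex
once_fixed <- nchar(s) - nchar(gsub(ss, "", s, fixed = TRUE)) == nchar(ss)

# Without fixed = TRUE the dot matches any character, so "axb" also counts
once_regex <- nchar(s) - nchar(gsub(ss, "", s)) == nchar(ss)

print(once_fixed)
print(once_regex)
```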
    

    or count the matches returned by gregexpr

    sapply(gregexpr(ss, s), function(x) sum(x>0)) == 1
    # [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE    NA
    

    or as @sebastian-c already mentioned

    lengths(regmatches(s, gregexpr(ss, s))) == 1
    # [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
    

    or with two grepl calls, one asking whether the substring is present at least once and the other whether it is present at least twice:

    !grepl("(.*a){2}", s) & grepl("a", s)
    # [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
    

    or the same explained in one regex, where (?!(.*a){2}) is a non consuming (zero-width) negative lookahead

    grepl("^(?!(.*a){2}).*a.*$", s, perl=TRUE)
    # [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
    

    or, more generally, in case you want to find the substring exactly n times (replace {1} with {n} and {2} with {n+1})

    !grepl("(.*a){2}", s) & grepl("(.*a){1}", s)
    # [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
    
    grepl("^(?!(.*a){2})(.*a){1}.*$", s, perl=TRUE)
    # [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
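
The "at least n but not n + 1" idea can be wrapped in a small function. This is only a sketch: `has_exactly_n()` is a hypothetical name, and it assumes `ss` is a simple pattern without top-level alternation (something like `a|b` would need its own parentheses).

```r
# Hypothetical helper: TRUE where the pattern `ss` occurs exactly `n` times in `x`
has_exactly_n <- function(x, ss, n) {
  n <- as.integer(n)
  at_least_n      <- grepl(sprintf("(.*%s){%d}", ss, n), x)
  at_least_n_plus <- grepl(sprintf("(.*%s){%d}", ss, n + 1L), x)
  at_least_n & !at_least_n_plus
}

s <- c("a", "aa", "aca", "", "b", "ba", "ab")
exactly_1 <- has_exactly_n(s, "a", 1)
exactly_2 <- has_exactly_n(s, "a", 2)
print(exactly_1)
print(exactly_2)
```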
    

    If you are looking for only a single character, you can use the solution from @wiktor-stribiżew

    grepl("^[^a]*a[^a]*$", s)
    # [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
    
  • 2021-01-25 22:39

    As I said in the comments, grep looks for the pattern anywhere inside the string, and "aa" does indeed contain "a" (or "a{1}", which is the same thing for grep). You need to add to the pattern that the "a" is followed by something other than an a: "a[^a]":

    grep("a[^a]", c("aa", "ab"), value=TRUE)
    #[1] "ab"
    

    EDIT

    Considering your specific problem, you can try the "opposite" approach: filter out the strings that contain more than one occurrence of the pattern, using a "capture" of the pattern:

    !grepl("(ab).+\\1", c("ab.t", "ab-ab.t"))
    #[1]  TRUE FALSE
    
    !grepl("(ab).*\\1", c("ab", "ab-ab","ab-cc-ab", "abab"))
    #[1]  TRUE FALSE FALSE FALSE
    

    The parentheses capture the pattern (here ab, but it can be any regex), the .* stands for "anything", zero or more times, and the \\1 asks for a repeat of the captured text.
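
Note that the negated backreference test alone also keeps strings in which the pattern never occurs. For the file-list use case from the question, it can be combined with a plain presence check. A sketch with hypothetical file names:

```r
# Hypothetical file names; keep those where "ab" occurs exactly once
files <- c("ab.txt", "ab-ab.txt", "cd.txt", "x-ab-y.txt", "abab.txt")

present  <- grepl("ab", files, fixed = TRUE)  # "ab" occurs at least once
repeated <- grepl("(ab).*\\1", files)         # "ab" occurs a second time

kept <- files[present & !repeated]
print(kept)
```

"cd.txt" is dropped by the presence check, while "ab-ab.txt" and "abab.txt" are dropped by the backreference test.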

  • 2021-01-25 22:40

    It looks like you're after strings with one a and no more, regardless where in the string. While stringi can accomplish the task, a base solution would be:

    s <- c("a", "aa", "aca", "", "b", "ba", "ab")
    
    m <- gregexpr("a", s)
    s[lengths(regmatches(s, m)) == 1]
    
    # [1] "a"  "ba" "ab"
    

    Alternatively, a regex-lite approach:

    s[vapply(strsplit(s, ""), function(x) sum(x == "a") == 1, logical(1))]
    # [1] "a"  "ba" "ab"
    
  • 2021-01-25 22:44

    We can use stringi::stri_count_regex:

    library(stringi)
    library(purrr)
    
    # simulate some data
    set.seed(1492)
    (map_chr(1:10, function(i) {
      paste0(sample(letters, sample(10:30), replace=TRUE), collapse="")
    }) -> strings)
    
    ## [1] "jdpcypoizdzvfzs"               "gyvcljnfmrzmdmkufq"           
    ## [3] "xqwrmnklbixnccwyaiadrsxn"      "bwbenawcwvdevmjfvs"           
    ## [5] "ytzwnpkuromfbklfsdnbwwnlrw"    "wclxpzftqgwxyetpsuslgohcdenuj"
    ## [7] "czkhanefss"                    "mxsrqrackxvimcxqcqsditrou"    
    ## [9] "ysqshvzjjmwes"                 "yzawyoqxqxiasensorlenafcbk" 
    
    # How many "w"s in each string?
    stri_count_regex(strings, "w{1}")
    
    ## [1] 0 0 2 3 4 2 0 0 1 1
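
The counts can then be turned into a filter by comparing them with 1. The sketch below assumes the stringi package is installed and uses a small made-up vector instead of the random strings above; `stri_count_fixed()` is the literal-text counterpart of `stri_count_regex()`.

```r
library(stringi)

# Hypothetical sample data (not the random strings from the answer)
strings <- c("window", "sword", "apple", "wow", "we")

# Count literal, non-overlapping occurrences of "w" in each string
counts <- stri_count_fixed(strings, "w")

# Keep only the strings that contain "w" exactly once
exactly_one <- strings[counts == 1]
print(exactly_one)
```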
    
  • 2021-01-25 22:53

    We can anchor the pattern with ^ and $ so that it matches only when the whole string is a single 'a':

    grep("^a$", a)
    #[1] 1
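
To make the difference concrete, the fully anchored pattern can be compared with the "one a anywhere" pattern from the earlier answer. The input vector here is an assumption chosen for illustration:

```r
# Hypothetical input
a <- c("a", "ba", "aa", "cac")

# Matches only the string that is exactly "a"
whole_string <- grep("^a$", a, value = TRUE)

# Matches any string that contains a single "a" somewhere
single_a <- grep("^[^a]*a[^a]*$", a, value = TRUE)

print(whole_string)
print(single_a)
```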
    

    It is not clear what the OP wanted.
