String split with conditions in R

一向 2021-02-04 00:05

I have this mystring with the delimiter _. The condition here is if there are two or more delimiters, I want to split at the second delimiter and if th

7 answers
  • 2021-02-04 00:42

    Perl/PCRE has a branch reset feature, written (?|...), that lets capturing groups in different alternatives share the same group number, so whichever alternative matches is treated as one capturing group.

    IMO, this feature is elegant when you want to supply different alternatives.

    x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam', 
           'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
           'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')
    
    sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl = TRUE)
    # [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
    # [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"  
    
  • 2021-02-04 00:43

    gregexpr finds every match of a pattern in each string and returns the match positions.

    First, we use gregexpr to find the location of all _ in each element of mystring. Then, we loop through that output and extract the index of second _ within each element of mystring. If there is no second _, it'll return an NA (check inds in the example below).

    After that, we can either extract the relevant part using substr based on the extracted index or, if there is NA, we can split the string at .ReCal and keep only the first part.

    inds = sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
    ifelse(!is.na(inds),
           substr(mystring, 1, inds - 1), 
           sapply(strsplit(mystring, ".ReCal", fixed = TRUE), '[', 1))
    #[1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4" 
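
    To make the intermediate step visible, here is a small sketch that prints inds. It assumes mystring holds the first four sample file names from the vector x in the earlier answer, since the question's own definition is cut off.

    ```r
    # Assumption: mystring is the first four sample file names used in this thread.
    mystring <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam",
                  "MODY_116.3_C2RX-1-10.ReCal.sort.bam", "MODY_116.4.ReCal.sort.bam")

    # Position of the second "_" in each element; NA when there is only one "_".
    inds <- sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
    inds
    # [1] NA 12 11 NA
    ```

    Elements 1 and 4 contain a single underscore, so they fall through to the .ReCal branch of the ifelse.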
    
  • 2021-02-04 00:45
    gsub('^(.*\\.\\d+).*','\\1',mystring)
    # [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"
    
  • 2021-02-04 00:50
    ^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$
    

    You can do this with gsub and a fairly simple regex: replace the match with \\1. See the demo.

    https://regex101.com/r/wL4aB6/1
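
    As a rough sketch of how the pattern above might be applied in R: the variable name pattern is hypothetical, mystring is assumed to be the four-element sample implied by the outputs in this thread, and perl = TRUE is an assumption made here so that the \n inside the character classes is interpreted by PCRE.

    ```r
    # Assumption: mystring is the four-element sample used in the other answers.
    mystring <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam",
                  "MODY_116.3_C2RX-1-10.ReCal.sort.bam", "MODY_116.4.ReCal.sort.bam")

    # The pattern from this answer; group 1 is the part to keep.
    pattern <- "^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$"

    # The pattern is anchored at both ends, so a single substitution is enough.
    res <- sub(pattern, "\\1", mystring, perl = TRUE)
    res
    # [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"
    ```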

  • 2021-02-04 01:01

    With the stringr package:

    library(stringr)
    str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
    # [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
    

    It also works with more than two delimiters.
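
    A quick check of that claim, sketched with base R instead of stringr (regexpr and regmatches with perl = TRUE are swapped in here as an assumption, so the example runs without extra packages). The first input is the multi-delimiter sample from the earlier answer's vector x.

    ```r
    # One string with four delimiters and one with a single delimiter.
    x <- c("MODY_116.4_asdfsadf_1212_asfsdf", "MODY_60.2.ReCal.sort.bam")

    # Same pattern as in the str_extract call; lookaheads need perl = TRUE in base R.
    pattern <- ".*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)"

    # regexpr finds the first match per element; regmatches extracts the matched text.
    out <- regmatches(x, regexpr(pattern, x, perl = TRUE))
    out
    # [1] "MODY_116.4" "MODY_60.2"
    ```

    Note that regmatches drops elements with no match, whereas str_extract returns NA for them.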

  • 2021-02-04 01:04

    A little longer, but needs less regular expression knowledge:

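    library(stringr)
    # One matrix of start/end positions per element of mystring
    indx <- str_locate_all(mystring, "_")
    
    for (i in seq_along(indx)) {
      if (nrow(indx[[i]]) < 2) {
        # Fewer than two delimiters: keep everything before ".ReCal"
        mystring[i] <- strsplit(mystring[i], ".ReCal", fixed = TRUE)[[1]][1]
      } else {
        # Two or more delimiters: keep everything before the second "_"
        mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2, "start"] - 1)
      }
    }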
    