R: find largest common substring starting at the beginning

后端 未结 11 2677
星月不相逢
星月不相逢 2021-02-19 18:33

I\'ve got 2 vectors:

word1 <- \"bestelling\"   
word2 <- \"bestelbon\"

Now I want to find the largest common substring that starts at the

相关标签:
11条回答
  • 2021-02-19 18:48

    A bit of regex can do this:

    sub('^([^|]*)[^|]*(?:\\|\\1[^|]*)$', '\\1', paste0(word1, '|', word2))
    #[1] "bestel"
    

    I used | as a separator - pick one that makes sense for your strings.

    0 讨论(0)
  • 2021-02-19 18:49
    flodel <- function(word1, word2) {
       # the length of the shorter word
       n <- min(nchar(word1), nchar(word2))
       # two vectors of characters of the same length n
       c1 <- strsplit(word1, "", fixed = TRUE)[[1]][1:n]
       c2 <- strsplit(word2, "", fixed = TRUE)[[1]][1:n]
       # a vector that is TRUE as long as the characters match
       m <- as.logical(cumprod(c1 == c2))
       # the answer
       paste(c1[m], collapse = "")
    }
    
    0 讨论(0)
  • 2021-02-19 18:49

    I realize I'm coming late to this party but determining pairwise alignment is a fundamental problem in biological research and there is already a package (or a package-family) that attacks this problem. The Bioconductor package named Biostrings is available (and it is big at least if you install all the default dependencies, so patience is needed in the install process). It returns S4 objects so different extraction functions are needed. This is perhaps a sledgehammer to extract a nut, but here's the code to give the desired result:

    install.packages("Biostrings", repo="http://www.bioconductor.org/packages/2.14/bioc/", dependencies=TRUE)
    library(Biostrings)
    psa1 <- pairwiseAlignment(pattern = c(word1) ,word2,type="local")
    psa1@pattern
    #[1] bestel 
    

    However, it is not set up to default to restriction of the match to alignment at the first character for both strings. We can hope @MartinMorgan will come along a fix my errors.

    0 讨论(0)
  • 2021-02-19 18:51

    As much as I generally avoid a for loop in R - given you're starting at the beginning and continuing until you find the solution it seemed an easy approach.

    It's a bit more intuitive than some of the other examples I think

    lcsB <- function(string1, string2) {
        x <- ''
        for (i in 1:nchar(string1)){
            if (substr(string1[1],1,i) == substr(string2[1],1,i)) {
                x <- substr(string1[1],1,i)
            }
            else
                return(x)
            }
        return(x)
    }
    
    lcsB("bestelling", "bestelbon")
    lcsB("bestelling", "stel")
    
    0 讨论(0)
  • 2021-02-19 18:52

    This seems to work

    longestprefix<-function(a,b) {
        n <- pmin(nchar(a), nchar(b))
        mapply(function(x, y, n) {
            rr<-rle(x[1:n]==y[1:n])
            if(rr$values[1]) {
                paste(x[1:rr$lengths[1]], collapse="")
            } else {
                ""
            }
        }, strsplit(a, ""), strsplit(b,""), n)
    }
    
    
    
    longestprefix("bestelling", "bestelbon")
    # [1] "bestel"
    longestprefix("bestelling", "stel")
    # [1] ""
    
    0 讨论(0)
  • 2021-02-19 18:55

    Matthew Plourde called, and Mr. Benchmarker responds!
    Sorry, BondedDust, but I can't get to bioconductor from behind workplace walls.

    library(microbenchmark)
    wfoo1 <-'bestelling'
    wfoo2<-'bestelbon'
    
    
    microbenchmark(stu(wfoo1,wfoo2),nathan(wfoo1,wfoo2),plourde(),scriven(wfoo1,wfoo2),dmt(wfoo1,wfoo2),mrflick(wfoo1,wfoo2),roland(c(wfoo1,wfoo2)))
    Unit: microseconds
                        expr     min       lq   median       uq
           stu(wfoo1, wfoo2) 171.905 183.0230 187.5135 191.1490
        nathan(wfoo1, wfoo2)  35.921  42.3360  43.6180  46.1840
                   plourde() 551.208 581.3545 591.6175 602.5220
       scriven(wfoo1, wfoo2)  16.678  21.1680  22.6645  23.7335
           dmt(wfoo1, wfoo2)  79.966  86.1665  88.7325  91.5125
       mrflick(wfoo1, wfoo2) 100.492 108.4030 111.1830 113.9625
     roland(c(wfoo1, wfoo2)) 215.950 226.8545 231.7725 237.5455
         max neval
     435.321   100
      59.012   100
     730.809   100
      85.525   100
     286.081   100
     466.537   100
     291.213   100
    

    I think it's incumbent on me to modify these functions so they measure an input word against, say, a vector of 1000 reference words (rather than just a single pair) to see how that speed test goes. Maybe later.

    Later... :-). I didn't make loops,but I tried it out on long words:

    EDIT: this was, as flodel points out, a typo, which led to testing a rather long vector of very short words!

    wfoo1 <-rep(letters,100)
    wfoo2<-c(rep(letters,99),'foo')
    Unit: microseconds
                        expr        min          lq      median
           stu(wfoo1, wfoo2)  31215.243  32718.5535  35270.6110
        nathan(wfoo1, wfoo2)    202.266    216.3780    227.2825
                   plourde()    569.168    617.0615    661.5340
       scriven(wfoo1, wfoo2)    794.953    828.3070    847.5505
           dmt(wfoo1, wfoo2)   1081.033   1156.9365   1205.8990
       mrflick(wfoo1, wfoo2) 126058.316 131283.4485 241018.5150
     roland(c(wfoo1, wfoo2))    946.759   1004.4885   1045.3260
              uq        max neval
     146451.2595 167000.713   100
        236.0485    356.211   100
        694.6750    795.381   100
        868.9310   1021.594   100
       1307.6740 116075.442   100
     246739.6910 991550.586   100
       1082.1020   1243.103   100
    

    Sorry Richard, but looks like you need to give your chicken dinner to Nathan.

    EDIT2: made sure the inputs were single words, and added flodel's code to the pile.

    Edited the "plourde" function to accept inputs and reran the longword case

    wfoo1 <-paste(rep(letters,100),collapse='')
    wfoo2<-paste(c(rep(letters,99),'foo'),collapse='')
    

    Looks like 3 folks' code perform similarly, so just as in Tour de France, I give the first-place award to mrflick, dmt, and flodel.

     microbenchmark(stu(wfoo1,wfoo2),nathan(wfoo1,wfoo2),plourde(c(wfoo1,wfoo2)),scriven(wfoo1,wfoo2),dmt(wfoo1,wfoo2),mrflick(wfoo1,wfoo2),roland(c(wfoo1,wfoo2)),flodel(wfoo1,wfoo2) )
    Unit: microseconds
                         expr        min          lq     median
            stu(wfoo1, wfoo2)  17786.578  18243.2795  18420.317
         nathan(wfoo1, wfoo2)  36651.195  37703.3625  38095.493
     plourde(c(wfoo1, wfoo2)) 183616.029 187673.5350 190706.457
        scriven(wfoo1, wfoo2)  17546.253  17994.1890  18244.990
            dmt(wfoo1, wfoo2)    737.651    781.0550    821.466
        mrflick(wfoo1, wfoo2)    870.643    951.4630    976.479
      roland(c(wfoo1, wfoo2))  99540.947 102644.2115 103654.258
         flodel(wfoo1, wfoo2)    666.239    705.5795    717.553
             uq         max neval
      18602.270   20835.107   100
      38450.848  155422.375   100
     303856.952 1079715.032   100
      18404.281   18992.905   100
        853.751    1719.047   100
       1012.186  116669.839   100
     105423.123  226522.073   100
        732.947     822.748   100
    
    0 讨论(0)
提交回复
热议问题