I\'ve got 2 vectors:
word1 <- \"bestelling\"
word2 <- \"bestelbon\"
Now I want to find the largest common substring that starts at the
A bit of regex can do this:
sub('^([^|]*)[^|]*(?:\\|\\1[^|]*)$', '\\1', paste0(word1, '|', word2))
#[1] "bestel"
I used |
as a separator - pick one that makes sense for your strings.
flodel <- function(word1, word2) {
# the length of the shorter word
n <- min(nchar(word1), nchar(word2))
# two vectors of characters of the same length n
c1 <- strsplit(word1, "", fixed = TRUE)[[1]][1:n]
c2 <- strsplit(word2, "", fixed = TRUE)[[1]][1:n]
# a vector that is TRUE as long as the characters match
m <- as.logical(cumprod(c1 == c2))
# the answer
paste(c1[m], collapse = "")
}
I realize I'm coming late to this party but determining pairwise alignment is a fundamental problem in biological research and there is already a package (or a package-family) that attacks this problem. The Bioconductor package named Biostrings is available (and it is big at least if you install all the default dependencies, so patience is needed in the install process). It returns S4 objects so different extraction functions are needed. This is perhaps a sledgehammer to extract a nut, but here's the code to give the desired result:
install.packages("Biostrings", repo="http://www.bioconductor.org/packages/2.14/bioc/", dependencies=TRUE)
library(Biostrings)
psa1 <- pairwiseAlignment(pattern = c(word1) ,word2,type="local")
psa1@pattern
#[1] bestel
However, it is not set up to default to restriction of the match to alignment at the first character for both strings. We can hope @MartinMorgan will come along a fix my errors.
As much as I generally avoid a for loop in R - given you're starting at the beginning and continuing until you find the solution it seemed an easy approach.
It's a bit more intuitive than some of the other examples I think
lcsB <- function(string1, string2) {
x <- ''
for (i in 1:nchar(string1)){
if (substr(string1[1],1,i) == substr(string2[1],1,i)) {
x <- substr(string1[1],1,i)
}
else
return(x)
}
return(x)
}
lcsB("bestelling", "bestelbon")
lcsB("bestelling", "stel")
This seems to work
longestprefix<-function(a,b) {
n <- pmin(nchar(a), nchar(b))
mapply(function(x, y, n) {
rr<-rle(x[1:n]==y[1:n])
if(rr$values[1]) {
paste(x[1:rr$lengths[1]], collapse="")
} else {
""
}
}, strsplit(a, ""), strsplit(b,""), n)
}
longestprefix("bestelling", "bestelbon")
# [1] "bestel"
longestprefix("bestelling", "stel")
# [1] ""
Matthew Plourde called, and Mr. Benchmarker responds!
Sorry, BondedDust, but I can't get to bioconductor from behind workplace walls.
library(microbenchmark)
wfoo1 <-'bestelling'
wfoo2<-'bestelbon'
microbenchmark(stu(wfoo1,wfoo2),nathan(wfoo1,wfoo2),plourde(),scriven(wfoo1,wfoo2),dmt(wfoo1,wfoo2),mrflick(wfoo1,wfoo2),roland(c(wfoo1,wfoo2)))
Unit: microseconds
expr min lq median uq
stu(wfoo1, wfoo2) 171.905 183.0230 187.5135 191.1490
nathan(wfoo1, wfoo2) 35.921 42.3360 43.6180 46.1840
plourde() 551.208 581.3545 591.6175 602.5220
scriven(wfoo1, wfoo2) 16.678 21.1680 22.6645 23.7335
dmt(wfoo1, wfoo2) 79.966 86.1665 88.7325 91.5125
mrflick(wfoo1, wfoo2) 100.492 108.4030 111.1830 113.9625
roland(c(wfoo1, wfoo2)) 215.950 226.8545 231.7725 237.5455
max neval
435.321 100
59.012 100
730.809 100
85.525 100
286.081 100
466.537 100
291.213 100
I think it's incumbent on me to modify these functions so they measure an input word against, say, a vector of 1000 reference words (rather than just a single pair) to see how that speed test goes. Maybe later.
Later... :-). I didn't make loops,but I tried it out on long words:
EDIT: this was, as flodel points out, a typo, which led to testing a rather long vector of very short words!
wfoo1 <-rep(letters,100)
wfoo2<-c(rep(letters,99),'foo')
Unit: microseconds
expr min lq median
stu(wfoo1, wfoo2) 31215.243 32718.5535 35270.6110
nathan(wfoo1, wfoo2) 202.266 216.3780 227.2825
plourde() 569.168 617.0615 661.5340
scriven(wfoo1, wfoo2) 794.953 828.3070 847.5505
dmt(wfoo1, wfoo2) 1081.033 1156.9365 1205.8990
mrflick(wfoo1, wfoo2) 126058.316 131283.4485 241018.5150
roland(c(wfoo1, wfoo2)) 946.759 1004.4885 1045.3260
uq max neval
146451.2595 167000.713 100
236.0485 356.211 100
694.6750 795.381 100
868.9310 1021.594 100
1307.6740 116075.442 100
246739.6910 991550.586 100
1082.1020 1243.103 100
Sorry Richard, but looks like you need to give your chicken dinner to Nathan.
EDIT2: made sure the inputs were single words, and added flodel's code to the pile.
wfoo1 <-paste(rep(letters,100),collapse='')
wfoo2<-paste(c(rep(letters,99),'foo'),collapse='')
Looks like 3 folks' code perform similarly, so just as in Tour de France, I give the first-place award to mrflick, dmt, and flodel.
microbenchmark(stu(wfoo1,wfoo2),nathan(wfoo1,wfoo2),plourde(c(wfoo1,wfoo2)),scriven(wfoo1,wfoo2),dmt(wfoo1,wfoo2),mrflick(wfoo1,wfoo2),roland(c(wfoo1,wfoo2)),flodel(wfoo1,wfoo2) )
Unit: microseconds
expr min lq median
stu(wfoo1, wfoo2) 17786.578 18243.2795 18420.317
nathan(wfoo1, wfoo2) 36651.195 37703.3625 38095.493
plourde(c(wfoo1, wfoo2)) 183616.029 187673.5350 190706.457
scriven(wfoo1, wfoo2) 17546.253 17994.1890 18244.990
dmt(wfoo1, wfoo2) 737.651 781.0550 821.466
mrflick(wfoo1, wfoo2) 870.643 951.4630 976.479
roland(c(wfoo1, wfoo2)) 99540.947 102644.2115 103654.258
flodel(wfoo1, wfoo2) 666.239 705.5795 717.553
uq max neval
18602.270 20835.107 100
38450.848 155422.375 100
303856.952 1079715.032 100
18404.281 18992.905 100
853.751 1719.047 100
1012.186 116669.839 100
105423.123 226522.073 100
732.947 822.748 100