R: find largest common substring starting at the beginning

星月不相逢 2021-02-19 18:33

I've got 2 vectors:

word1 <- "bestelling"
word2 <- "bestelbon"

Now I want to find the largest common substring that starts at the

11 answers
  • 2021-02-19 18:56

    Why not add another, and hack at it so the answer is different from everyone else's!

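    largestStartSubstr <- function(word1, word2) {
        word1vec <- unlist(strsplit(word1, "", fixed = TRUE))
        word2vec <- unlist(strsplit(word2, "", fixed = TRUE))
        # positions that exist in both words
        indexes <- intersect(1:nchar(word1), 1:nchar(word2))
        bools <- word1vec[indexes] == word2vec[indexes]
        if (bools[1] == FALSE) {
            # the first characters already differ
            ""
        } else {
            # position of the first mismatch, minus one
            lastChar <- match(1, c(0, diff(cumsum(!bools)))) - 1
            if (is.na(lastChar)) {
                # no mismatch: the shorter word is the whole common prefix
                lastChar <- indexes[length(indexes)]
            }
            substr(word1, 1, lastChar)
        }
    }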
    
    word1 <- "bestselling"
    word2<- "bestsel"
    
    largestStartSubstr(word1, word2)
    #[1] "bestsel"
    
    word1 <- "bestselling"
    word2<- "sel"
    
    largestStartSubstr(word1, word2)
    #[1] ""
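
For readers puzzling over the match(1, c(0, diff(cumsum(!bools)))) line: it appears to locate the first position where the two words disagree. Below is a small sketch on a made-up logical vector (the variable names are illustrative, not from the answer), next to which(), which should give the same index more directly.

```r
# Made-up comparison result: characters 1-3 match, character 4 does not
bools <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

# cumsum(!bools) counts the mismatches seen so far: 0 0 0 1 1 2
# diff() is 1 wherever that count increases; the leading 0 restores alignment
first_mismatch_a <- match(1, c(0, diff(cumsum(!bools))))

# More direct formulation: index of the first FALSE
first_mismatch_b <- which(!bools)[1]

# Both are NA when there is no mismatch at all
no_mismatch <- which(!c(TRUE, TRUE))[1]
```

Because the prepended 0 hides a mismatch in position 1, the function has to test bools[1] separately before using this expression.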
    
  • 2021-02-19 18:57
    fun <- function(words) {
      #extract substrings from length 1 to length of shortest word
      subs <- sapply(seq_len(min(nchar(words))), 
                     function(x, words) substring(words, 1, x), 
                     words=words)
      #max length for which substrings are equal
      neqal <- max(cumsum(apply(subs, 2, function(x) length(unique(x)) == 1L)))
      #return substring
      substring(words[1], 1, neqal)
    }
    
    words1 <- c("bestelling", "bestelbon")
    fun(words1)
    #[1] "bestel"
    
    words2 <- c("bestelling", "stel")
    fun(words2)
    #[1] ""
    
  • 2021-02-19 19:03

    A little messy, but it's what I came up with:

    largest_subset <- Vectorize(function(word1,word2) {
        substr(word1, 1, sum(substring(word1, 1, 1:nchar(word1))==substring(word2, 1, 1:nchar(word2))))
    })
    

    If the words differ in length, R may emit a recycling warning, but the result is still correct: a recycled prefix is always shorter than the prefix it is compared with, so it can never produce a false match. The function compares the prefix of each length from the first word with the prefix of the same length from the second word, counts how many comparisons are TRUE, and returns the substring of the first word up to that count. It is wrapped in Vectorize so it can be applied to vectors of words.

    > word1 <- c("tester","doesitwork","yupyppp","blanks")
    > word2 <- c("testover","doesit","yupsuredoes","")
    > largest_subset(word1,word2)
        tester doesitwork    yupyppp     blanks 
        "test"   "doesit"      "yup"         "" 
    
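
The recycling warning can be avoided by comparing prefixes only up to the length of the shorter word. A minimal sketch of that idea, assuming two single strings as input; the name common_prefix is hypothetical, and unlike the answer's function it is not vectorized:

```r
# Hypothetical helper, not part of the original answer
common_prefix <- function(word1, word2) {
  n <- min(nchar(word1), nchar(word2))
  # an empty word shares no prefix with anything
  if (n == 0) return("")
  lens <- seq_len(n)
  # compare the prefixes of length 1..n of both words
  same <- substring(word1, 1, lens) == substring(word2, 1, lens)
  # prefixes match up to the first difference, so the count is the length
  substr(word1, 1, sum(same))
}

common_prefix("tester", "testover")
common_prefix("blanks", "")
```

Wrapping it in Vectorize, as the answer does, should make it work element-wise on word vectors.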
  • 2021-02-19 19:07

    This will work for an arbitrary vector of words:

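    words <- c('bestelling', 'bestelbon')
    words.split <- strsplit(words, '')
    # pad every character vector with NA up to the length of the longest word
    words.split <- lapply(words.split, `length<-`, max(nchar(words)))
    words.mat <- do.call(rbind, words.split)
    # TRUE for each character position where the words disagree
    differs <- apply(words.mat, 2, function(col) length(unique(col)) != 1)
    # first differing position minus 1; if nothing differs, use the full length
    common.substr.length <- match(TRUE, differs, nomatch = length(differs) + 1) - 1
    substr(words[1], 1, common.substr.length)
    # [1] "bestel"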
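
As a sketch, the same column-wise comparison can be wrapped in a function so it can be reused on any character vector. The name common_prefix_all is hypothetical, and truncating to the shortest word instead of padding with NA is a design choice of this sketch, not something the answer does:

```r
# Hypothetical wrapper, not part of the original answer
common_prefix_all <- function(words) {
  n <- min(nchar(words))
  # an empty word means there is no common prefix
  if (n == 0) return("")
  chars <- strsplit(words, "")
  # one row per word, truncated to the length of the shortest word
  mat <- do.call(rbind, lapply(chars, `[`, seq_len(n)))
  # TRUE for each column where the words disagree
  differs <- apply(mat, 2, function(col) length(unique(col)) != 1)
  # first differing column minus 1; n if no column differs
  k <- match(TRUE, differs, nomatch = n + 1) - 1
  substr(words[1], 1, k)
}

common_prefix_all(c("bestelling", "bestelbon", "bestek"))
```

Using match() with nomatch instead of which.max() matters when all words are identical: which.max() on an all-FALSE vector returns 1, which would give an empty result.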
    
  • 2021-02-19 19:11

    Here's another function that seems to work.

    foo <- function(word1, word2) {
        s1 <- substring(word1, 1, 1:nchar(word1))
        s2 <- substring(word2, 1, 1:nchar(word2))
        if(length(w <- which(s1 %in% s2))) s2[max(w)] else character(1)
    }
    
    foo("bestelling", "bestelbon")
    # [1] "bestel"
    foo("bestelling", "stel")
    # [1] ""
    foo("bestelbon", "bestieboop")
    # [1] "best"
    foo("stel", "steal")
    # [1] "ste"
    