longest common substring in R finding non-contiguous matches between the two strings

后端 未结 4 792
挽巷
挽巷 2020-11-29 07:21

I have a question regarding finding the longest common substring in R. While searching through a few posts on StackOverflow, I got to know about the qualV package. However,

相关标签:
4条回答
  • 2020-11-29 08:00

    I'm not sure what you did to get your output of "hello". Based on trial-and-error experiments below, it appears that the LCS function will (a) not regard a string as an LCS if a character follows what would otherwise be an LCS; (b) find multiple, equally-long LCS's (unlike sub() that finds just the first); (c) the order of the elements in the strings doesn't matter -- which has no illustration below; and (b) the order of the string in the LCS call doesn't matter -- also not shown.

    So, your "hello" of a had no LCS in b since the "hel" of b was followed by a character. Well, that's my current hypothesis.

    Point A above:

    a= c("hello", "hel", "abcd")
    b= c("hello123l5678o", "abcd") 
    print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...
    
    a= c("hello", "hel", "abcd1") # added 1 to abcd
    b= c("hello123l5678o", "abcd") 
    print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it
    
    a= c("hello", "hel", "abcd") 
    b= c("hello1", "abcd") # added 1 to hello
    print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character
    

    Point B above:

    a= c("hello", "hel", "abcd") 
    b= c("hello", "abcd") 
    print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all
    
    0 讨论(0)
  • 2020-11-29 08:15
    df <- data.frame(A. = c("Australia", "Network"),
                     B. = c("Austria", "Netconnect"), stringsAsFactors = FALSE)
    
     auxFun <- function(x) {
    
       a <- strsplit(x[[1]], "")[[1]]
       b  <- strsplit(x[[2]], "")[[1]]
       lastchar <- suppressWarnings(which(!(a == b)))[1] - 1
    
       if(lastchar > 0){
         out <- paste0(a[1:lastchar], collapse = "")
       } else {
         out <- ""
       }
    
       return(out)
     }
    
     df$C. <- apply(df, 1, auxFun)
    
     df
     A.         B.    C.
     1 Australia    Austria Austr
     2   Network Netconnect   Net
    
    0 讨论(0)
  • 2020-11-29 08:16

    Here are three possible solutions.

    library(stringi)
    library(stringdist)
    
    a <- "hello"
    b <- "hel123l5678o"
    
    ## get all forward substrings of 'b'
    sb <- stri_sub(b, 1, 1:nchar(b))
    ## extract them from 'a' if they exist
    sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
    ## match the longest one
    sstr[which.max(nchar(sstr))]
    # [1] "hel"
    

    There are also adist() and agrep() in base R, and the stringdist package has a few functions that run the LCS method. Here's a look at stringsidt. It returns the number of unpaired characters.

    stringdist(a, b, method="lcs")
    # [1] 7
    
    Filter("!", mapply(
        stringdist, 
        stri_sub(b, 1, 1:nchar(b)),
        stri_sub(a, 1, 1:nchar(b)),
        MoreArgs = list(method = "lcs")
    ))
    #  h  he hel 
    #  0   0   0 
    

    Now that I've explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE we get a sequence of Matches, Insertions, etc. So if you give that to stri_locate() we can use that matrix to get the matches from a to b.

    ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
    # [1] "MMMIIIMIIIIM"
    

    So the M values denote straight across matches. We can go and get the substrings with stri_sub()

    stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
    # [1] "hel" "l"   "o" 
    

    Sorry I haven't explained that very well as I'm not well versed in string distance algorithms.

    0 讨论(0)
  • 2020-11-29 08:17

    Leveraging @RichardScriven's insight that adist could be used, but this function combines it all,

    EDIT This was tricky because we needed to get the longest_string in two contexts, so I made this function:

    longest_string <- function(s){return(s[which.max(nchar(s))])}
    

    This combines @RichardSriven's work using the library...

    library(stringi)
    library(stringdist)
    lcsbstr <- function(a,b) { 
      sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
      cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
      longest_cmn_sbstr <- longest_string(cmn_sbstr)
       return(longest_cmn_sbstr) 
    }
    

    We can rewrite it to avoid the use of any external libraries (but still using the adist)...

    lcsbstr_no_lib <- function(a,b) { 
        matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
        lengths<- attr(matches, 'match.length')
        which_longest <- which.max(lengths)
        index_longest <- matches[which_longest]
        length_longest <- lengths[which_longest]
        longest_cmn_sbstr  <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
        return(longest_cmn_sbstr ) 
    }
    

    All of identify only 'hello ' as the longest common substring, instead of 'hello r':

    identical('hello ', 
        lcsbstr_no_lib('hello world', 'hello there'), 
        lcsbstr(       'hello world', 'hello there'))
    

    EDIT And since the edit, works regardless of which argument is the longer of the two:

    identical('hello',
        lcsbstr_no_lib('hello', 'hello there'), 
        lcsbstr(       'hello', 'hello there'),
        lcsbstr_no_lib('hello there', 'hello'), 
        lcsbstr(       'hello there', 'hello'))
    

    LAST EDIT But this is only good if you accept this behavior. Notice this result:

    lcsbstr('hello world', 'hello')
    #[1] 'hell'
    

    I was expecting 'hello', but since the transformation actually moves (via deletion) the world to become the hello, so only the hell part is considered a match according to the M:

    drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
    #[1] "MMMMDDDMDDD"
    #[1]  vvvv   v
    #[1] "hello world"
    

    This behavior is observed using [this Levenstein tool] -- it gives two possible solutions, equivalent to these two transforms; can we tell adist which one we prefer? (the one with the greater number of consecutive M)

    #[1] "MMMMDDDMDDD"
    #[1] "MMMMMDDDDDD"
    

    Finally, don't forget adist allows you to pass in ignore.case = TRUE (FALSE is the default)

    0 讨论(0)
提交回复
热议问题