Find common substrings between two character variables

Submitted by ☆樱花仙子☆ on 2020-08-13 03:10:14

Question


I have two character vectors (names of objects) and I want to extract the longest common substring from each pair of corresponding elements.

a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')

I want the following as a result:

[1] "ABC" "DEF"

These vectors as input should give the same result:

a <- c('textABCxx', 'textDEFxx')
b <- c('zzABCblah', 'zzDEFblah')

These examples are representative. The strings contain identifying elements, and the remainder of the text in each vector element is common, but unknown.

Is there a solution in one of the following places (in order of preference)?

  1. Base R

  2. Recommended Packages

  3. Packages available on CRAN

The answer to the supposed-duplicate does not fulfill these requirements.


Answer 1:


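Here's a CRAN package for that, qualV. Note that its LCS() computes the longest common subsequence, which need not be contiguous; for the examples in the question it coincides with the longest common substring: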

library(qualV)

sapply(seq_along(a), function(i)
    paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
          collapse = ""))
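
Since the question prefers base R, here is a minimal sketch of a contiguous longest-common-substring search using a row-by-row dynamic programming table. The helper name lc_substring is hypothetical (it comes from no package), and when several substrings tie for the longest, this sketch returns only the first one found in the first string.

```r
# lc_substring is a hypothetical helper, not part of base R or any package.
# It returns the longest run of consecutive characters shared by x and y.
lc_substring <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  if (length(xs) == 0 || length(ys) == 0) return("")
  # prev[j + 1] = length of the common run ending at xs[i - 1] and ys[j]
  prev <- integer(length(ys) + 1)
  best_len <- 0L
  best_end <- 0L
  for (i in seq_along(xs)) {
    cur <- integer(length(ys) + 1)
    hits <- which(ys == xs[i])          # positions in ys matching xs[i]
    cur[hits + 1] <- prev[hits] + 1L    # extend the run ending one step back
    m <- max(cur)
    if (m > best_len) {                 # strict '>' keeps the first longest run
      best_len <- m
      best_end <- i
    }
    prev <- cur
  }
  if (best_len == 0L) return("")
  paste(xs[(best_end - best_len + 1):best_end], collapse = "")
}

a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')
res <- mapply(lc_substring, a, b, USE.NAMES = FALSE)
print(res)
```

Unlike a subsequence search, this only counts characters that are adjacent in both strings, so lc_substring("abXcd", "abYcd") gives "ab" rather than "abcd".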



Answer 2:


If you don't mind using Bioconductor packages, you can use Rlibstree. The installation is pretty straightforward.

source("http://bioconductor.org/biocLite.R")
biocLite("Rlibstree") 

Then, you can do:

require(Rlibstree)
ll <- list(a,b)
lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), 
           function(x) getLongestCommonSubstring(x))

# $X1
# [1] "ABC"

# $X2
# [1] "DEF"

On a side note: I'm not sure whether Rlibstree uses libstree 0.42 or libstree 0.43. Both libraries are present in the source package. I remember running into a memory leak (and hence an error) on a huge array in Perl code that was using libstree 0.42. Just a heads up.




Answer 3:


Because I have too many things I don't want to do, I did this instead:

Rgames> for(jj in 1:100) {
+ str2<-sample(letters,100,rep=TRUE)
+ str1<-sample(letters,100,rep=TRUE)
+ longs[jj]<-length(lcstring(str1,str2)[[1]])
+ }
Rgames> table(longs)
longs
 2  3  4 
59 39  2

Anyone care to do a statistical estimate of the actual distribution of matching strings? (lcstring is just a brute-force, home-rolled function; its output contains all matches of maximal length, which is why I only look at the first list element.)
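
The lcstring function is not shown in the answer, so the following is only a guess at what such a brute-force function might look like, under the assumption that it takes two vectors of single characters and returns a list of all longest common runs. The name lcs_runs, the seed, and the guard for an empty result are all assumptions added for illustration.

```r
# lcs_runs is a hypothetical stand-in for the unshown lcstring():
# it returns a list of every distinct longest run of consecutive
# elements that occurs in both x and y.
lcs_runs <- function(x, y) {
  best <- list()
  best_len <- 0L
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      # extend the match starting at x[i], y[j] as far as it goes
      k <- 0L
      while (i + k <= length(x) && j + k <= length(y) && x[i + k] == y[j + k]) {
        k <- k + 1L
      }
      if (k > best_len) {
        best_len <- k
        best <- list(x[i:(i + k - 1L)])
      } else if (k == best_len && k > 0L) {
        best <- c(best, list(x[i:(i + k - 1L)]))
      }
    }
  }
  unique(best)
}

set.seed(1)  # arbitrary seed, only so the run is repeatable
longs <- integer(100)
for (jj in 1:100) {
  str1 <- sample(letters, 100, replace = TRUE)
  str2 <- sample(letters, 100, replace = TRUE)
  runs <- lcs_runs(str1, str2)
  # length of the first longest run, or 0 if nothing is shared
  longs[jj] <- if (length(runs) > 0) length(runs[[1]]) else 0L
}
print(table(longs))
```

The triple loop is cubic in the worst case, which is fine for 100-element vectors but would be slow on long inputs.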



Source: https://stackoverflow.com/questions/16196327/find-common-substrings-between-two-character-variables
