r which rows have longest partial string match between two vectors

I have two vectors that contain the names of towns, both of which are in different formats, and I need to match the names of water districts (water) to their respective census data (towns). Essentially for each row in water, I need to know the best match in towns, since most of them contain similar words such as city. One other problem I see is that words are capitalized in one data set and are not capitalized in another. Here is my example data:

towns= c("Acalanes Ridge CDP, Contra Costa County", "Bellflower city, Los Angeles County", "Arvin city, Kern County", "Alturas city, Modoc County")

water=c("Alturas City of","Casitas Municipal Water District","California Water Service Company Bellflower City", "Contra Costa City of Public Works")

Using the tm and slam packages, this is a less naive approach that incorporates text-processing techniques:

## load the requisite libraries
library(tm)
library(slam)

First, create a corpus from the combined towns and water vectors. We are eventually going to calculate the distance between every town and every body of water based on the text.

corpus <- Corpus(VectorSource((c(towns, water))))

Here, I do some standard preprocessing by removing punctuation and stemming the "documents". Stemming finds the common underlying parts of words. For example, city and cities have the same stem: citi

corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)

A standard Term Document Matrix has binary indicators for which words are in which documents. We want to encode additional information about how frequent the word is in the entire corpus as well. For example, we don't care how often "the" appears in a document because it is incredibly common.

tdm <- weightTfIdf(TermDocumentMatrix(corpus))

Lastly, we calculate the cosine distance between every document. The tm package creates sparse matrices which are usually very memory efficient. The slam package has matrix math functions for sparse matrices.

cosine_dist <- function(tdm) {
  crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
}

d <- cosine_dist(tdm)
> d
    Docs
Docs          1           2           3           4          5         6           7           8
   1 1.00000000 0.034622992 0.038063800 0.044272011 0.00000000 0.0000000 0.000000000 0.260626250
   2 0.03462299 1.000000000 0.055616255 0.064687275 0.01751883 0.0000000 0.146145917 0.006994714
   3 0.03806380 0.055616255 1.000000000 0.071115850 0.01925984 0.0000000 0.006633427 0.007689843
   4 0.04427201 0.064687275 0.071115850 1.000000000 0.54258275 0.0000000 0.007715340 0.008944058
   5 0.00000000 0.017518827 0.019259836 0.542582752 1.00000000 0.0000000 0.014219656 0.016484228
   6 0.00000000 0.000000000 0.000000000 0.000000000 0.00000000 1.0000000 0.121137618 0.000000000
   7 0.00000000 0.146145917 0.006633427 0.007715340 0.01421966 0.1211376 1.000000000 0.005677459
   8 0.26062625 0.006994714 0.007689843 0.008944058 0.01648423 0.0000000 0.005677459 1.000000000

Now we have a matrix of similarity scores between all of the towns and water bodies in the same matrix. We only care about the distances for half of this matrix, though. Hence the indexing notation in the apply function below:

best.match <- apply(d[5:8,1:4], 1, function(row) if(all(row == 0)) NA else which.max(row))

And here's the output:

> cbind(water, towns[best.match])
     water                                                                                       
[1,] "Alturas City of"                                  "Alturas city, Modoc County"             
[2,] "Casitas Municipal Water District"                 NA                                       
[3,] "California Water Service Company Bellflower City" "Bellflower city, Los Angeles County"    
[4,] "Contra Costa City of Public Works"                "Acalanes Ridge CDP, Contra Costa County"

Notice the NA value. NA is returned when there isn't a single word match between a body of water and all of the towns.

Another possible way to do it using just base R. We split the strings from water using strsplit thus creating a list, and we check to see which of those strings are found in towns using grepl. We now have a list of 4 logical matrices. By applying rowSums, we get the sum of 'TRUE' for each row. We use which.max to identify the row with most 'TRUE' values. Finally, we use those values for indexing towns.

lst <- lapply(strsplit(water, ' '), function(i)
                       sapply(tolower(i), function(j)
                                 grepl(j, tolower(towns))))

ind <- unlist(as.numeric(lapply(lst, function(i)
                   which.max(rowSums(i)[!is.na(match(TRUE, i))]))))

cbind(water, towns[ind])
#            water                                                                                       
#[1,] "Alturas City of"                                  "Alturas city, Modoc County"             
#[2,] "Casitas Municipal Water District"                 NA                                       
#[3,] "California Water Service Company Bellflower City" "Bellflower city, Los Angeles County"    
#[4,] "Contra Costa City of Public Works"                "Acalanes Ridge CDP, Contra Costa County"

Side Note: I used [!is.na(match(TRUE, i))] to only calculate the rowSums when there are indeed 'TRUE' values in the matrix. Otherwise the rowSums of a 4 x 4 logical matrix with all 'FALSE' is 0, 0, 0, 0, and taking which.max(c(0, 0, 0, 0)) gives 1, which is quite interesting.

来源：https://stackoverflow.com/questions/36921346/r-which-rows-have-longest-partial-string-match-between-two-vectors

标签

longest-substring