R: which rows have the longest partial string match between two vectors

纵然是瞬间 submitted on 2019-12-06 17:40:42

Using the tm and slam packages, this is a less naive approach that incorporates text-processing techniques:

## load the requisite libraries
library(tm)
library(slam)

First, create a corpus from the combined towns and water vectors. We will eventually score the similarity between every town and every body of water based on their text.

corpus <- Corpus(VectorSource(c(towns, water)))

Here, I do some standard preprocessing by removing punctuation and stemming the "documents". Stemming reduces words to a common root, so that, for example, "city" and "cities" share the stem "citi". (stemDocument needs the SnowballC package to be installed.)

corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
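
To see what the punctuation step does without tm, here is a rough base R equivalent. strip_punct is a hypothetical helper introduced only for illustration, and stemming is left out because base R has no stemmer.

```r
# Hypothetical helper (not part of tm): drop every punctuation character.
strip_punct <- function(x) gsub("[[:punct:]]+", "", x)

strip_punct("Alturas city, Modoc County")
# [1] "Alturas city Modoc County"
```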

A standard term-document matrix records how often each term appears in each document. We also want to encode how common each term is across the entire corpus, so that ubiquitous terms count for less. For example, we don't care how often "the" appears in a document because it is incredibly common.

tdm <- weightTfIdf(TermDocumentMatrix(corpus))
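
The weighting idea can be sketched in base R on a toy count matrix. The counts below are made up, and this follows the general tf-idf recipe (term frequency times log2 of inverse document frequency); the exact normalisation that weightTfIdf applies is not shown in this answer and may differ.

```r
# Toy term counts (made-up data): rows are terms, columns are documents.
counts <- matrix(c(1, 1, 1,
                   2, 0, 0,
                   0, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("city", "alturas", "water"),
                                 c("d1", "d2", "d3")))

# Term frequency: each count divided by the total count of its document.
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: terms found in fewer documents score higher.
idf <- log2(ncol(counts) / rowSums(counts > 0))

# tf-idf: scale each term's row by its idf (idf recycles down the rows).
tfidf <- tf * idf
tfidf
```

Here "city" occurs in every document, so its idf is log2(3/3) = 0 and its whole row becomes zero, which is the effect described above for very common words.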

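Lastly, we calculate the cosine similarity between every pair of documents (the function below is named cosine_dist, but it returns a similarity: 1 means identical direction, 0 means no shared terms). The tm package creates sparse matrices, which are usually very memory efficient. The slam package provides matrix math functions for sparse matrices.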

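cosine_dist <- function(tdm) {
  # squared Euclidean length of each document (column) vector
  norms_sq <- col_sums(tdm^2)
  # pairwise dot products divided by the product of the two vector lengths
  crossprod_simple_triplet_matrix(tdm) / sqrt(norms_sq %*% t(norms_sq))
}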

d <- cosine_dist(tdm)
> d
    Docs
Docs          1           2           3           4          5         6           7           8
   1 1.00000000 0.034622992 0.038063800 0.044272011 0.00000000 0.0000000 0.000000000 0.260626250
   2 0.03462299 1.000000000 0.055616255 0.064687275 0.01751883 0.0000000 0.146145917 0.006994714
   3 0.03806380 0.055616255 1.000000000 0.071115850 0.01925984 0.0000000 0.006633427 0.007689843
   4 0.04427201 0.064687275 0.071115850 1.000000000 0.54258275 0.0000000 0.007715340 0.008944058
   5 0.00000000 0.017518827 0.019259836 0.542582752 1.00000000 0.0000000 0.014219656 0.016484228
   6 0.00000000 0.000000000 0.000000000 0.000000000 0.00000000 1.0000000 0.121137618 0.000000000
   7 0.00000000 0.146145917 0.006633427 0.007715340 0.01421966 0.1211376 1.000000000 0.005677459
   8 0.26062625 0.006994714 0.007689843 0.008944058 0.01648423 0.0000000 0.005677459 1.000000000
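
The same calculation can be checked on a small dense matrix in base R. This is only a sketch: cosine_sim_dense is a hypothetical name, the input is made up, and a column of all zeros would produce NaN because of the division by a zero length.

```r
# Dense stand-in for the sparse computation; columns are documents.
cosine_sim_dense <- function(m) {
  norms <- sqrt(colSums(m^2))          # Euclidean length of each column
  crossprod(m) / outer(norms, norms)   # dot products scaled by both lengths
}

# Made-up data: a and b point the same way, c shares no terms with them.
m <- cbind(a = c(1, 0, 1), b = c(2, 0, 2), c = c(0, 3, 0))
round(cosine_sim_dense(m), 3)
```

Columns a and b score 1 against each other, while c scores 0 against both, and the diagonal is always 1.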

Now we have similarity scores for all of the towns and water bodies in a single matrix. We only care about one block of it, though: the rows for the water bodies (documents 5 to 8) against the columns for the towns (documents 1 to 4). Hence the indexing in the apply call below:

best.match <- apply(d[5:8,1:4], 1, function(row) if(all(row == 0)) NA else which.max(row))

And here's the output:

> cbind(water, towns[best.match])
     water                                                                                       
[1,] "Alturas City of"                                  "Alturas city, Modoc County"             
[2,] "Casitas Municipal Water District"                 NA                                       
[3,] "California Water Service Company Bellflower City" "Bellflower city, Los Angeles County"    
[4,] "Contra Costa City of Public Works"                "Acalanes Ridge CDP, Contra Costa County"

Notice the NA value. NA is returned when a body of water does not share a single word with any of the towns.
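
The block indexing and the NA guard can be tried on a made-up similarity matrix without tm or slam. The values and the split into two towns and two water bodies are assumptions for illustration only.

```r
# Made-up similarity matrix: documents 1-2 are towns, 3-4 are water bodies.
sim <- matrix(c(1.0, 0.1, 0.6, 0.0,
                0.1, 1.0, 0.2, 0.0,
                0.6, 0.2, 1.0, 0.0,
                0.0, 0.0, 0.0, 1.0),
              nrow = 4, byrow = TRUE)

# Keep only the water rows and the town columns.
block <- sim[3:4, 1:2, drop = FALSE]

# Index of the most similar town per water body, or NA if nothing overlaps.
best <- apply(block, 1, function(row) if (all(row == 0)) NA else which.max(row))
best
# [1]  1 NA
```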

Another possible way to do it uses just base R. We split each string in water into words with strsplit, which creates a list, and we check which of those words are found in towns using grepl. This gives a list of 4 logical matrices, one per water name, with one row per town and one column per word. Applying rowSums gives the number of 'TRUE' values in each row, and which.max identifies the row (town) with the most matches. Finally, we use those row numbers to index towns.

lst <- lapply(strsplit(water, ' '), function(i)
                       sapply(tolower(i), function(j)
                                 grepl(j, tolower(towns))))

ind <- unlist(as.numeric(lapply(lst, function(i)
                   which.max(rowSums(i)[!is.na(match(TRUE, i))]))))

cbind(water, towns[ind])
#            water                                                                                       
#[1,] "Alturas City of"                                  "Alturas city, Modoc County"             
#[2,] "Casitas Municipal Water District"                 NA                                       
#[3,] "California Water Service Company Bellflower City" "Bellflower city, Los Angeles County"    
#[4,] "Contra Costa City of Public Works"                "Acalanes Ridge CDP, Contra Costa County"
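
For a self-contained run, here is a sketch of the same idea on made-up inputs. towns_demo and water_demo are hypothetical, fixed = TRUE is my own choice so that words are matched literally rather than as regular expressions, and the NA guard is written with any() instead of the match() trick.

```r
# Made-up inputs, for illustration only.
towns_demo <- c("Alpha city, First County", "Beta town, Second County")
water_demo <- c("Beta Water District", "Gamma Utilities")

# One logical matrix per water name: rows are towns, columns are its words.
hits <- lapply(strsplit(water_demo, " "), function(words)
  sapply(tolower(words), function(w)
    grepl(w, tolower(towns_demo), fixed = TRUE)))

hits[[1]]

# Pick the town with the most matching words, or NA when nothing matches.
ind_demo <- sapply(hits, function(m)
  if (any(m)) which.max(rowSums(m)) else NA_integer_)

cbind(water_demo, towns_demo[ind_demo])
```

The first water name matches the second town through the word "beta", and the second name matches nothing, so it gets NA.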

Side note: I used [!is.na(match(TRUE, i))] so that the row sums are kept only when the matrix contains at least one 'TRUE'. Otherwise the rowSums of a 4 x 4 logical matrix with all 'FALSE' is 0, 0, 0, 0, and which.max(c(0, 0, 0, 0)) returns 1, because which.max reports the first of the tied maxima. That would wrongly pair the water body with the first town instead of returning NA.
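
The behaviour this relies on can be checked directly at the console. The list passed to as.numeric is a made-up stand-in for the result of the lapply call.

```r
# which.max returns the first index of the maximum, even when all are 0.
tied <- which.max(c(0, 0, 0, 0))
tied
# [1] 1

# On empty input it returns a zero-length result instead.
empty <- which.max(numeric(0))
length(empty)
# [1] 0

# as.numeric() on a list turns a zero-length element into NA.
filled <- as.numeric(list(1L, integer(0), 3L))
filled
```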
