Question
I have two vectors, each of which includes a series of strings. For example,
V1=c("pen", "document folder", "warn")
V2=c("pens", "copy folder", "warning")
I need to find the best match in V2 for each string in V1. I tried using the Levenshtein distance directly, but it is not good enough. In my case, "pen" and "pens" should mean the same thing, "document folder" and "copy folder" are probably the same thing, and "warn" and "warning" are effectively the same. I am trying to use packages such as tm, but I am not sure which functions are suitable for this. Can anyone point me in the right direction?
Answer 1:
In my experience, the cosine distance from the stringdist package works well for this kind of job:
library(stringdist)
V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")
result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 1))
rownames(result) <- V2
result
                  pen document folder      warn
copy folder 0.6797437       0.2132042 0.8613250
warning     0.6150998       0.7817821 0.1666667
pens        0.1339746       0.6726732 0.7500000
You have to define a cut-off below which a distance counts as close enough; the lower the distance, the better the match. You can also experiment with the q parameter, which sets the length of the character sequences (q-grams) that are compared. For example:
result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 3))
rownames(result) <- V2
result
                  pen document folder      warn
copy folder 1.0000000       0.5377498 1.0000000
warning     1.0000000       1.0000000 0.3675445
pens        0.2928932       1.0000000 1.0000000
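The cut-off idea described above could be applied along the following lines. This is only a sketch: the threshold of 0.25 is an assumed value chosen for illustration, not something from the answer, and it uses stringdistmatrix from the same stringdist package to get the whole distance matrix in one call.

```r
library(stringdist)

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

# Distance matrix: rows follow V1, columns follow V2
d <- stringdistmatrix(V1, V2, method = "cosine", q = 1)

cutoff <- 0.25                  # assumed threshold, tune it for your own data
best <- apply(d, 1, which.min)  # index of the closest V2 entry for each V1 entry
best_dist <- d[cbind(seq_along(V1), best)]

# Keep the closest candidate only when it is within the cut-off, otherwise NA
matches <- data.frame(
  string_to_match = V1,
  closest_match   = ifelse(best_dist <= cutoff, V2[best], NA),
  distance        = best_dist
)
matches
```

With a stricter cut-off, some strings would end up with NA instead of a forced match, which is usually what you want when V2 may not contain a counterpart at all.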
Answer 2:
See the Wikipedia article on Levenshtein distance. It counts how many deletions, substitutions, and insertions are needed to transform one string into another. One approach to fuzzy matching is to minimize this value.
Here's an example. I shuffled the order a bit to make it less boring:
V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")
apply(adist(x = V1, y = V2), 1, which.min)
[1] 3 1 2
The output gives, for each element of V1 in order, the position in V2 of the closest match.
data.frame(string_to_match = V1,
closest_match = V2[apply(adist(x = V1, y = V2), 1, which.min)])
  string_to_match closest_match
1             pen          pens
2 document folder   copy folder
3            warn       warning
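Raw edit distances tend to favour short strings, so it can help to look at the full matrix and, optionally, scale it by string length. The sketch below labels the adist matrix and divides each distance by the length of the longer string in the pair; that normalization is an assumption added here for illustration and is not part of the answer above.

```r
V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

# Raw Levenshtein distances: rows follow V1, columns follow V2
d <- adist(V1, V2)
dimnames(d) <- list(V1, V2)

# Divide each distance by the length of the longer string of the pair,
# so the result lies between 0 (identical) and 1 (nothing in common)
longest <- outer(nchar(V1), nchar(V2), pmax)
d_norm <- d / longest

# Closest V2 entry for each V1 entry, based on the normalized distances
closest <- V2[apply(d_norm, 1, which.min)]
closest
```

Because the normalized values share a common 0 to 1 scale, a single cut-off can be applied to strings of very different lengths.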
Source: https://stackoverflow.com/questions/40299192/fuzzy-matching-two-strings-uring-r