Question
I am trying to do some simple direct record linkage with library('RecordLinkage').
I have only one vector:
tv3 = c("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDE FRANCE",
"TOURDE FRANZ", "GET FRESH")
The function that I need is compare.dedup from library('RecordLinkage'), and I get:
compare.dedup(as.data.frame(tv3))$pairs
$pairs
id1 id2 tv3 is_match
1 1 2 1 NA
2 1 3 0 NA
3 1 4 0 NA
4 1 5 0 NA
5 2 3 0 NA
....
I have trouble finding documentation for the next step. How do I then compare the records and find the similar pairs?
I found the string comparator jarowinkler(), but it only returns the similarity of the pairs it is given. Basically, you can only call jarowinkler(tv3[1], tv3) for one element at a time.
So I am asking: do you need to write your own loop to get the result, or is there a more direct way starting from the compare.dedup function?
n <- length(tv3)
mat <- matrix(0, n, n)
for (j in seq_len(n)) {
  for (i in seq_len(n)) {
    mat[i, j] <- jarowinkler(tv3[j], tv3[i])
  }
}
The resulting matrix (these are Jaro-Winkler similarities, where 1 means identical):
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 1.0000000 0.9846154 0.9333333 0.5240741
[2,] 1.0000000 1.0000000 0.9846154 0.9333333 0.5240741
[3,] 0.9846154 0.9846154 1.0000000 0.9525641 0.5133903
[4,] 0.9333333 0.9333333 0.9525641 1.0000000 0.5240741
[5,] 0.5240741 0.5240741 0.5133903 0.5240741 1.0000000
What I want to do is simply assign to the similar objects ("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDE FRANCE", "TOURDE FRANZ") one of their possible names.
How could I set a cut-off, say 0.90, on my similarity matrix and then retrieve all the rows of the similar objects?
If my data are in a data frame:
tv3
1 TOURDEFRANCE
2 TOURDEFRANCE
3 TOURDE FRANCE
4 TOURDE FRANZ
5 GET FRESH
Could I use something like which with a cut-off > 0.90 and retrieve the corresponding rows?
Any help with this simple record linkage is very welcome!
Answer 1:
Taken from this post, here's an example that should work for you:
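library(RecordLinkage)
library(magrittr)  # provides the %>% pipe
tv3 = as.data.frame(c("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDE FRANCE",
"TOURDE FRANZ", "GET FRESH"))
colnames(tv3) <- "name"
# build string-compared pairs, weight them, classify at 0.5, keep the links
tv3 %>% compare.dedup(strcmp = TRUE) %>%
epiWeights() %>%
epiClassify(0.5) %>%
getPairs(show = "links", single.rows = TRUE) -> matches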
As a result, the matches data frame should help you determine a suitable threshold (set in epiClassify()).
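For the cut-off approach described in the question, here is a minimal base R sketch. It assumes the similarity matrix printed in the question (copied in by hand so the example runs without RecordLinkage) and the 0.90 cut-off. The grouping step uses single-linkage clustering via hclust() and cutree(), which is my own choice rather than anything from the package, and the canonical column name is made up for illustration.

```r
# Names and Jaro-Winkler similarities copied from the question's output.
tv3 <- data.frame(
  name = c("TOURDEFRANCE", "TOURDEFRANCE", "TOURDE FRANCE",
           "TOURDE FRANZ", "GET FRESH"),
  stringsAsFactors = FALSE
)
mat <- matrix(c(
  1.0000000, 1.0000000, 0.9846154, 0.9333333, 0.5240741,
  1.0000000, 1.0000000, 0.9846154, 0.9333333, 0.5240741,
  0.9846154, 0.9846154, 1.0000000, 0.9525641, 0.5133903,
  0.9333333, 0.9333333, 0.9525641, 1.0000000, 0.5240741,
  0.5240741, 0.5240741, 0.5133903, 0.5240741, 1.0000000
), nrow = 5, byrow = TRUE)

cutoff <- 0.90  # threshold from the question

# Row/column indices of the unordered pairs (i < j) above the cut-off.
pairs <- which(mat > cutoff & upper.tri(mat), arr.ind = TRUE)

# Group rows by single-linkage clustering on the distance 1 - similarity,
# cutting the tree at 1 - cutoff (assumption: chained matches share a group).
groups <- cutree(hclust(as.dist(1 - mat), method = "single"), h = 1 - cutoff)

# Label every row with the first name seen in its group (arbitrary choice).
tv3$canonical <- tv3$name[match(groups, groups)]
```

Here tv3[groups == groups[1], ] retrieves all rows similar to the first one. Single linkage puts two rows in the same group as soon as a chain of above-threshold pairs connects them, so a stricter method such as "complete" may suit data where that chaining is undesirable.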
Source: https://stackoverflow.com/questions/32959257/r-simple-record-linkage-the-next-step