Question
I have a data frame with names, surnames, birthdays and some random variables. Let's say it looks like this:
BIRTH NAME SURNAME random_value
1 1 Luke Skywalker 1
2 1 Luke Skywalker 2
4 2 Leia Organa 3
5 3 Han Solo 7
7 1 Ben Solo 1
8 5 Lando Calrissian 3
9 3 Han Solo 4
10 3 Ham Solo 4
11 1 Luke Wkywalker 9
How can I figure out whether there is a typo in a name or surname, based on BIRTH, NAME and SURNAME, and then replace the typo with the correct name or surname?
For example, there are two Han Solo records with birthday 3, and there is also a Ham Solo with the same birthdate. I would like the algorithm to figure out that Ham is wrong and replace it with Han.
If two different spellings occur equally often for the same BIRTH, it doesn't really matter which one is chosen, as long as every NAME or SURNAME in that group ends up the same (so always Ham or always Han, but not mixed for the same BIRTH).
So the end result would be this:
BIRTH NAME SURNAME random_value
1 1 Luke Skywalker 1
2 1 Luke Skywalker 2
4 2 Leia Organa 3
5 3 Han Solo 7
7 1 Ben Solo 1
8 5 Lando Calrissian 3
11 3 Han Solo 4
12 3 Han Solo 4
13 1 Luke Skywalker 9
Is there an automated way to do this? My data set is large (more than 3 million rows), so checking it manually would be impossible.
I imagine we would look at all the names and surnames with the same birth and check whether there are isolated outliers that differ by only one letter or have two letters swapped (Luke vs Lkue). When we find such an outlier, we replace it.
Answer 1:
Here is one way to find the typos. First, define the data frame you mention in the question:
my_df<-data.frame(BIRTH = c(1,1,2,3,1,5,3,3,1),
NAME = c("Luke","Luke","Leia","Han","Ben","Lando","Han","Ham","Luke"),
SURNAME = c("Skywalker","Skywalker","Organa","Solo","Solo","Calrissian","Solo","Solo","Wkywalker"),
random_value = c(1,2,3,7,1,3,4,4,9))
Second, make a new column combining all the entries you want to match on:
my_df$birth_and_names <- do.call(paste, c(my_df[c("BIRTH", "NAME", "SURNAME")], sep = " "))
Third, define a distance matrix based upon string distance, using the package stringdist:
library(stringdist)
dist.matrix<-stringdistmatrix(my_df$birth_and_names,my_df$birth_and_names,method='jw',p=0.1)
row.names(dist.matrix)<-my_df$birth_and_names
colnames(dist.matrix)<-my_df$birth_and_names
dist.matrix<-as.dist(dist.matrix)
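As a possible shortcut for the step above, and assuming a reasonably recent version of the stringdist package, stringdistmatrix() can be called with a single vector, in which case it returns a dist object directly. The sketch below uses a shortened, hypothetical keys vector in place of my_df$birth_and_names:

```r
library(stringdist)

# Hypothetical subset of the combined "BIRTH NAME SURNAME" strings
keys <- c("1 Luke Skywalker", "3 Han Solo", "3 Ham Solo", "1 Luke Wkywalker")

# With only one input vector, the result is already of class "dist";
# useNames = "strings" asks for the strings themselves as labels.
d <- stringdistmatrix(keys, method = "jw", p = 0.1, useNames = "strings")

# View it as a full matrix; small values mean similar strings
print(round(as.matrix(d), 3))
```

The resulting object can be passed to hclust() in the same way as dist.matrix.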
Fourth, cluster and display the results as a dendrogram.
clusts<-hclust(dist.matrix,method="ward.D2")
plot(clusts)
[Figure: dendrogram of the clustered birth_and_names strings]
Where exactly to set the threshold for automatically merging similar entries is up to you and depends on the problem. There is the usual trade-off: a looser threshold merges more records that are genuinely different (false positives), while a stricter one leaves more typos uncorrected (false negatives).
For this example, cutting at a distance of 0.2 seems appropriate, so:
my_df$LikelyGroup<-cutree(clusts,h=0.2)
where my_df$LikelyGroup is now a column of identifiers with one number per individual, even if their name is misspelled.
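The trade-off between false positives and false negatives mentioned above can be seen on individual name pairs. This is a minimal sketch that uses base R's adist() (plain Levenshtein edit distance) purely for illustration, not the Jaro-Winkler distance used in this answer. The pairs data frame is made up for the example:

```r
# Name pairs: three typos and one pair of genuinely different names
pairs <- data.frame(
  a = c("Han", "Skywalker", "Luke", "Han"),
  b = c("Ham", "Wkywalker", "Lkue", "Ben"),
  stringsAsFactors = FALSE
)

# adist() returns a matrix; take its single entry for each pair
pairs$edits <- unname(mapply(function(x, y) adist(x, y)[1, 1], pairs$a, pairs$b))

print(pairs)
```

Under this measure Ham and Wkywalker are one edit away from the correct spelling. The swapped letters in Lkue cost two edits, the same as the distance between the different names Han and Ben. A threshold loose enough to catch Lkue would therefore also merge Han and Ben if they shared a birthday.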
Now to name the groups, find the mode for each name/birthday column:
library(dplyr)
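# Return the most frequent value of x (the first one seen wins a tie)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# For each cluster, record the most common birthday, name and surname
my_df<-my_df%>%
  group_by(LikelyGroup)%>%
  mutate(Group_Birth=Mode(BIRTH),
         Group_Name=Mode(NAME),
         Group_Surname=Mode(SURNAME))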
Output my_df:
BIRTH|NAME |SURNAME | random_value| LikelyGroup| Group_Birth|Group_Name |Group_Surname
------|-----|----------|-------------|------------|------------|-----------|--------------
1|Luke |Skywalker | 1| 1| 1|Luke |Skywalker
1|Luke |Skywalker | 2| 1| 1|Luke |Skywalker
2|Leia |Organa | 3| 2| 2|Leia |Organa
3|Han |Solo | 7| 3| 3|Han |Solo
1|Ben |Solo | 1| 4| 1|Ben |Solo
5|Lando|Calrissian| 3| 5| 5|Lando |Calrissian
3|Han |Solo | 4| 3| 3|Han |Solo
3|Ham |Solo | 4| 3| 3|Han |Solo
1|Luke |Wkywalker | 9| 1| 1|Luke |Skywalker
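The question asks for the typos to be replaced, not only flagged. One way to do that, sketched here in base R with ave() instead of dplyr, is to overwrite NAME and SURNAME with the most frequent value in each group. The LikelyGroup values are copied from the table above so that the snippet runs on its own:

```r
# Most frequent value of x (the first one seen wins a tie)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

df <- data.frame(
  BIRTH = c(1, 1, 2, 3, 1, 5, 3, 3, 1),
  NAME = c("Luke", "Luke", "Leia", "Han", "Ben", "Lando", "Han", "Ham", "Luke"),
  SURNAME = c("Skywalker", "Skywalker", "Organa", "Solo", "Solo",
              "Calrissian", "Solo", "Solo", "Wkywalker"),
  LikelyGroup = c(1, 1, 2, 3, 4, 5, 3, 3, 1),  # cluster ids from the table above
  stringsAsFactors = FALSE
)

# ave() applies Mode within each LikelyGroup and returns a vector
# aligned with the original rows, so it can overwrite the column directly.
df$NAME    <- ave(df$NAME,    df$LikelyGroup, FUN = Mode)
df$SURNAME <- ave(df$SURNAME, df$LikelyGroup, FUN = Mode)

print(df[, c("BIRTH", "NAME", "SURNAME")])
```

After this step Ham has become Han and Wkywalker has become Skywalker, matching the result requested in the question.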
See gist at https://gist.github.com/gdmcdonald/9135ec8f7e903a0735a0b16d8cb97297
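For a table with millions of rows, a full pairwise distance matrix over all records is unlikely to fit in memory, since n rows need n(n-1)/2 distances. One possible variation follows the question's own idea of only comparing records that share the same BIRTH. This is a rough sketch under several assumptions. fix_group() and its max_edits parameter are hypothetical names, not part of the answer or of any package. It uses base R's adist() (Levenshtein distance) with single-linkage clustering instead of Jaro-Winkler with Ward clustering. It also treats NAME and SURNAME independently.

```r
# Most frequent value of x (the first one seen wins a tie)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: within one group, merge spellings that are at most
# max_edits apart and replace them with the most frequent spelling.
fix_group <- function(x, max_edits = 1) {
  ux <- unique(x)
  if (length(ux) < 2) return(x)            # only one spelling, nothing to merge
  d  <- as.dist(adist(ux))                 # distances between distinct spellings only
  cl <- cutree(hclust(d, method = "single"), h = max_edits)
  ave(x, cl[match(x, ux)], FUN = Mode)     # replace each cluster by its mode
}

my_df <- data.frame(
  BIRTH = c(1, 1, 2, 3, 1, 5, 3, 3, 1),
  NAME = c("Luke", "Luke", "Leia", "Han", "Ben", "Lando", "Han", "Ham", "Luke"),
  SURNAME = c("Skywalker", "Skywalker", "Organa", "Solo", "Solo",
              "Calrissian", "Solo", "Solo", "Wkywalker"),
  stringsAsFactors = FALSE
)

# Apply the helper separately within each birthday
fixed <- my_df
fixed$NAME    <- ave(my_df$NAME,    my_df$BIRTH, FUN = fix_group)
fixed$SURNAME <- ave(my_df$SURNAME, my_df$BIRTH, FUN = fix_group)

print(fixed)
```

Because distances are computed only between the distinct spellings inside each birthday, the matrices stay small. The price is that two different people born on the same day whose names differ by one letter would be merged, so the threshold needs checking against real data.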
Source: https://stackoverflow.com/questions/45990947/how-to-find-a-typo-in-a-data-frame-and-replace-it