How to find a typo in a data frame and replace it

Question

I have a data frame with names, surnames, birthdays and some random variables. Let's say it looks like this:

    BIRTH  NAME    SURNAME random_value
 1      1  Luke  Skywalker            1
 2      1  Luke  Skywalker            2
 4      2  Leia     Organa            3
 5      3   Han       Solo            7
 7      1   Ben       Solo            1
 8      5 Lando Calrissian            3
 9      3   Han       Solo            4
 10     3   Ham       Solo            4
 11     1  Luke  Wkywalker            9

How can I figure out whether there is a typo in a name or surname, based on BIRTH, NAME and SURNAME, and then replace the typo with the correct name or surname?

For example, there are two Han Solos with BIRTH 3, and there is also a Ham Solo with the same BIRTH. I would like the algorithm to figure out that Ham is wrong and replace it with Han.

If two different spellings occur equally often (for the same BIRTH), it doesn't really matter which one is chosen, as long as all the NAME or SURNAME values in that group end up the same (so always Ham or always Han, but not mixed for the same BIRTH).

So the end result would be this:

       BIRTH  NAME    SURNAME random_value
    1      1  Luke  Skywalker            1
    2      1  Luke  Skywalker            2
    4      2  Leia     Organa            3
    5      3   Han       Solo            7
    7      1   Ben       Solo            1
    8      5 Lando Calrissian            3
    11     3   Han       Solo            4
    12     3   Han       Solo            4
    13     1  Luke  Skywalker            9

Is there any automated way to do this? My data set is large (more than 3 million rows) and it would be impossible to check manually.

I would imagine that we look at all the names and surnames with the same BIRTH and then check whether there are singular outliers that differ by only one letter, or in which the order of the letters is switched (Luke vs. Lkue). When we find an outlier like that, we replace it.

Answer 1:

Here is one way to find the typos. First, define the data frame you mention in the question:

my_df<-data.frame(BIRTH = c(1,1,2,3,1,5,3,3,1),
       NAME = c("Luke","Luke","Leia","Han","Ben","Lando","Han","Ham","Luke"),
       SURNAME = c("Skywalker","Skywalker","Organa","Solo","Solo","Calrissian","Solo","Solo","Wkywalker"),
       random_value = c(1,2,3,7,1,3,4,4,9))

Second, make a new column combining all the entries you want to match on:

my_df$birth_and_names <- do.call(paste, c(my_df[c("BIRTH", "NAME", "SURNAME")], sep = " "))

Third, define a distance matrix based upon string distance, using the package stringdist:

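library(stringdist)
# Pairwise Jaro-Winkler distances between all rows (p is the prefix weight)
dist.matrix<-stringdistmatrix(my_df$birth_and_names,my_df$birth_and_names,method='jw',p=0.1)
# Label rows and columns; colnames(), not names(), sets a matrix's column labels
rownames(dist.matrix)<-my_df$birth_and_names
colnames(dist.matrix)<-my_df$birth_and_names
dist.matrix<-as.dist(dist.matrix)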
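To get a feel for what such a distance matrix contains, here is a minimal sketch that uses base R's adist() (Levenshtein edit distance) as a stand-in for the Jaro-Winkler distance above. The swap is an assumption made only so the example runs without extra packages: adist() counts edits, whereas 'jw' returns a score between 0 and 1. The keys vector is a hand-picked subset of the combined strings.

```r
# A few combined "BIRTH NAME SURNAME" strings from the example data
keys <- c("3 Han Solo", "3 Ham Solo", "1 Ben Solo",
          "1 Luke Skywalker", "1 Luke Wkywalker")

# adist() is in the utils package, which is attached by default
d <- adist(keys)
dimnames(d) <- list(keys, keys)

# Typo pairs sit one edit apart; genuinely different people are further apart
print(d)
```

Small off-diagonal values mark likely typos of the same person, which is what the clustering step relies on.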

Fourth, cluster and display the results as a dendrogram.

clusts<-hclust(dist.matrix,method="ward.D2")
plot(clusts)

[Figure: dendrogram of the hierarchical clustering of the birth_and_names strings]

Where exactly you set the threshold for automatically merging similar entries is up to you and depends on the problem. There is the usual trade-off: a higher cut merges more genuinely different people (false positives), while a lower cut leaves more typos unmerged (false negatives).

For this example, cutting at a distance of 0.2 seems appropriate, so:

my_df$LikelyGroup<-cutree(clusts,h=0.2)

my_df$LikelyGroup now holds one identifier per individual, even where the name is misspelled.

Now, to name the groups, take the mode of each of the BIRTH, NAME and SURNAME columns within each group:
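With more than 3 million rows, a full pairwise distance matrix is unlikely to fit in memory, since its size grows with the square of the row count. One possible workaround, in line with the question's idea of comparing only entries that share a BIRTH, is to cluster each BIRTH block separately and only over its distinct spellings. The sketch below is an assumption-laden illustration, not the method used above: group_within_birth and max_edits are hypothetical names, and base R's adist() with single linkage is swapped in for Jaro-Winkler with Ward clustering so that it runs without extra packages.

```r
# Toy copy of the question's data (base R only)
df <- data.frame(
  BIRTH   = c(1, 1, 2, 3, 1, 5, 3, 3, 1),
  NAME    = c("Luke", "Luke", "Leia", "Han", "Ben", "Lando", "Han", "Ham", "Luke"),
  SURNAME = c("Skywalker", "Skywalker", "Organa", "Solo", "Solo",
              "Calrissian", "Solo", "Solo", "Wkywalker"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group the spellings found within ONE birth value.
# Distinct spellings linked by a chain of steps of at most max_edits edits
# receive the same id.
group_within_birth <- function(keys, max_edits = 1) {
  uniq <- unique(keys)
  if (length(uniq) < 2) return(rep(1L, length(keys)))  # hclust needs >= 2 items
  d <- as.dist(adist(uniq))                            # Levenshtein distances
  grp <- cutree(hclust(d, method = "single"), h = max_edits)
  unname(grp[match(keys, uniq)])
}

# Cluster each BIRTH block on its own, so no full n-by-n matrix is ever built
keys <- paste(df$NAME, df$SURNAME)
local_id <- integer(nrow(df))
for (idx in split(seq_len(nrow(df)), df$BIRTH)) {
  local_id[idx] <- group_within_birth(keys[idx])
}

# Combine BIRTH and the within-block id into one group identifier
df$LikelyGroup <- as.integer(factor(paste(df$BIRTH, local_id)))
```

The default of one edit is deliberately conservative. Under Levenshtein distance a swapped pair of letters (Luke vs. Lkue) counts as two edits, so catching those would need max_edits = 2, at the cost of also merging short names such as "Han Solo" and "Ben Solo" if they shared a BIRTH.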

library(dplyr)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

my_df<-my_df%>%
  group_by(LikelyGroup)%>%
  mutate(Group_Birth=Mode(BIRTH),
         Group_Name=Mode(NAME),
         Group_Surname=Mode(SURNAME))

Output my_df:

 BIRTH|NAME |SURNAME   | random_value| LikelyGroup| Group_Birth|Group_Name |Group_Surname 
------|-----|----------|-------------|------------|------------|-----------|--------------
     1|Luke |Skywalker |            1|           1|           1|Luke       |Skywalker     
     1|Luke |Skywalker |            2|           1|           1|Luke       |Skywalker     
     2|Leia |Organa    |            3|           2|           2|Leia       |Organa        
     3|Han  |Solo      |            7|           3|           3|Han        |Solo          
     1|Ben  |Solo      |            1|           4|           1|Ben        |Solo          
     5|Lando|Calrissian|            3|           5|           5|Lando      |Calrissian    
     3|Han  |Solo      |            4|           3|           3|Han        |Solo          
     3|Ham  |Solo      |            4|           3|           3|Han        |Solo          
     1|Luke |Wkywalker |            9|           1|           1|Luke       |Skywalker
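The question asks for the typos to be replaced, not only flagged. One way to do that, once a group identifier exists, is to overwrite NAME and SURNAME with the per-group mode. The following is a minimal base-R sketch under assumptions: df is a small toy stand-in for my_df, and its LikelyGroup values are typed in by hand instead of coming from cutree().

```r
# Toy data standing in for my_df after the grouping step
df <- data.frame(
  BIRTH       = c(3, 3, 3, 1, 1, 1),
  NAME        = c("Han", "Han", "Ham", "Luke", "Luke", "Luke"),
  SURNAME     = c("Solo", "Solo", "Solo", "Skywalker", "Skywalker", "Wkywalker"),
  LikelyGroup = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Most frequent value; ties go to the value that appears first
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# ave() applies Mode within each LikelyGroup and fills every row of the group
df$NAME    <- ave(df$NAME,    df$LikelyGroup, FUN = Mode)
df$SURNAME <- ave(df$SURNAME, df$LikelyGroup, FUN = Mode)

print(df)
```

Because ties resolve to the first spelling encountered, every row in a group still ends up with the same value, which is the consistency the question asks for.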

See gist at https://gist.github.com/gdmcdonald/9135ec8f7e903a0735a0b16d8cb97297

Source: https://stackoverflow.com/questions/45990947/how-to-find-a-typo-in-a-data-frame-and-replace-it

Tags

dataframe

data-cleaning