R function to correct words by frequency of more proximate word

烈酒焚心 submitted on 2021-01-28 22:15:05

Question


I have a table with misspelled words. I need to correct each one by replacing it with the most frequent of the words that are most similar to it.

For example, after I run

aggregate(CustomerID ~ Province, ventas2, length)

I get

1
2           AMBA     29
3         BAIRES      1
4    BENOS AIRES      1

12  BUENAS AIRES      1

17 BUENOS  AIRES      4
18  buenos aires      7
19  Buenos Aires      3
20  BUENOS AIRES  11337
35       CORDOBA   2297
36       cordoba      1
38     CORDOBESA      1
39    CORRIENTES    424

So I need to replace variants such as buenos aires, Buenos Aires, BAIRES, and BUENOS  AIRES with BUENOS AIRES, but AMBA shouldn't be replaced. Likewise, CORDOBESA and cordoba should be replaced by CORDOBA, but CORRIENTES should be left alone.

How can I do this in R?

Thanks!


Answer 1:


Here's a possible solution.

Disclaimer:
This code seems to work fine with your current example. I can't guarantee that the current parameters (e.g. cut height, cluster agglomeration method, distance method, etc.) will be valid for your real (complete) data.

# recreating your data
data <- 
read.csv(text=
'City,Occurr
AMBA,29
BAIRES,1
BENOS AIRES,1
BUENAS AIRES,1
BUENOS  AIRES,4
buenos aires,7
Buenos Aires,3
BUENOS AIRES,11337
CORDOBA,2297
cordoba,1
CORDOBESA,1
CORRIENTES,424',stringsAsFactors=FALSE)


# simple pre-processing to city strings:
# - removing spaces
# - turning strings to uppercase
cities <- gsub('\\s+','',toupper(data$City))

# string distance computation
# N.B. here you can play with single components of distance costs 
d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
# assign original cities names to distance matrix
rownames(d) <- data$City
# clustering cities
hc <- hclust(as.dist(d),method='single')

# plot the cluster dendrogram
plot(hc)
# add the cluster rectangles (just to see the clusters) 
# N.B. I decided to cut at distance height < 5
#      (read it as: "I consider equal 2 strings needing
#       less than 5 modifications to pass from one to the other")
#      Obviously you can use another value.
rect.hclust(hc,h=4.9)

# get the clusters ids
clusters <- cutree(hc,h=4.9) 
# turn into data.frame
clusters <- data.frame(City=names(clusters),ClusterId=clusters)

# merge with frequencies
merged <- merge(data,clusters,all.x=TRUE,by='City') 

# add CityCorrected column to the merged data.frame
ret <- by(merged, 
          merged$ClusterId,
          FUN=function(grp){
                idx <- which.max(grp$Occurr)
                grp$CityCorrected <- grp[idx,'City']
                return(grp)
              })

fixed <- do.call(rbind,ret)

Result:

> fixed
              City Occurr ClusterId CityCorrected
1             AMBA     29         1          AMBA
2.2         BAIRES      1         2  BUENOS AIRES
2.3    BENOS AIRES      1         2  BUENOS AIRES
2.4   BUENAS AIRES      1         2  BUENOS AIRES
2.5  BUENOS  AIRES      4         2  BUENOS AIRES
2.6   buenos aires      7         2  BUENOS AIRES
2.7   Buenos Aires      3         2  BUENOS AIRES
2.8   BUENOS AIRES  11337         2  BUENOS AIRES
3.9        cordoba      1         3       CORDOBA
3.10       CORDOBA   2297         3       CORDOBA
3.11     CORDOBESA      1         3       CORDOBA
4       CORRIENTES    424         4    CORRIENTES
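
If you want to apply the same idea directly to a raw column, the steps above can be wrapped in a function. This is only a sketch: the name correct_by_cluster and its argument h are made up for illustration, it counts the frequencies itself with table(), and it reuses the same pre-processing, distance, linkage and cut height as above, so the same caveat about parameters applies.

```r
# Hypothetical helper (not part of the answer above): returns x with every
# spelling replaced by the most frequent spelling in its cluster.
correct_by_cluster <- function(x, h = 4.9) {
  x <- as.character(x)
  freq   <- table(x)            # occurrences of each distinct spelling
  words  <- names(freq)
  counts <- as.vector(freq)
  # hclust() needs at least two items to cluster
  if (length(words) < 2) return(x)

  # same pre-processing as above: uppercase, strip whitespace
  key <- gsub("\\s+", "", toupper(words))
  hc  <- hclust(as.dist(adist(key)), method = "single")
  id  <- cutree(hc, h = h)

  # most frequent spelling within each cluster
  best <- vapply(split(seq_along(words), id),
                 function(i) words[i][which.max(counts[i])],
                 character(1))
  canonical <- unname(best[as.character(id)])

  # map every original value to its cluster's canonical spelling
  canonical[match(x, words)]
}
```

With the data frame from the question, something like ventas2$Province <- correct_by_cluster(ventas2$Province) should then do the replacement in place. Check the dendrogram first, since a cut height that works for this sample may merge unrelated provinces on the full data.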

Cluster plot:

[Image: cluster dendrogram from plot(hc), with the rect.hclust rectangles]




Answer 2:


Here's my small replication of your aggregate result. You'll need to change all the data frame calls to fit the structure of your own data.

df
#output
#       word freq
#1         a    1
#2         b    2
#3         c    3

#find the max frequency
mostFrequent<-max(df[,2])  #doesn't handle ties

#find the word we will be replacing with
replaceString<-df[df[,2]==mostFrequent,1]
#[1] "c"

#find all the other words to be replaced
tobereplaced<-df[df[,2]!=mostFrequent,1]
#[1] "a" "b"
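
As the comment above notes, max() does not handle ties: if two words share the top frequency, replaceString ends up with both of them. One possible way around this, sketched here on a made-up data frame (df_tie is hypothetical, not your data), is which.max(), which always returns the index of the first maximum.

```r
# hypothetical toy data with a tie for the highest frequency
df_tie <- data.frame(word = c("a", "b", "c"),
                     freq = c(1, 3, 3),
                     stringsAsFactors = FALSE)

# which.max() returns the position of the first maximum only,
# so exactly one replacement word is chosen
replaceString <- df_tie$word[which.max(df_tie$freq)]

# every other word is a candidate for replacement
tobereplaced <- setdiff(df_tie$word, replaceString)
```

Here the first of the tied words wins, which is an arbitrary choice. If ties matter in your data you may prefer to break them by some other rule.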

Now say you have the following object, which contains your entire dataset. I'll just replicate a single column of words.

totalData
 #    [,1]
 #[1,] "a" 
 #[2,] "c" 
 #[3,] "b" 
 #[4,] "d" 
 #[5,] "f" 
 #[6,] "a" 
 #[7,] "d" 
 #[8,] "b" 
 #[9,] "c" 

We can replace every word in tobereplaced with replaceString using the following call:

totalData[totalData%in%tobereplaced]<-replaceString
 #    [,1]
 #[1,] "c" 
 #[2,] "c" 
 #[3,] "c" 
 #[4,] "d" 
 #[5,] "f" 
 #[6,] "c" 
 #[7,] "d" 
 #[8,] "c" 
 #[9,] "c"

As you can see, all a's and b's have been replaced with c, while the other words are unchanged.
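
If different misspellings need different corrections, the same %in% idea can be extended with a named lookup vector. The sketch below rebuilds the toy matrix so it runs on its own. The lookup vector and the f to d mapping are invented purely for illustration.

```r
# toy data from above, rebuilt so the sketch is self-contained
totalData <- matrix(c("a", "c", "b", "d", "f", "a", "d", "b", "c"), ncol = 1)

# hypothetical lookup: names are the words to replace,
# values are the words to replace them with
lookup <- c(a = "c", b = "c", f = "d")

# replace only the entries that appear in the lookup
hit <- totalData %in% names(lookup)
totalData[hit] <- unname(lookup[totalData[hit]])
```

Words that are not in names(lookup) are left untouched, so the lookup only has to list the misspellings.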



Source: https://stackoverflow.com/questions/25752306/r-function-to-correct-words-by-frequency-of-more-proximate-word
