问题
I have a table with misspelling words. I need to correct those using from the words more similar to that one, the one that have more frequency.
For example, after I run
aggregate(CustomerID ~ Province, ventas2, length)
I get
1
2 AMBA 29
3 BAIRES 1
4 BENOS AIRES 1
12 BUENAS AIRES 1
17 BUENOS AIRES 4
18 buenos aires 7
19 Buenos Aires 3
20 BUENOS AIRES 11337
35 CORDOBA 2297
36 cordoba 1
38 CORDOBESA 1
39 CORRIENTES 424
So I need to replace buenos aires, Buenos Aires, Baires, BUENOS AIRES, with BUENOS AIRES but AMBA shouldn't be replaced. Also CORDOBESA and cordoba should be replaced by CORDOBA, but not CORRIENTES.
How can I do this in R?
Thanks!
回答1:
Here's a possibile solution.
Disclaimer :
This code seems to works fine with your current example. I don't assure that the current parameters (e.g. cut height, cluster agglomeration method, distance method etc.) will be valid for your real (complete) data.
# recreating your data
data <-
read.csv(text=
'City,Occurr
AMBA,29
BAIRES,1
BENOS AIRES,1
BUENAS AIRES,1
BUENOS AIRES,4
buenos aires,7
Buenos Aires,3
BUENOS AIRES,11337
CORDOBA,2297
cordoba,1
CORDOBESA,1
CORRIENTES,424',stringsAsFactors=F)
# simple pre-processing to city strings:
# - removing spaces
# - turning strings to uppercase
cities <- gsub('\\s+','',toupper(data$City))
# string distance computation
# N.B. here you can play with single components of distance costs
d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
# assign original cities names to distance matrix
rownames(d) <- data$City
# clustering cities
hc <- hclust(as.dist(d),method='single')
# plot the cluster dendrogram
plot(hc)
# add the cluster rectangles (just to see the clusters)
# N.B. I decided to cut at distance height < 5
# (read it as: "I consider equal 2 strings needing
# less than 5 modifications to pass from one to the other")
# Obviously you can use another value.
rect.hclust(hc,h=4.9)
# get the clusters ids
clusters <- cutree(hc,h=4.9)
# turn into data.frame
clusters <- data.frame(City=names(clusters),ClusterId=clusters)
# merge with frequencies
merged <- merge(data,clusters,all.x=T,by='City')
# add CityCorrected column to the merged data.frame
ret <- by(merged,
merged$ClusterId,
FUN=function(grp){
idx <- which.max(grp$Occur)
grp$CityCorrected <- grp[idx,'City']
return(grp)
})
fixed <- do.call(rbind,ret)
Result :
> fixed
City Occurr ClusterId CityCorrected
1 AMBA 29 1 AMBA
2.2 BAIRES 1 2 BUENOS AIRES
2.3 BENOS AIRES 1 2 BUENOS AIRES
2.4 BUENAS AIRES 1 2 BUENOS AIRES
2.5 BUENOS AIRES 4 2 BUENOS AIRES
2.6 buenos aires 7 2 BUENOS AIRES
2.7 Buenos Aires 3 2 BUENOS AIRES
2.8 BUENOS AIRES 11337 2 BUENOS AIRES
3.9 cordoba 1 3 CORDOBA
3.10 CORDOBA 2297 3 CORDOBA
3.11 CORDOBESA 1 3 CORDOBA
4 CORRIENTES 424 4 CORRIENTES
Cluster Plot :
回答2:
Here's my small replication of your aggregate result You'll need to change all the calls to data frames to fit whatever the structure of your data is.
df
#output
# word freq
#1 a 1
#2 b 2
#3 c 3
#find the max frequency
mostFrequent<-max(df[,2]) #doesn't handle ties
#find the word we will be replacing with
replaceString<-df[df[,2]==mostFrequent,1]
#[1] "c"
#find all the other words to be replaced
tobereplaced<-df[df[,2]!=mostFrequent,1]
#[1] "a" "b"
Now say you have the following dataframe which contains your entire dataset, I'll just replicate a single column with words.
totalData
# [,1]
#[1,] "a"
#[2,] "c"
#[3,] "b"
#[4,] "d"
#[5,] "f"
#[6,] "a"
#[7,] "d"
#[8,] "b"
#[9,] "c"
We can replace all the words we want to replace, with the string we want to replace them with, by the following call
totaldata[totaldata%in%tobereplaced]<-replaceString
# [,1]
#[1,] "c"
#[2,] "c"
#[3,] "c"
#[4,] "d"
#[5,] "f"
#[6,] "c"
#[7,] "d"
#[8,] "c"
#[9,] "c"
As you can see, all a's and b's have been replaced with c, where the other words are the same
来源:https://stackoverflow.com/questions/25752306/r-function-to-correct-words-by-frequency-of-more-proximate-word