问题
I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
How can I search a string, a word that matches and replace it?
The expected result is the following example:
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case I would be replacing ful for full in a1
回答1:
Take a look at the hunspell
package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful
with full
, or many of the other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
paste("\\b",
hunspell(a3)[[1]],
"\\b",
collapse = "", sep = ""),
hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment, with your new example the issue is back to having words showing up in other words. The solutions is to use \\b
. This represents a word boundary. Using pattern "thin" it will match to "thin", "think", "thinking", etc. But if you bracket with \\b
it anchors the pattern to a word boundary. \\bthin\\b
will only match "thin".
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create the vect.corpus as I describe in the previous update, and use in str_replace_all
.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
回答2:
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
回答3:
Create a list of the corrections then replace them using gsubfn
which is a generalization of gsub
that can also take list, function and proto object replacement objects. The regular expression matches a word boundary, one or more word characters and another word boundary. Each time it finds a match it looks up the match in the list names and if found replaces it with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
回答4:
For a kind of ordered replacement, you can try this
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For unordered replacement you can use an approximate string matching (see stringdist::amatch
). Here is an example
a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
patt <- paste0('\\b', badword, '\\b')
repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
final.word <- ifelse(is.na(repl), badword, repl)
a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"
来源:https://stackoverflow.com/questions/47640925/replacement-of-words-in-strings