replacement of words in strings

家住魔仙堡 提交于 2019-12-12 17:43:26

问题


I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.

How can I search a string, a word that matches and replace it?

The expected result is the following example:

a1<- c(" the classroom is ful ")
a2<- c(" full")

In this case I would be replacing ful for full in a1


回答1:


Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.

library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
#  [1] "fool" "flu"  "fl"   "fuel" "furl" "foul" "full" "fun"  "fur"  "fut"  "fol"  "fug"  "fum" 

So even in your example, would you want to replace ful with full, or many of the other options here?

The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.

library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "

But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.

a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3, 
                paste("\\b", 
                      hunspell(a3)[[1]], 
                      "\\b", 
                      collapse = "", sep = ""), 
                hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "

Update

Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.

library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"

Update 2

Addressing your comment, with your new example the issue is back to having words showing up in other words. The solutions is to use \\b. This represents a word boundary. Using pattern "thin" it will match to "thin", "think", "thinking", etc. But if you bracket with \\b it anchors the pattern to a word boundary. \\bthin\\b will only match "thin".

Your example:

a <- c(" thin, thic, thi") 
badwords.corpus <- c("thin", "thic", "thi" ) 
goodwords.corpus <- c("think", "thick", "this")

The solution is to modify badwords.corpus

badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"

Then create the vect.corpus as I describe in the previous update, and use in str_replace_all.

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a, vect.corpus)
# [1] " think, thick, this" 



回答2:


I think the function you are looking for is gsub():

gsub (pattern = "ful", replacement = a2, x = a1)



回答3:


Create a list of the corrections then replace them using gsubfn which is a generalization of gsub that can also take list, function and proto object replacement objects. The regular expression matches a word boundary, one or more word characters and another word boundary. Each time it finds a match it looks up the match in the list names and if found replaces it with the corresponding list value.

library(gsubfn)

L <- list(ful = "full")  # can add more words to this list if desired

gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "



回答4:


For a kind of ordered replacement, you can try this

a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")

qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)

For unordered replacement you can use an approximate string matching (see stringdist::amatch). Here is an example

a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"

library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
  patt <- paste0('\\b', badword, '\\b')
  repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
  final.word <- ifelse(is.na(repl), badword, repl)
  a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"


来源:https://stackoverflow.com/questions/47640925/replacement-of-words-in-strings

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!