Question
I'm currently working with a large data frame containing lots of text in each row, and I would like to identify and replace the misspelled words in each sentence using the hunspell package. I was able to identify the misspelled words, but I can't figure out how to apply hunspell_suggest to a list.
Here is an example of the data frame:
df1 <- data.frame("Index" = 1:7, "Text" = c("A complec sentence joins an independet",
"Mary and Samantha arived at the bus staton before noon",
"I did not see thm at the station in the mrning",
"The participnts read 60 sentences in radom order",
"how to fix mispelled words in R languge",
"today is Tuesday",
"bing sports quiz"))
I converted the text column to character and used hunspell to identify the misspelled words within each row.
library(hunspell)
df1$Text <- as.character(df1$Text)
df1$word_check <- hunspell(df1$Text)
I tried
df1$suggest <- hunspell_suggest(df1$word_check)
but it keeps giving this error:
Error in hunspell_suggest(df1$word_check) :
is.character(words) is not TRUE
I'm new to this, so I'm not sure what the suggest column produced by hunspell_suggest should look like. Any help would be greatly appreciated.
Answer 1:
Check your intermediate steps. The structure of df1$word_check, as printed by str (shown here for the first five rows), is as follows:
List of 5
$ : chr [1:2] "complec" "independet"
$ : chr [1:2] "arived" "staton"
$ : chr [1:2] "thm" "mrning"
$ : chr [1:2] "participnts" "radom"
$ : chr [1:2] "mispelled" "languge"
which is of type list, whereas hunspell_suggest expects a character vector. If you run lapply(df1$word_check, hunspell_suggest), you get the suggestions for each row.
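A minimal sketch of that approach, assuming the hunspell package and its default en_US dictionary are installed. Each element of df1$word_check is a character vector, so lapply can hand them to hunspell_suggest one row at a time. The shortened df1 below is only for illustration.

```r
library(hunspell)

# Shortened version of the question's data frame (illustration only)
df1 <- data.frame(Index = 1:3,
                  Text = c("A complec sentence joins an independet",
                           "today is Tuesday",
                           "bing sports quiz"),
                  stringsAsFactors = FALSE)

# hunspell() returns a list with one character vector of flagged words per row
df1$word_check <- hunspell(df1$Text)

# The list itself is not a character vector, which is why
# hunspell_suggest(df1$word_check) stops with "is.character(words) is not TRUE"
is.character(df1$word_check)

# Call hunspell_suggest() on each row's character vector instead
df1$suggest <- lapply(df1$word_check, hunspell_suggest)

# Each row now holds a list with one vector of suggestions per flagged word
str(df1$suggest[[1]])
```

The result is a nested list column: one entry per row, and inside it one vector of suggestions per flagged word. Rows without misspellings get an empty list.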
EDIT
I decided to go into more detail on this question as I have not seen any easy alternative. This is what I have come up with:
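# Replace each misspelled word with hunspell's first suggestion.
cleantext <- function(x) {
  vapply(x, function(sentence) {
    bad <- hunspell(sentence)[[1]]
    for (word in bad) {
      suggestions <- hunspell_suggest(word)[[1]]
      # Leave the word unchanged when hunspell offers no suggestion
      if (length(suggestions) > 0) {
        # \\b restricts the match to whole words, not substrings
        sentence <- gsub(paste0("\\b", word, "\\b"), suggestions[1], sentence)
      }
    }
    sentence
  }, character(1), USE.NAMES = FALSE)
}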
Although there is probably a more elegant way to do it, this function returns a character vector with the corrections applied:
> df1$Text
[1] "A complec sentence joins an independet"
[2] "Mary and Samantha arived at the bus staton before noon"
[3] "I did not see thm at the station in the mrning"
[4] "The participnts read 60 sentences in radom order"
[5] "how to fix mispelled words in R languge"
[6] "today is Tuesday"
[7] "bing sports quiz"
> cleantext(df1$Text)
[1] "A complex sentence joins an independent"
[2] "Mary and Samantha rived at the bus station before noon"
[3] "I did not see them at the station in the morning"
[4] "The participants read 60 sentences in radon order"
[5] "how to fix misspelled words in R language"
[6] "today is Tuesday"
[7] "bung sports quiz"
Watch out: this uses the first suggestion given by hunspell, which may or may not be correct. In the output above, "arived" became "rived", "radom" became "radon", and "bing" became "bung".
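One way to guard against wrong first suggestions is to look at all the candidates before replacing anything. The sketch below is an assumption on my part, not part of the function above: it builds a small review table (the names bad, review, and candidates are made up for illustration) with every flagged word next to the full list of suggestions hunspell returns for it.

```r
library(hunspell)

text <- c("The participnts read 60 sentences in radom order",
          "bing sports quiz")

# Collect every distinct flagged word across all rows
bad <- unique(unlist(hunspell(text)))

# One row per flagged word, with all candidate suggestions side by side
review <- data.frame(
  word = bad,
  candidates = vapply(hunspell_suggest(bad), paste, character(1),
                      collapse = ", "),
  stringsAsFactors = FALSE
)
print(review)
```

Scanning a table like this makes it easier to spot words where the first candidate is not the intended one, and to decide which replacements to apply by hand.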
Source: https://stackoverflow.com/questions/56026550/how-to-use-hunspell-package-to-suggest-correct-words-in-a-column-in-r