How to remove non UTF-8 characters from text

放肆的年华 submitted on 2021-02-17 06:25:07

Question


I need help removing non-UTF-8 characters from my word cloud. This is my code so far. I've tried gsub and removeWords, but the characters still show up in my word cloud and I don't know how to get rid of them. Any help would be appreciated. Thank you for your time.

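library(tm)            # VCorpus, tm_map, removeWords, TermDocumentMatrix
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal

txt <- readLines("11-0.txt")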
corpus = VCorpus(VectorSource(txt))
gsub("’","‘","",txt)

corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace) 
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","â€","project"))

tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)

wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))

Edit: Here is my iconv version

txt <- readLines("11-0.txt")
Encoding(txt) <- "latin1"
iconv(txt, "latin1", "ASCII", sub="")

corpus = VCorpus(VectorSource(txt))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace) 
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","project"))

tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)

wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))
title(main="Alice in Wonderland word cloud",font.main=1,cex.main =1.5)

Answer 1:


The signature of gsub is:

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Not sure what you wanted to do with

gsub("’","‘","",txt)

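but that line is probably not doing what you want it to do. Matched by position against the signature above, "’" is the pattern, "‘" is the replacement, "" is x (the string being searched), and txt ends up in the ignore.case slot. The result is also never assigned, so txt is left unchanged.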

See here for a previous SO question on gsub and non-ascii symbols.
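
One plausible reading is that the intent was to strip both curly quotes from txt. Here is a minimal sketch under that assumption. The variable name cleaned and the character class are illustrative and do not come from the question, and the quotes are written as Unicode escapes so the example does not depend on the encoding of the script file:

```r
# Sample input mirroring the test string used later in this answer.
# \u2019 is the right single quote, \u2018 the left one.
txt <- "\u2019xxx\u2018"

# Name the arguments so nothing is matched by position by accident,
# and assign the result: gsub() does not modify txt in place.
cleaned <- gsub(pattern = "[\u2018\u2019]", replacement = "", x = txt)

print(cleaned)
```

The character class matches either quote in a single pass, so one call is enough. This only removes the two characters listed; any other non-ASCII symbols in the text would remain.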

Edit:

Suggested solution using iconv:

Removing all non-ASCII characters:

txt <- "’xxx‘"

iconv(txt, "latin1", "ASCII", sub="")

Returns:

[1] "xxx"    
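
Note that iconv returns the converted vector rather than changing txt in place, so in a script the result has to be assigned back before the corpus is built. Here is a small sketch of that step, using made-up sample lines as a stand-in for readLines("11-0.txt"):

```r
# Hypothetical sample lines standing in for readLines("11-0.txt").
txt <- c("\u2018Curiouser and curiouser!\u2019 cried Alice",
         "plain ascii line")

# Same call as above, but with the result assigned back to txt.
# sub = "" drops every byte that cannot be represented in ASCII.
txt <- iconv(txt, "latin1", "ASCII", sub = "")

print(txt)
```

After this step txt should contain only ASCII characters and can be passed to VCorpus(VectorSource(txt)) as in the question. The exact handling of unconvertible bytes depends on the platform's iconv implementation, so it is worth checking the output on your own system.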


Source: https://stackoverflow.com/questions/60259657/how-to-remove-non-utf-8-characters-from-text
