Question
I'm using the tm package to clean up a Twitter Corpus. However, the package is unable to clean up emoticons.
Here is a reproducible example:
July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'
Can someone point me in the right direction to remove the emoticons using the tm package?
Thank you,
Luis
Answer 1:
You can use gsub to get rid of all non-ASCII characters.
Texts = c("Let the stormy clouds chase, everyone from the place ☁ ♪ ♬",
"See you soon brother ☮ ",
"A boring old-fashioned message" )
gsub("[^\x01-\x7F]", "", Texts)
[1] "Let the stormy clouds chase, everyone from the place "
[2] "See you soon brother "
[3] "A boring old-fashioned message"
Details:
You can specify character classes in a regular expression with [ ]. When the class description starts with ^, it means everything except these characters. Here, the class is everything except characters 1-127, i.e. everything except standard ASCII, and each match is replaced with the empty string.
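To connect this back to the corpus in the question, here is a minimal sketch of running the same pattern before tolower. The helper name remove_non_ascii and the sample texts are made up for illustration. The tm part only runs if the package is installed, and it assumes the corpus holds plain text documents.

```r
# Drop every character outside the ASCII range 1-127.
# `remove_non_ascii` is a hypothetical helper name, not part of tm.
remove_non_ascii <- function(x) gsub("[^\x01-\x7F]", "", x)

texts <- c("See you soon brother \u262e", "A boring old-fashioned message")
cleaned <- remove_non_ascii(texts)
print(tolower(cleaned))

# With tm available, wrap the helper in content_transformer() so that
# tm_map() applies it to each document before lower-casing.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(texts))
  corpus <- tm::tm_map(corpus, tm::content_transformer(remove_non_ascii))
  corpus <- tm::tm_map(corpus, tm::content_transformer(tolower))
  print(as.character(corpus[[1]]))
}
```

Stripping the non-ASCII characters first matters because tolower is the step that fails on the invalid input in the question.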
Answer 2:
You can try the iconv function:
iconv(July4th_clean, "latin1", "ASCII", sub="")
Duplicate issue, see post
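As a rough illustration of the iconv approach on a plain character vector: the sketch below assumes the input is UTF-8 (the call above declares it as "latin1" instead), and the helper name to_ascii and the sample tweets are invented for the example.

```r
# Convert to ASCII; sub = "" drops any byte that has no ASCII equivalent.
# Assumes UTF-8 input. `to_ascii` is a hypothetical helper name.
to_ascii <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

tweets <- c("Love of country \U0001F1FA\U0001F1F8 July4th", "plain text")
ascii_tweets <- to_ascii(tweets)
print(tolower(ascii_tweets))
```

For a tm corpus, it is probably safer to pass a helper like this through content_transformer and tm_map than to call iconv on the corpus object directly.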
Source: https://stackoverflow.com/questions/44893354/remove-emoticons-in-r-using-tm-package