I'm trying to use the tm package in R to perform some text analysis. I tried the following:
require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <
I have often run into this issue, and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (like converting it’s to itâ€™s).
I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.
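For illustration, here is a minimal sketch of that swap. The sample string is made up for the example, and the tm_map() line in the comment assumes a tm version that provides content_transformer(); treat it as a starting point rather than a tested recipe.

```r
library(stringi)  # install.packages("stringi") if it is not installed yet

# Made-up sample text containing a capital O-umlaut (written as \u00D6)
tweet <- "RT Erneut riesiger (Alt-)\u00D6l-Teppich"

# stri_trans_tolower() does Unicode-aware lowercasing
lowered <- stri_trans_tolower(tweet)
print(lowered)

# With a tm corpus (hypothetical object name), the same function could be
# plugged in like this:
# yourCorpus <- tm_map(yourCorpus, content_transformer(stri_trans_tolower))
```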
I have just run afoul of this problem. By chance are you using a machine running OSX? I am and seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html)
What I was seeing is that using the solution from the FAQ
tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
was giving me this warning:
Warning message:
it is not known that wchar_t is Unicode on this platform
This I traced to the enc2utf8 function. The bad news is that this is a problem with my underlying OS and not R.
So here is what I did as a workaround:
tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
This forces iconv to use the UTF-8 encoding on the Mac and works fine without the need to recompile.
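To see what sub = "byte" does outside of tm, here is a small base-R sketch. The input string is invented for the example, and it uses plain "UTF-8" as the target rather than 'UTF-8-MAC', since the latter is only available on some platforms.

```r
# Made-up input: "\xe9" is a Latin-1 e-acute, which is not a valid
# UTF-8 byte sequence on its own.
x <- "caf\xe9 #oilspill"
print(validUTF8(x))

# Re-encode as UTF-8. With sub = "byte", each byte that cannot be
# converted is replaced by its hex code in angle brackets instead of
# turning the whole string into NA.
fixed <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
print(validUTF8(fixed))

# The cleaned string can now be lowercased without an encoding error
print(tolower(fixed))
```

The trade-off is that the offending characters are kept only as hex placeholders, so this fixes the error rather than the text itself.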
This is a common issue with the tm package (1, 2, 3).
One non-R way to fix it is to use a text editor to find and replace all the fancy characters (i.e. those with diacritics) in your text before loading it into R (or use gsub in R). For example, you'd search and replace all instances of the O-umlaut in Öl-Teppich. Others have had success with this (I have too), but if you have thousands of individual text files, obviously this is no good.
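As a rough sketch of the gsub route, something like the following could work. The replacement spellings ("Oe", "oe") and the variable names are my own choices for the example, not anything tm requires.

```r
# Made-up sample line with a capital O-umlaut (written as \u00D6)
x <- "Erneut riesiger (Alt-)\u00D6l-Teppich im Golf von Mexiko"

# Characters to replace and their ASCII stand-ins (an assumption for
# this example; extend the two vectors as needed)
from <- c("\u00D6", "\u00F6")
to   <- c("Oe", "oe")

# fixed = TRUE treats each pattern as a literal string, not a regex
for (i in seq_along(from)) {
  x <- gsub(from[i], to[i], x, fixed = TRUE)
}
print(x)
```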
For an R solution, I found that using VectorSource instead of DirSource seems to solve the problem:
# I put your example text in a file and tested it with both ANSI and
# UTF-8 encodings, both enabled me to reproduce your problem
#
tmp <- Corpus(DirSource('C:\\...\\tmp/'))
tmp <- tm_map(tmp, tolower)
Error in FUN(X[[1L]], ...) :
invalid input 'RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
# quite similar error to what you got, both from ANSI and UTF-8 encodings
#
# Now try VectorSource instead of DirSource
tmp <- readLines('C:\\...\\tmp.txt')
tmp
[1] "RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp"
# looks ok so far
tmp <- Corpus(VectorSource(tmp))
tmp <- tm_map(tmp, tolower)
tmp[[1]]
rt @noxforu erneut riesiger (alt-)öl–teppich im golf von mexiko (#pics vom freitag) http://bit.ly/bw1hvu http://bit.ly/9r7jcf #oilspill #bp
# seems like it's worked just fine. It worked best for ANSI encoding.
# There was no error with UTF-8 encoding, but the Ö was returned
# as ã–, which is not good
But this seems like a bit of a lucky coincidence. There must be a more direct way about it. Do let us know what works for you!
I was able to fix it by converting the data back to plain text format using this line of code:
corpus <- tm_map(corpus, PlainTextDocument)
Thanks to user https://stackoverflow.com/users/4386239/paul-gowder for his response here: https://stackoverflow.com/a/29529990/815677
None of the above answers worked for me. The only way to work around this problem was to remove all non-graphical characters (http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).
The code is this simple:
library(stringr)  # provides str_replace_all()
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")
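If you would rather not load stringr, a base-R sketch of the same idea is below. The sample string is invented. Note that how [:graph:] treats accented letters depends on the locale and regex engine, so check the result on your own data.

```r
# Made-up sample containing a tab and a control character (\x01)
x <- "oil\tspill\x01 #bp"

# Replace everything that is not a visible (graphical) character with a
# space; ordinary spaces are also "replaced", but by a space again
cleaned <- gsub("[^[:graph:]]", " ", x)
print(cleaned)
```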
The former suggestions didn't work for me. I investigated more and found one that worked in the following post: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
#Create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Apply it for substituting the regular expression given in one of the former answers by " "
your_corpus <- tm_map(your_corpus, toSpace, "[^[:graph:]]")
# the tolower transformation worked!
your_corpus <- tm_map(your_corpus, content_transformer(tolower))