R tm package invalid input in 'utf8towcs'

Submitted by 安稳与你 on 2019-11-26 09:25:52

Question


I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource(\'tmp/\'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) : invalid input \'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp\' in \'utf8towcs\'

The problem is that some characters are not valid. I'd like to exclude the invalid characters from the analysis, either from within R or before importing the files for processing.

I tried using iconv to convert all files to UTF-8 and exclude anything that can't be converted, as follows:

find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \;

as pointed out in Batch convert latin-1 files to utf-8 using iconv.

But I still get the same error.

I'd appreciate any help.


Answer 1:


None of the above answers worked for me. The only way to work around this problem was to remove all non-graphical characters (http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).

The code is as simple as this:

library(stringr)  # str_replace_all comes from the stringr package
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")



Answer 2:


This is from the tm FAQ (http://tm.r-forge.r-project.org/faq.html):

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

It will replace non-convertible bytes in yourCorpus with strings showing their hex codes.

I hope this helps; it did for me.




Answer 3:


I think it is clear by now that the problem is caused by emojis, which tolower is not able to understand.

# to remove emojis; sub = "" drops non-convertible characters
# (without it, iconv returns NA for strings it cannot convert)
dataSet <- iconv(dataSet, from = 'UTF-8', to = 'ASCII', sub = "")



Answer 4:


I have just run afoul of this problem. By any chance, are you using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).

What I was seeing is that using the solution from the FAQ

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

was giving me this warning:

Warning message:
it is not known that wchar_t is Unicode on this platform 

This I traced to the enc2utf8 function. The bad news is that this is a problem with my underlying OS, not with R.

So here is what I did as a work around:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

This forces iconv to use the UTF-8 encoding on the Mac and works fine without the need to recompile.




Answer 5:


I have been running this on a Mac and, to my frustration, I had to identify the offending record (these were tweets) to resolve the issue. Since there is no guarantee the record will be the same next time, I used the following function,

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

as suggested above.

It worked like a charm.




Answer 6:


This is a common issue with the tm package (1, 2, 3).

One non-R way to fix it is to use a text editor to find and replace all the fancy characters (i.e. those with diacritics) in your text before loading it into R (or use gsub in R). For example, you'd search for and replace all instances of the O-umlaut in Öl-Teppich. Others have had success with this (I have too), but if you have thousands of individual text files, obviously this is no good.
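The gsub route mentioned above can be sketched like this (a minimal illustration, not the asker's actual pipeline; the sample string is shortened from the error message):

```r
# Replace every character outside the printable ASCII range (0x20-0x7E)
# with a space before building the corpus; perl = TRUE makes the \x
# ranges apply by code point rather than by collation order
txt   <- "RT @noXforU Erneut riesiger (Alt-)\u00d6lteppich"
clean <- gsub("[^\\x20-\\x7E]", " ", txt, perl = TRUE)
clean  # the O-umlaut is now a plain space
```

Note this throws the non-ASCII characters away entirely, which is only acceptable if the diacritics carry no information you need.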

For an R solution, I found that using VectorSource instead of DirSource seems to solve the problem:

# I put your example text in a file and tested it with both ANSI and 
# UTF-8 encodings; both enabled me to reproduce your problem
#
tmp <- Corpus(DirSource('C:\\...\\tmp/'))
tmp <- tm_map(tmp, tolower)
Error in FUN(X[[1L]], ...) : 
  invalid input 'RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
# quite similar error to what you got, from both ANSI and UTF-8 encodings
#
# Now try VectorSource instead of DirSource
tmp <- readLines('C:\\...\\tmp.txt') 
tmp
[1] "RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp"
# looks ok so far
tmp <- Corpus(VectorSource(tmp))
tmp <- tm_map(tmp, tolower)
tmp[[1]]
rt @noxforu erneut riesiger (alt-)öl–teppich im golf von mexiko (#pics vom freitag) http://bit.ly/bw1hvu http://bit.ly/9r7jcf #oilspill #bp
# seems like it's worked just fine. It worked best for ANSI encoding. 
# There was no error with UTF-8 encoding, but the Ö was returned 
# as Ã– which is not good

But this seems like a bit of a lucky coincidence. There must be a more direct way to go about it. Do let us know what works for you!




Answer 7:


The previous suggestions didn't work for me. I investigated further and found one that worked, described in https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/

# Create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Apply it to substitute the regular expression given in one of the
# former answers with " "
your_corpus <- tm_map(your_corpus, toSpace, "[^[:graph:]]")

# now the tolower transformation works
your_corpus <- tm_map(your_corpus, content_transformer(tolower))



Answer 8:


I have often run into this issue, and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (like converting it’s to itâ€™s).

I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.
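A minimal sketch of the stringi route (assuming the package is installed): stri_trans_tolower() is Unicode-aware, so it lowercases text that makes base tolower() fail with 'utf8towcs':

```r
library(stringi)

# Unicode-aware lowercasing: the O-umlaut is lowercased correctly
# instead of triggering an invalid-input error
stri_trans_tolower("(Alt-)\u00d6lteppich im Golf von Mexiko")
```

In a tm pipeline this would be applied as tm_map(yourCorpus, content_transformer(stri_trans_tolower)).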




Answer 9:


Use the following steps:

# First, save your document in .txt format with UTF-8 encoding
library(tm)
# Set your directory, for example "F:/tmp"
# "/tmp" below is your directory; you can use any language supported
# by R in place of "english"
dataSet <- Corpus(DirSource("/tmp"), readerControl = list(language = "english"))
dataSet <- tm_map(dataSet, tolower)

inspect(dataSet)



Answer 10:


If it's alright to ignore invalid inputs, you could use R's error handling, e.g.:

dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, function(data) {
  # ERROR HANDLING
  possibleError <- tryCatch(
    tolower(data),
    error = function(e) e
  )

  # if (!inherits(possibleError, "error")) {
  #   REAL WORK. Could do more work on your data here,
  #   because you know the input is valid.
  #   useful(data); fun(data); good(data);
  # }
})

There is an additional example here: http://gastonsanchez.wordpress.com/2012/05/29/catching-errors-when-using-tolower/




Answer 11:


The official FAQ solution did not seem to work in my situation:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

Finally I made it work using a for loop and the Encoding function:

for (i in seq_along(dataSet)) {
  Encoding(dataSet[[i]]) <- "UTF-8"
}
dataSet <- tm_map(dataSet, tolower)



Answer 12:


Chad's solution wasn't working for me. I had this embedded in a function, and it was giving an error about iconv needing a vector as input. So I decided to do the conversion before creating the corpus.

myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))



Answer 13:


I was able to fix it by converting the data back to plain-text format using this line of code:

corpus <- tm_map(corpus, PlainTextDocument)

Thanks to user paul-gowder (https://stackoverflow.com/users/4386239/paul-gowder) for his response at https://stackoverflow.com/a/29529990/815677



Source: https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs
