R tm package invalid input in 'utf8towcs'

2020-11-29 01:47

I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)


        
14 Answers
  • 2020-11-29 01:56

    I have often run into this issue, and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (for example, turning the curly apostrophe in “it’s” into mojibake such as “itâ€™s”).

    I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.
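
    For illustration, a minimal sketch of how that swap might look in a tm pipeline (assuming tm >= 0.6 and the stringi package are installed; "dataSet" is the corpus from the question):

    library(tm)
    library(stringi)
    # wrap stri_trans_tolower in content_transformer so the documents keep their class
    dataSet <- tm_map(dataSet, content_transformer(stri_trans_tolower))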

  • 2020-11-29 01:59

    I have just run afoul of this problem. By chance, are you using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).

    What I was seeing is that using the solution from the FAQ

    tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
    

    was giving me this warning:

    Warning message:
    it is not known that wchar_t is Unicode on this platform 
    

    I traced this to the enc2utf8 function. The bad news is that this is a problem with my underlying OS, not with R.

    So here is what I did as a workaround:

    tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
    

    This forces iconv to use the UTF-8 encoding on the Mac and works fine without needing to recompile R.
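
    If you are on a newer tm release, the function may also need to be wrapped in content_transformer() so tm_map keeps the documents' class intact. A hedged sketch of the same workaround in that form (assuming tm >= 0.6 is loaded and "yourCorpus" is your corpus):

    yourCorpus <- tm_map(yourCorpus,
                         content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')))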

  • 2020-11-29 01:59

    This is a common issue with the tm package (1, 2, 3).

    One non-R way to fix it is to use a text editor to find and replace all the fancy characters (i.e. those with diacritics) in your text before loading it into R (or use gsub in R, as sketched just below). For example, you'd search and replace all instances of the O-umlaut in Öl-Teppich. Others have had success with this (I have too), but if you have thousands of individual text files this is obviously no good.
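
    A rough sketch of that gsub route (the object name "texts" and the "Oe" replacement are just illustrative assumptions):

    # replace the problematic character before building the corpus
    texts <- gsub("Ö", "Oe", texts, fixed = TRUE)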

    For an R solution, I found that using VectorSource instead of DirSource seems to solve the problem:

    # I put your example text in a file and tested it with both ANSI and 
    # UTF-8 encodings, both enabled me to reproduce your problem
    #
    tmp <- Corpus(DirSource('C:\\...\\tmp/'))
    tmp <- tm_map(tmp, tolower)
    Error in FUN(X[[1L]], ...) : 
      invalid input 'RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
    # quite similar error to what you got, both from ANSI and UTF-8 encodings
    #
    # Now try VectorSource instead of DirSource
    tmp <- readLines('C:\\...\\tmp.txt') 
    tmp
    [1] "RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp"
    # looks ok so far
    tmp <- Corpus(VectorSource(tmp))
    tmp <- tm_map(tmp, tolower)
    tmp[[1]]
    rt @noxforu erneut riesiger (alt-)öl–teppich im golf von mexiko (#pics vom freitag) http://bit.ly/bw1hvu http://bit.ly/9r7jcf #oilspill #bp
    # seems like it worked just fine. It worked best with the ANSI encoding.
    # There was no error with the UTF-8 encoding, but the Ö was returned
    # as Ã– which is not good
    

    But this seems like a bit of a lucky coincidence. There must be a more direct way to do it. Do let us know what works for you!

  • 2020-11-29 02:02

    I was able to fix it by converting the data back to plain-text format using this line of code:

    corpus <- tm_map(corpus, PlainTextDocument)

    Thanks to user paul-gowder (https://stackoverflow.com/users/4386239/paul-gowder) for his answer here: https://stackoverflow.com/a/29529990/815677

  • 2020-11-29 02:03

    None of the above answers worked for me. The only way to work around this problem was to remove all non-graphical characters (see http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).

    The code is this simple (it uses the stringr package):

    library(stringr)
    usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")
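
    The same substitution can also be done in base R without stringr (an equivalent sketch, not from the original answer):

    usableText <- gsub("[^[:graph:]]", " ", tweets$text)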
    
  • 2020-11-29 02:03

    The previous suggestions didn't work for me. I investigated further and found one that did, in the following post: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/

    # Create the toSpace content transformer
    toSpace <- content_transformer(function(x, pattern) {return(gsub(pattern, " ", x))})
    # Apply it to substitute the regular expression given in one of the previous answers with " "
    your_corpus <- tm_map(your_corpus, toSpace, "[^[:graph:]]")
    
    # the tolower transformation worked!
    your_corpus <- tm_map(your_corpus, content_transformer(tolower))
    