R tm package invalid input in 'utf8towcs'

逝去的感伤 2020-11-29 01:47

I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <         


        
14 answers
  • 2020-11-29 02:20

    The fix suggested in the official FAQ did not work in my situation:

    tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
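
To see what `sub='byte'` does in isolation, here is a minimal base-R sketch that does not need tm. The sample string and the variable names `bad` and `cleaned` are made up for illustration, and it converts to plain "UTF-8" rather than the macOS-specific "UTF-8-MAC" used above, so results may differ slightly by platform.

```r
# Hypothetical input: "caf" followed by a lone 0xE9 byte, which is not valid UTF-8
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

validUTF8(bad)  # the string is not valid UTF-8

# sub = "byte" replaces each non-convertible byte with a <xx> hex marker
cleaned <- iconv(bad, from = "UTF-8", to = "UTF-8", sub = "byte")

validUTF8(cleaned)  # the cleaned string should now be valid
tolower(cleaned)    # and can be passed to tolower()
```

If the cleaned text still triggers the error inside tm_map, the documents probably need a different source encoding in `from`.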
    

    I finally got it working with a for loop and the Encoding function:

    # Mark every document in dataSet as UTF-8 before transforming it
    for (i in seq_along(dataSet))
    {
      Encoding(dataSet[[i]]) <- "UTF-8"
    }
    corpus <- tm_map(dataSet, tolower)
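
It may help to know that `Encoding<-` only changes the declared encoding of a string and leaves its bytes untouched. The following base-R sketch shows this on a single string; the sample text and the names `x` and `before` are illustrative assumptions, not part of the original answer.

```r
# Hypothetical input: the UTF-8 bytes for "café", created without an encoding mark
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))
before <- charToRaw(x)

Encoding(x)             # "unknown": R treats the bytes as native encoding
Encoding(x) <- "UTF-8"  # declare the bytes to be UTF-8
Encoding(x)             # "UTF-8"

identical(charToRaw(x), before)  # TRUE: the bytes themselves did not change
nchar(x)                         # 4 characters once R reads the bytes as UTF-8
```

This works only when the text really is UTF-8. If the files are in another encoding such as Latin-1, marking them as UTF-8 leaves the invalid bytes in place.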
    
  • 2020-11-29 02:23

    If it's alright to ignore invalid inputs, you could use R's error handling, for example:

      dataSet <- Corpus(DirSource('tmp/'))
      dataSet <- tm_map(dataSet, function(data) {
         #ERROR HANDLING
         possibleError <- tryCatch(
             tolower(data),
             error=function(e) e
         )
    
         # if(!inherits(possibleError, "error")){
         #   REAL WORK. Could do more work on your data here,
         #   because you know the input is valid.
         #   useful(data); fun(data); good(data);
         # }
      }) 
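
A smaller variant of the same idea, as a sketch in base R only: wrap `tolower` so that a failing input is returned unchanged instead of stopping the run. The helper name `safe_tolower` and the sample strings are hypothetical, and whether the invalid string actually raises an error depends on the locale.

```r
# Hypothetical helper: lowercase x, or return x untouched if tolower() fails
safe_tolower <- function(x) {
  tryCatch(tolower(x), error = function(e) x)
}

# Hypothetical input containing a lone 0xE9 byte (invalid as UTF-8)
bad <- rawToChar(as.raw(c(0x43, 0x41, 0x46, 0xe9)))

safe_tolower("HELLO")      # "hello"
result <- safe_tolower(bad)  # never stops with an error, whatever the locale
```

A function like this could be passed to tm_map in place of tolower, at the cost of silently leaving some documents in their original case.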
    

    There is an additional example here: http://gastonsanchez.wordpress.com/2012/05/29/catching-errors-when-using-tolower/
