R tm package invalid input in 'utf8towcs'

逝去的感伤 2020-11-29 01:47

I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <         


        
14 Answers
  • 2020-11-29 02:06

    Use the following steps:

    # First, save your documents as .txt files with UTF-8 encoding.
    library(tm)
    # Set your directory, for example "F:/tmp".
    # "/tmp" is your directory. In place of "english" you can use any language R allows.
    dataSet <- Corpus(DirSource("/tmp"), readerControl = list(language = "english"))
    # Wrap base functions in content_transformer() so the corpus structure is kept.
    dataSet <- tm_map(dataSet, content_transformer(tolower))

    inspect(dataSet)
    
  • 2020-11-29 02:07

    I was running this on a Mac and, to my frustration, had to identify the offending record (these were tweets) to resolve the error. Since there is no guarantee that the offending record will be the same next time, I used the following function:

    tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
    

    as suggested above.

    It worked like a charm.
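
One way to locate the offending records, rather than hunting for them by hand, is base R's validUTF8(). This is a minimal sketch, assuming the text is meant to be UTF-8; the `tweets` vector and its contents are made up for illustration.

```r
# Hypothetical sample data: the second element ends in a lone 0xE9 byte
# (Latin-1 e-acute), which is not valid UTF-8 on its own.
bad_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
tweets <- c("plain ascii text", bad_text, "another clean line")

# validUTF8() (base R >= 3.3.0) returns one logical per element.
bad_rows <- which(!validUTF8(tweets))
print(bad_rows)
```

The indices in `bad_rows` point at the records that would likely trip tolower(), so they can be inspected or converted individually.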

  • 2020-11-29 02:07

    I had the same problem on my Mac and solved it with the solution below.

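    raw_data <- read.csv(file.choose(), stringsAsFactors = FALSE, encoding = "UTF-8")

    # Replace bytes that cannot be converted with their hex codes.
    raw_data$textCol <- iconv(raw_data$textCol, "ASCII", "UTF-8", sub = "byte")

    data_corpus <- VCorpus(VectorSource(raw_data$textCol))

    corpus_clean <- tm_map(data_corpus, content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')))

    # Continue from corpus_clean (not data_corpus) so the iconv step is not discarded.
    corpus_clean <- tm_map(corpus_clean, content_transformer(tolower))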
    
  • 2020-11-29 02:15

    Chad's solution wasn't working for me. I had it embedded in a function, and it gave an error about iconv needing a vector as input. So I decided to do the conversion before creating the corpus.

    myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))
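
To illustrate what converting before building the corpus does, here is a small base R sketch. It assumes the input is meant to be UTF-8 and names the encoding explicitly instead of relying on the locale; the sample `myText` values are hypothetical. iconv() is already vectorised, so sapply() is not strictly needed.

```r
# Hypothetical input: the second element carries an invalid byte (0xE9).
bad_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
myText <- c("Hello World", bad_text)

# sub = "byte" replaces each non-convertible byte with its hex code, e.g. <e9>.
myCleanedText <- iconv(myText, from = "UTF-8", to = "UTF-8", sub = "byte")

# After cleaning, every element is valid UTF-8, so tolower() can process it.
stopifnot(all(validUTF8(myCleanedText)))
lowered <- tolower(myCleanedText)
print(lowered)
```

The cleaned vector can then be passed to VectorSource() when creating the corpus.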
    
  • 2020-11-29 02:16

    This is from the tm FAQ:

    It will replace non-convertible bytes in yourCorpus with strings showing their hex codes.

    I hope this helps; it did for me.

    tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
    

    http://tm.r-forge.r-project.org/faq.html

  • 2020-11-29 02:19

    I think it is clear by now that the problem is caused by emojis, which tolower cannot handle.

    # Remove emojis; sub = '' drops non-convertible bytes instead of returning NA
    dataSet <- iconv(dataSet, 'UTF-8', 'ASCII', sub = '')
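
The effect of the `sub` argument when converting to ASCII can be seen in a short sketch on a plain character vector. The sample strings are made up, and this assumes the text lives in a character vector rather than a tm corpus.

```r
# Hypothetical sample: the first string ends with an emoji (U+1F600).
x <- c("good morning \U0001F600", "no emoji here")

# Without sub, any element that cannot be converted becomes NA as a whole.
dropped <- iconv(x, from = "UTF-8", to = "ASCII")

# With sub = "", only the non-convertible bytes are removed.
stripped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(is.na(dropped))
print(stripped)
```

Note that this also strips accented letters and any other non-ASCII characters, so it fits best when the downstream analysis only needs ASCII text.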
    