I'm trying to use the tm package in R to perform some text analysis. I tried the following:
require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
The tolower step fails with an invalid multibyte string error on some of the documents.
Use the following steps:
# First, convert your documents to .txt format with UTF-8 encoding.
library(tm)
# Set your directory, for example "F:/tmp".
dataSet <- Corpus(DirSource("/tmp"), readerControl = list(language = "english")) # "/tmp" is your directory. You can use any language allowed by R in place of English.
dataSet <- tm_map(dataSet, content_transformer(tolower)) # content_transformer keeps the result a corpus
inspect(dataSet)
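Since this answer relies on the files being UTF-8, a minimal sketch (reusing the example "/tmp" directory from above) is to declare that encoding when the source is created, so tm reads the files correctly in the first place:
library(tm)
# Tell DirSource that the .txt files are UTF-8 encoded.
dataSet <- Corpus(DirSource("/tmp", encoding = "UTF-8"),
                  readerControl = list(language = "english"))
dataSet <- tm_map(dataSet, content_transformer(tolower))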
I have been running this on a Mac and, to my frustration, I had to identify the offending record (these were tweets) to resolve the error. Since there is no guarantee the record will be the same next time, I used the following function
tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
as suggested above.
It worked like a charm
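A hedged note: in tm 0.6 and later, passing a plain function to tm_map no longer returns a proper corpus, so the same call is usually wrapped in content_transformer (yourCorpus stands in for an existing Corpus object):
yourCorpus <- tm_map(yourCorpus,
                     content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')))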
I had the same problem on my Mac and solved it with the solution below.
raw_data <- read.csv(file.choose(), stringsAsFactors = F, encoding="UTF-8")
raw_data$textCol<- iconv(raw_data$textCol, "ASCII", "UTF-8", sub="byte")
data_corpus <- VCorpus(VectorSource(raw_data$textCol))
corpus_clean <- tm_map(data_corpus, content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')))
corpus_clean <- tm_map(corpus_clean, content_transformer(tolower)) # continue from corpus_clean, not data_corpus
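As an illustrative continuation (not part of the original answer), the cleaned corpus can then go through the usual tm steps and into a document-term matrix:
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
dtm <- DocumentTermMatrix(corpus_clean)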
Chad's solution wasn't working for me. I had this embedded in a function and it was giving an error about iconv needing a vector as input, so I decided to do the conversion before creating the corpus.
myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))
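For completeness, a small hedged sketch of the end-to-end idea (myText is a stand-in for your raw character vector):
library(tm)
myText <- c("first document", "second document")  # hypothetical input vector
myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))
myCorpus <- VCorpus(VectorSource(myCleanedText))   # build the corpus only after cleaning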
This is from the tm FAQ:
tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
It will replace non-convertible bytes in yourCorpus with strings showing their hex codes.
I hope this helps, for me it does.
http://tm.r-forge.r-project.org/faq.html
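A hedged follow-up to that FAQ trick (yourCorpus is assumed to already exist): once the bad bytes show up as "<xx>" markers, it is easy to find which documents contained them.
# Wrap the FAQ call for tm >= 0.6, then search for the hex markers.
yourCorpus <- tm_map(yourCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
texts <- sapply(yourCorpus, function(d) paste(as.character(d), collapse = " "))
which(grepl("<[0-9a-f]{2}>", texts))  # indices of documents that had non-convertible bytes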
I think it is clear by now that the problem is caused by emojis, which tolower is not able to handle.
# To remove emojis, convert to ASCII; sub = '' drops the non-convertible characters instead of turning the whole string into NA
dataSet <- iconv(dataSet, from = 'UTF-8', to = 'ASCII', sub = '')
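A small hedged example of the effect (the sample strings are made up): once the non-ASCII characters are stripped, tolower runs without complaint.
x <- c("Hello \U0001F600 world", "plain ascii text")
x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
tolower(x)  # the emoji has been removed, so tolower no longer fails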