Question
When loading a bunch of documents with tm's Corpus I need to specify the encoding.
All documents are UTF-8 encoded. If opened in a text editor the content is fine, but the corpus contents are full of strange symbols (indicioâ., ‘sœs....). The source text is in Spanish (es_ES).
library(tm)
cname <- file.path("C:", "Users", "john", "Documents", "texts")
docs <- Corpus(DirSource(cname), encoding = "UTF-8")
> Error in Corpus(DirSource(cname), encoding = "UTF-8") :
unused argument (encoding = "UTF-8")
EDITED:
Looking at str(documents[1]) for the corpus, I noticed:
.. ..$ language : chr "en"
How can I specify, for instance, "UTF-8", "Latin1", or any other encoding to avoid the strange symbols?
Regards
Answer 1:
From the "C:" it's clear you are using Windows, which assumes a Windows-1252 encoding (on most systems) rather than UTF-8. You could try reading the files in as character and then setting Encoding(myCharVector) <- "UTF-8"
. If the input encoding was UTF-8 this should cause your system to recognise and display the UTF-8 characters properly.
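A minimal sketch of that approach in base R (the object names are illustrative, and I'm assuming plain .txt files in the folder from your question):
# read each file in as character data, then declare the bytes to be UTF-8
path <- "C:/Users/john/Documents/texts"
fns  <- list.files(path, pattern = "\\.txt$", full.names = TRUE)
txts <- vapply(fns, function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
               character(1))
Encoding(txts) <- "UTF-8"              # mark the strings as UTF-8
docsTM <- Corpus(VectorSource(txts))   # wrap them in a tm corpus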
Alternatively this will work, although it also makes tm unnecessary:
require(quanteda)
docs <- corpus(textfile("C:/Users/john/Documents/texts/*.txt", encoding = "UTF-8"))
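If textfile() is not available in your quanteda version (it was later moved out of the package), roughly the same result should be obtainable with the companion readtext package; this is an assumption about your installed versions:
require(readtext)
# read all .txt files, declaring them as UTF-8, then build a quanteda corpus
docs <- corpus(readtext("C:/Users/john/Documents/texts/*.txt", encoding = "UTF-8"))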
Then you can see the texts using for example:
cat(texts(docs)[1:2])
They should have the encoding bit set and display properly. Then if you prefer, you can get these into tm using:
docsTM <- Corpus(VectorSource(texts(docs)))
Answer 2:
It seems there is no need to use the quanteda package (which, besides, shows some odd behaviour, losing file names when converting to tm VCorpora):
files <- DirSource(directory = "C:/Users/john/Documents/", encoding = "UTF-8")
mycorpus <- VCorpus(x = files)
Now the encoding is correct.
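As a quick check (assuming the Spanish texts contain accented characters), print the first few lines of the first document:
# the accents (á, é, ñ, ...) should now display properly
writeLines(head(content(mycorpus[[1]])))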
Source: https://stackoverflow.com/questions/37278333/set-encoding-for-reading-text-files-into-tm-corpora