How to properly encode UTF-8 txt files for R topic model

Submitted by 二次信任 on 2020-04-30 09:27:18

Question


Similar issues have been discussed on this forum (e.g. here and here), but I have not found one that solves my problem, so I apologize for asking a seemingly similar question.

I have a set of .txt files with UTF-8 encoding (see the screenshot). I am trying to run a topic model in R using the tm package. However, despite using encoding = "UTF-8" when creating the corpus, I get obvious encoding problems. For instance, I get <U+FB01>scal instead of fiscal and in<U+FB02>uenc instead of influence; not all punctuation is removed; and some letters are unrecognizable (e.g. quotation marks are still there in some cases like view” or plan’, tokens such as ændring or zit appear, orphaned quotation marks like “ and ” remain, and years—thus keeps a dash that should have been removed). These terms also show up in the topic distribution over terms. I had encoding problems before, but specifying encoding = "UTF-8" when creating the corpus used to solve them. It seems that it does not help this time.

I am on Windows 10 x64 with R version 3.6.0 (2019-04-26) and version 0.7-7 of the tm package (all up to date). I would greatly appreciate any advice on how to address the problem.
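As a quick diagnostic (my own sketch, not part of the original question; the file name is hypothetical), it can help to check whether the ligature code points U+FB01/U+FB02 are literally present in the converted text files, which would point to a Unicode-normalization problem rather than a file-encoding problem:

txt <- readLines("c:/txtfiles/sample1.txt", encoding = "UTF-8", warn = FALSE)
any(grepl("\uFB01|\uFB02", txt))   # TRUE means the ligature characters themselves are in the file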

library(tm)
library(beepr)
library(ggplot2)
library(topicmodels)
library(wordcloud)
library(reshape2)
library(dplyr)
library(tidytext)
library(scales)
library(ggthemes)
library(ggrepel)
library(tidyr)


inputdir <- "c:/txtfiles/"
docs <- VCorpus(DirSource(directory = inputdir, encoding = "UTF-8"))

#Preprocessing
docs <- tm_map(docs, content_transformer(tolower))

removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
docs <- tm_map(docs, content_transformer(removeURL))

toSpace <- content_transformer(function(x, pattern) (gsub(pattern, " ", x)))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "\\.")
docs <- tm_map(docs, toSpace, "\\-")


docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)

dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
ord <- order(freq, decreasing=TRUE)
write.csv(freq[ord], file = "word_freq.csv")

#Topic model
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))

Edit: I should add, in case it turns out to be relevant, that the txt files were created from PDFs using the following R code:

inputdir <-"c:/pdf/"
myfiles <- list.files(path = inputdir, pattern = "pdf",  full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Users/Delt/AppData/Local/Programs/MiKTeX 2.9/miktex/bin/x64/pdftotext.exe"',
                                         paste0('"', i, '"')), wait = FALSE) )
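In case the conversion step matters: depending on the build, pdftotext may default to Latin-1 output, so it may be worth forcing UTF-8 explicitly. A hedged variant of the call above (assuming the MiKTeX pdftotext accepts the standard Xpdf/Poppler -enc flag):

lapply(myfiles, function(i) system(paste(
  '"C:/Users/Delt/AppData/Local/Programs/MiKTeX 2.9/miktex/bin/x64/pdftotext.exe"',
  '-enc UTF-8',                     # assumed: standard Xpdf/Poppler option forcing UTF-8 output
  paste0('"', i, '"')), wait = FALSE))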

Two sample txt files can be downloaded here.


Answer 1:


I found a workaround that seems to work correctly on the two example files you supplied. What you need to do first is apply NFKD (compatibility decomposition). This splits the "fi" orthographic ligature into f and i. Luckily, the stringi package can handle this. So before doing all the special text cleaning, you need to apply the function stringi::stri_trans_nfkd. You can do this in the preprocessing step, just after (or before) the tolower step.

Do read the documentation for this function and the references.
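A minimal sketch of what the decomposition does (assuming only that stringi is installed):

library(stringi)
stri_trans_nfkd("\ufb01scal in\ufb02uence")
# [1] "fiscal influence"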

library(tm)
docs <- VCorpus(DirSource(directory = inputdir, encoding = "UTF-8"))

#Preprocessing
docs <- tm_map(docs, content_transformer(tolower))

# use stringi to fix all the orthographic ligature issues
docs <- tm_map(docs, content_transformer(stringi::stri_trans_nfkd))

toSpace <- content_transformer(function(x, pattern) (gsub(pattern, " ", x)))

# add following two lines as well to remove special quotes.
docs <- tm_map(docs, toSpace, "“")
docs <- tm_map(docs, toSpace, "‘")
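# (Added note, not part of the original answer:) the closing curly quotes from
# the question (view”, plan’) can be mapped to spaces in the same way.
docs <- tm_map(docs, toSpace, "”")
docs <- tm_map(docs, toSpace, "’")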

# ... rest of the process as before ...


Source: https://stackoverflow.com/questions/61463661/how-to-properly-encode-utf-8-txt-files-for-r-topic-model
