R: inspect Document Term Matrix results in Error: Repeated indices currently not allowed

倾然丶 夕夏残阳落幕 submitted on 2019-12-11 09:29:52


I have the following dummy data:

final6 <- data.frame(docname = paste0("doc", 1:6),
                  articles = c("Catalonia independence in matter of days",
                               "Anger over Johnson Libya bodies comment",
                               "Man admits frenzied mum and son murder",
                               "The headache that changed my life",
                               "Las Vegas killer sick, demented - Trump",
                               "Instagram baby photo scammer banned"),
                  stringsAsFactors = FALSE)  # keep the text as character, not factor

I want to create a DocumentTermMatrix that keeps the document names, so that I can later link each row back to the original article text. To achieve this, I followed the instructions from this post:

library(tm)

# map the data frame columns to document content and document id
myReader <- readTabular(mapping = list(content = "articles", id = "docname"))
text_corpus <- VCorpus(DataframeSource(final6), readerControl = list(reader = myReader))

# define a function that replaces punctuation with spaces
replacePunctuation <- content_transformer(function(x) gsub("[[:punct:]]", " ", x))

# remove customised words 
myWords <- c("ok", "chat", 'okay', 'day', 'today', "might", "bye", "hello", "thank", "you", "please", "sorry", "hello", "hi")

# clean text 
cleantext <- function(corpus){
  clean_corpus <- tm_map(corpus, removeNumbers)
  clean_corpus <- tm_map(clean_corpus, tolower)
  clean_corpus <- tm_map(clean_corpus, PlainTextDocument)
  clean_corpus <- tm_map(clean_corpus, replacePunctuation)
  clean_corpus <- tm_map(clean_corpus, removePunctuation)
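  # note: top_names is not defined in this post; it is presumably a character vector created elsewhere
  clean_corpus <- tm_map(clean_corpus, removeWords, c(stopwords("english"), myWords, top_names))
  clean_corpus <- tm_map(clean_corpus, stripWhitespace)
  clean_corpus <- tm_map(clean_corpus, stemDocument, language = "english")
  return(clean_corpus)
}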

clean_corpus <- cleantext(text_corpus) 

# create dtm
chat_DTM <- DocumentTermMatrix(clean_corpus, control = list(wordLengths = c(3, Inf)))

Now, when I want to inspect the matrix, I get the error:


Error in `[.simple_triplet_matrix`(x, docs, terms) : Repeated indices currently not allowed.

To be fair, this error occurs even if I create a corpus based on text only and without passing doc id as an attribute. Any ideas what causes the problem?


The problem was the PlainTextDocument function, which strips the metadata (including the document id) from the corpus. If you modify the cleantext function as follows, the resulting DTM can be inspected without any errors:

cleantext <- function(corpus){
  clean_corpus <- tm_map(corpus, removeNumbers)
  # modified: tolower is wrapped in content_transformer()
  clean_corpus <- tm_map(clean_corpus, content_transformer(tolower))
  # removed: PlainTextDocument erases the corpus metadata, including the document id
  # clean_corpus <- tm_map(clean_corpus, PlainTextDocument)
  clean_corpus <- tm_map(clean_corpus, replacePunctuation)
  clean_corpus <- tm_map(clean_corpus, removePunctuation)
  clean_corpus <- tm_map(clean_corpus, removeWords, c(stopwords("english"), myWords, top_names))
  clean_corpus <- tm_map(clean_corpus, stripWhitespace)
  clean_corpus <- tm_map(clean_corpus, stemDocument, language = "english")
  return(clean_corpus)
}

clean_corpus <- cleantext(text_corpus)

chat_DTM2 <- DocumentTermMatrix(clean_corpus)
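
If the error message is taken at face value, a quick way to confirm the diagnosis is to look for document ids that occur more than once. The sketch below is only an illustration: repeated_ids is a hypothetical helper (not part of tm), and the id vectors are made-up values rather than output from the corpus above.

```r
# Hypothetical helper (not part of tm): return the ids that occur more than once.
repeated_ids <- function(ids) {
  ids <- as.character(ids)
  unique(ids[duplicated(ids)])
}

# Made-up ids for illustration: the first vector contains a repeat,
# the second mirrors the docname column of final6.
broken_ids <- c("x", "x", "y")
good_ids <- paste0("doc", 1:6)

repeated_ids(broken_ids)  # "x"
repeated_ids(good_ids)    # character(0), i.e. no repeats
```

With tm loaded, calling repeated_ids(Docs(chat_DTM2)) should return an empty vector if every document kept its own id.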

The answer was inspired by this solution. Thanks!


You might get a similar error if you create a directory source with DirSource(recursive=TRUE, ...) and two or more files in different directories have the same name.

In this case, a workaround is:

ds   <- DirSource(".", recursive = TRUE)
ovid <- VCorpus(ds)
# name each document by its full path, which is unique even when file names repeat
names(ovid) <- ds$filelist
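
To see why full paths help, here is a small base R sketch. It assumes, as the workaround implies, that the default document names come from the file's base name. The file list is hypothetical and stands in for ds$filelist.

```r
# Hypothetical file list, standing in for ds$filelist
filelist <- c("./book1/chapter1.txt",
              "./book2/chapter1.txt",
              "./book2/chapter2.txt")

# Base names collide when two directories contain a file with the same name
base_ids <- basename(filelist)
anyDuplicated(base_ids) > 0   # TRUE

# Full paths do not collide
anyDuplicated(filelist) > 0   # FALSE

# Alternative: keep the short names but disambiguate the repeats
short_ids <- make.unique(base_ids)
```

Either filelist or short_ids could then be assigned with names(ovid) <- ...; the full paths have the advantage of pointing straight back to the source file.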

