Question
Based on the question More efficient means of creating a corpus and DTM, I've prepared my own method for building a Term Document Matrix from a large corpus, which (I hope) does not require Terms x Documents memory.
# Requires dplyr (for the group_by/tally pipeline), slam (simple_triplet_matrix)
# and tm (as.TermDocumentMatrix, weightTf):
library(dplyr)
library(slam)
library(tm)

sparseTDM <- function(vc){
  # extract document ids and raw text from the tm VCorpus
  id = unlist(lapply(vc, function(x){x$meta$id}))
  content = unlist(lapply(vc, function(x){x$content}))
  # tokenise on whitespace
  out = strsplit(content, "\\s", perl = T)
  names(out) = id
  lev.terms = sort(unique(unlist(out)))
  lev.docs = id
  # v1: for each document, the sorted indices of its tokens in lev.terms
  v1 = lapply(
    out,
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = lev.terms
  )
  # v2: the matching document index, repeated once per token
  v2 = lapply(
    seq_along(v1),
    function(i, x){
      rep(i, length(x[[i]]))
    },
    x = v1
  )
  # count (term, document) pairs
  stm = data.frame(i = unlist(v1), j = unlist(v2)) %>%
    group_by(i, j) %>%
    tally() %>%
    ungroup()
  tmp = simple_triplet_matrix(
    i = stm$i,
    j = stm$j,
    v = stm$n,
    nrow = length(lev.terms),
    ncol = length(lev.docs),
    dimnames = list(Terms = lev.terms, Docs = lev.docs)
  )
  as.TermDocumentMatrix(tmp, weighting = weightTf)
}
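For reference, a minimal usage sketch (mine, not from the question), assuming plain whitespace-separated text wrapped in a tm VCorpus:

docs <- c("the cat sat on the mat", "the dog sat")
vc <- VCorpus(VectorSource(docs))   # tm assigns document ids automatically
tdm <- sparseTDM(vc)
inspect(tdm)                        # terms x docs, term-frequency weighted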
It slows down at the calculation of v1: it had been running for 30 minutes when I stopped it.
I've prepared a small example:
library(microbenchmark)

b = paste0("string", 1:200000)
a = sample(b, 80)
microbenchmark(
  lapply(
    list(a = a),
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = b
  )
)
Results are:
Unit: milliseconds
expr min lq mean median uq max neval
... 25.80961 28.79981 31.59974 30.79836 33.02461 98.02512 100
id and content have 126,522 elements and lev.terms has 155,591 elements, so it seems I stopped the processing too early: at roughly 31 ms per document, 126,522 documents would take about 65 minutes. Since ultimately I'll be working on ~6M documents, I need to ask: is there any way to speed up this fragment of code?
Answer 1:
For now I've sped it up by replacing
sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
with
ind = which(lev %in% x)
cnt = as.integer(factor(x, levels = lev[ind], ordered = TRUE))
sort(ind[cnt])
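The gain presumably comes from building the ordered factor over only the levels actually present in x (lev[ind]) rather than over all of lev, with ind mapping those positions back into the full level set. A quick sketch (my addition), reusing the question's example data, to check that the two expressions agree:

b = paste0("string", 1:200000)
a = sample(b, 80)

slow = sort(as.integer(factor(a, levels = b, ordered = TRUE)))

ind = which(b %in% a)
cnt = as.integer(factor(a, levels = b[ind], ordered = TRUE))
fast = sort(ind[cnt])

identical(slow, fast)  # TRUE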
Now the timings are:
expr min lq mean median uq max neval
... 5.248479 6.202161 6.892609 6.501382 7.313061 10.17205 100
Answer 2:
I went through many iterations of solving this problem when creating quanteda::dfm() (see the GitHub repo here), and the fastest solution by far involves using the data.table and Matrix packages to index the documents and tokenised features, count the features within documents, and plug the result straight into a sparse matrix, like this:
require(data.table)
require(Matrix)

dfm_quanteda <- function(x) {
  docIndex <- 1:length(x)
  if (is.null(names(x))) {
    names(docIndex) <- factor(paste("text", 1:length(x), sep = ""))
  } else {
    names(docIndex) <- names(x)
  }

  # one row per token: the document it came from, and the token itself
  alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                          features = unlist(x, use.names = FALSE))
  alltokens <- alltokens[features != ""]  # drop any "blank" features

  # count each feature within each document
  alltokens[, "n" := 1L]
  alltokens <- alltokens[, by = list(docIndex, features), sum(n)]

  # index the unique (sorted) features
  uniqueFeatures <- sort(unique(alltokens$features))
  featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                             features = uniqueFeatures)

  # join the counts to the feature index
  setkey(alltokens, features)
  setkey(featureTable, features)
  alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
  alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]

  # plug the (document, feature, count) triplets straight into a sparse matrix
  sparseMatrix(i = alltokens$docIndex,
               j = alltokens$featureIndex,
               x = alltokens$V1,
               dimnames = list(docs = names(docIndex), features = uniqueFeatures))
}
require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
## user system elapsed
## 0.060 0.005 0.064
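If, as in the question, a tm-style TermDocumentMatrix is the desired end product, the dgCMatrix returned above can be converted through its triplet representation. A sketch (my own addition, assuming the slam and tm packages are installed):

m <- dfm_quanteda(tokenizedTexts)   # docs x features
tm2 <- Matrix::t(m)                 # transpose to terms x docs
trip <- as(tm2, "TsparseMatrix")    # triplet form (0-based slots)
stm <- slam::simple_triplet_matrix(
  i = trip@i + 1L, j = trip@j + 1L, v = trip@x,
  nrow = nrow(tm2), ncol = ncol(tm2), dimnames = dimnames(tm2))
tdm <- tm::as.TermDocumentMatrix(stm, weighting = tm::weightTf)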
That's just a snippet of course, but the full source code is easily found on the GitHub repo (dfm-main.R).
I also encourage you to try the full dfm() from the package on your texts to see how it performs. You can install it from CRAN, or get the development version using:
devtools::install_github("kbenoit/quanteda")
Answer 3:
Have you tried experimenting with the sort method (algorithm), specifying shellsort or quicksort? Something like:
sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "shell")
or:
sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "quick")
Also, you might try using some intermediate variables to evaluate the nested functions, in case the sort algorithm is re-executing these steps again and again:
foo <- factor(x, levels = lev, ordered = TRUE)
bar <- as.integer(foo)
sort(bar, method = "quick")
or
sort(bar)
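For what it's worth, a sketch (my addition) for timing the sort step alone on the question's example data, which can help confirm whether sort(), rather than factor(), is actually the bottleneck:

library(microbenchmark)
b = paste0("string", 1:200000)
a = sample(b, 80)
bar = as.integer(factor(a, levels = b, ordered = TRUE))
microbenchmark(
  default = sort(bar),
  shell   = sort(bar, method = "shell"),
  radix   = sort(bar, method = "radix")  # available in R >= 3.3.0
)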
Good luck!
Source: https://stackoverflow.com/questions/29463464/r-slowly-working-lapply-with-sort-on-ordered-factor