Question
Based on the question More efficient means of creating a corpus and DTM, I've prepared my own method for building a Term Document Matrix from a large corpus, which (I hope) does not require Terms x Documents memory.
# Requires dplyr (for the group_by/tally pipeline), slam (simple_triplet_matrix)
# and tm (as.TermDocumentMatrix, weightTf):
library(dplyr)
library(slam)
library(tm)

sparseTDM <- function(vc){
  # extract document ids and raw text from the tm VCorpus
  id = unlist(lapply(vc, function(x){x$meta$id}))
  content = unlist(lapply(vc, function(x){x$content}))
  # tokenise on whitespace
  out = strsplit(content, "\\s", perl = T)
  names(out) = id
  lev.terms = sort(unique(unlist(out)))
  lev.docs = id
  # v1: for each document, the sorted indices of its tokens in lev.terms
  v1 = lapply(
    out,
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = lev.terms
  )
  # v2: the matching document index, repeated once per token
  v2 = lapply(
    seq_along(v1),
    function(i, x){
      rep(i, length(x[[i]]))
    },
    x = v1
  )
  # count (term, document) pairs
  stm = data.frame(i = unlist(v1), j = unlist(v2)) %>%
    group_by(i, j) %>%
    tally() %>%
    ungroup()
  tmp = simple_triplet_matrix(
    i = stm$i,
    j = stm$j,
    v = stm$n,
    nrow = length(lev.terms),
    ncol = length(lev.docs),
    dimnames = list(Terms = lev.terms, Docs = lev.docs)
  )
  as.TermDocumentMatrix(tmp, weighting = weightTf)
}
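For reference, a minimal usage sketch (mine, not from the question), assuming plain whitespace-separated text wrapped in a tm VCorpus:

docs <- c("the cat sat on the mat", "the dog sat")
vc <- VCorpus(VectorSource(docs))   # tm assigns document ids automatically
tdm <- sparseTDM(vc)
inspect(tdm)                        # terms x docs, term-frequency weighted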
It slows down at the calculation of v1: it had been running for 30 minutes when I stopped it.
I've prepared a small example:
library(microbenchmark)

b = paste0("string", 1:200000)
a = sample(b, 80)
microbenchmark(
  lapply(
    list(a = a),
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = b
  )
)
Results are:
Unit: milliseconds
expr min lq mean median uq max neval
... 25.80961 28.79981 31.59974 30.79836 33.02461 98.02512 100
id and content have 126,522 elements and lev.terms has 155,591 elements, so it seems I stopped the processing too early: at roughly 31 ms per document, 126,522 documents would take about 65 minutes. Since ultimately I'll be working on ~6M documents, I need to ask: is there any way to speed up this fragment of code?
Answer 1:
For now I've sped it up by replacing
sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
with
ind = which(lev %in% x)
cnt = as.integer(factor(x, levels = lev[ind], ordered = TRUE))
sort(ind[cnt])
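The gain presumably comes from building the ordered factor over only the levels actually present in x (lev[ind]) rather than over all of lev, with ind mapping those positions back into the full level set. A quick sketch (my addition), reusing the question's example data, to check that the two expressions agree:

b = paste0("string", 1:200000)
a = sample(b, 80)

slow = sort(as.integer(factor(a, levels = b, ordered = TRUE)))

ind = which(b %in% a)
cnt = as.integer(factor(a, levels = b[ind], ordered = TRUE))
fast = sort(ind[cnt])

identical(slow, fast)  # TRUE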
Now the timings are:
expr min lq mean median uq max neval
... 5.248479 6.202161 6.892609 6.501382 7.313061 10.17205 100
Answer 2:
I went through many iterations of solving this problem when creating quanteda::dfm() (see the GitHub repo here), and the fastest solution by far involves using the data.table and Matrix packages to index the documents and tokenised features, count the features within documents, and plug the result straight into a sparse matrix, like this:
require(data.table)
require(Matrix)

dfm_quanteda <- function(x) {
  docIndex <- 1:length(x)
  if (is.null(names(x))) {
    names(docIndex) <- factor(paste("text", 1:length(x), sep = ""))
  } else {
    names(docIndex) <- names(x)
  }

  # one row per token: the document it came from, and the token itself
  alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                          features = unlist(x, use.names = FALSE))
  alltokens <- alltokens[features != ""]  # drop any "blank" features

  # count each feature within each document
  alltokens[, "n" := 1L]
  alltokens <- alltokens[, by = list(docIndex, features), sum(n)]

  # index the unique (sorted) features
  uniqueFeatures <- sort(unique(alltokens$features))
  featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                             features = uniqueFeatures)

  # join the counts to the feature index
  setkey(alltokens, features)
  setkey(featureTable, features)
  alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
  alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]

  # plug the (document, feature, count) triplets straight into a sparse matrix
  sparseMatrix(i = alltokens$docIndex,
               j = alltokens$featureIndex,
               x = alltokens$V1,
               dimnames = list(docs = names(docIndex), features = uniqueFeatures))
}
require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
## user system elapsed
## 0.060 0.005 0.064
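If, as in the question, a tm-style TermDocumentMatrix is the desired end product, the dgCMatrix returned above can be converted through its triplet representation. A sketch (my own addition, assuming the slam and tm packages are installed):

m <- dfm_quanteda(tokenizedTexts)   # docs x features
tm2 <- Matrix::t(m)                 # transpose to terms x docs
trip <- as(tm2, "TsparseMatrix")    # triplet form (0-based slots)
stm <- slam::simple_triplet_matrix(
  i = trip@i + 1L, j = trip@j + 1L, v = trip@x,
  nrow = nrow(tm2), ncol = ncol(tm2), dimnames = dimnames(tm2))
tdm <- tm::as.TermDocumentMatrix(stm, weighting = tm::weightTf)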
That's just a snippet of course, but the full source code is easily found on the GitHub repo (dfm-main.R).
I also encourage you to try the full dfm() from the package on your texts to see how it performs. You can install it from CRAN, or get the development version using:
devtools::install_github("kbenoit/quanteda")
Answer 3:
Have you tried experimenting with the sort method (algorithm), specifying shellsort or quicksort? Something like:
sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "shell")
or:
sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "quick")
Also, you might try using some intermediate variables to evaluate the nested functions, in case the sort algorithm is re-executing these steps again and again:
foo <- factor(x, levels = lev, ordered = TRUE)
bar <- as.integer(foo)
sort(bar, method = "quick")
or
sort(bar)
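For what it's worth, a sketch (my addition) for timing the sort step alone on the question's example data, which can help confirm whether sort(), rather than factor(), is actually the bottleneck:

library(microbenchmark)
b = paste0("string", 1:200000)
a = sample(b, 80)
bar = as.integer(factor(a, levels = b, ordered = TRUE))
microbenchmark(
  default = sort(bar),
  shell   = sort(bar, method = "shell"),
  radix   = sort(bar, method = "radix")  # available in R >= 3.3.0
)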
Good luck!
Source: https://stackoverflow.com/questions/29463464/r-slowly-working-lapply-with-sort-on-ordered-factor