Question
I have a long character vector (strings such as "Hello World"), 1.7M elements, and I need to substitute words in them using a map between two vectors, and save the result in the same vector. Here's a simple example:
library(qdap)
line = c("one", "two one", "four phones")
e = c("one", "two")
r = c("ONE", "TWO")
line = mgsub(e,r,line)
Result:
[1] "ONE" "TWO ONE" "four phONEs"
As you can see, each instance of e[j] in line gets substituted with r[j], and only r[j].
It works fine when line and the e -> r vocabulary are relatively small, but when I run it with length(line) = 1700000 and length(e) = 750, I reach the total allocated memory:
Reached total allocation of 7851Mb: see help(memory.size)
Any ideas how to avoid it?
Answer 1:
I believe you can use fixed = TRUE.
It sounds like your concern is matching whole words delimited by spaces, so just add spaces to the ends of all three vectors you're working with. Running the whole sequence from ## Start to ## Finish on 1.7 million strings (roughly the size of your data) takes Time difference of 2.906395 secs. Most of that time is spent at the end, stripping off the extra spaces.
## Recreate data
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")
line <- rep(line, 1700000/length(line))
## Start
line2 <- paste0(" ", line, " ")
e2 <- paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")
for (i in seq_along(e2)) {
line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}
## Strip the padding spaces and keep the result instead of printing it
line2 <- gsub("^\\s|\\s$", "", line2, perl=TRUE)
## Finish
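To make the padding trick easier to check, here is a minimal sketch on the five sample strings only, using base R. The helper name pad and the use of substr to drop the padding are assumptions for illustration, not part of the answer above. The last lines show a limitation of this approach that is worth knowing about: two adjacent matches share one space, so only the first is replaced.

```r
# Small-scale check of the space-padding approach (base R only).
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

# Hypothetical helper: surround each string with a single space.
pad <- function(x) paste0(" ", x, " ")

padded <- pad(line)
for (i in seq_along(e)) {
  # Literal (non-regex) replacement of " word " by " WORD ".
  padded <- gsub(pad(e[i]), pad(r[i]), padded, fixed = TRUE)
}

# Drop the one padding character on each side.
result <- substr(padded, 2, nchar(padded) - 1)
print(result)

# Limitation: in " one one " the middle space is consumed by the first
# match, so the second "one" is left unchanged.
repeated <- gsub(" one ", " ONE ", " one one ", fixed = TRUE)
print(repeated)
```

Words followed by punctuation (for example "one," or "one.") are also not matched by this approach, because the pattern requires a space on both sides.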
Here qdap's mgsub is not useful; the package was designed for much smaller data. Additionally, fixed = TRUE is a sensible default because it is so much faster. The point of an add-on package is to improve workflow (sometimes field- or task-specific) through a reconfiguration of available tools. The mgsub function also has some error handling and other niceties that are useful in the analysis of transcripts, and these make the function hog memory. There's often a trade-off between safety plus syntactic sugar and speed.
Note that two functions being named similarly should not imply anything, particularly if they are found in add-on packages. Even functions within base R have differently named and differently behaving defaults (look at the apply family of functions; this is less than ideal but is part of the historical evolution of R). It is incumbent upon you as a user to read the documentation rather than make assumptions.
Answer 2:
The stringi package provides fast, consistent tools for lots of string manipulation tasks:
library(stringi)
stri_replace_all_regex(line, paste0("\\b", e, "\\b"), r, vectorize_all = FALSE)
This is darn near as fast as the other method (fractions of a second different) and more straightforward.
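As a small illustration of the call above, the sketch below runs it on the five sample strings from the first answer. It assumes the stringi package is installed and skips the call otherwise; the variable name out is an assumption. Setting vectorize_all = FALSE makes every pattern apply to every string in turn, rather than pairing patterns and strings element by element.

```r
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

# Guarded so the script still runs where stringi is not installed.
if (requireNamespace("stringi", quietly = TRUE)) {
  # \\b marks a word boundary, so "phones" and "capsule" are left alone.
  out <- stringi::stri_replace_all_regex(
    line,
    paste0("\\b", e, "\\b"),
    r,
    vectorize_all = FALSE
  )
  print(out)
}
```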
Answer 3:
Update to the problem (to admins: if this doesn't deserve a separate answer, please merge it with the original one). The reason mgsub ran so fast compared to a simple for loop was that in mgsub the parameter fixed is TRUE by default, while in gsub it is FALSE by default! I just discovered it.
I'd like to clarify again that fixed=TRUE is not appropriate for me, as I do not want to replace caps in capsule, but only the whole word caps. That is, I am forced to paste \\b onto both ends of each pattern. Here are three snippets from my code (I tested fixed=TRUE in gsub just to see the time difference; I am not going to use it).
#This is with mgsub. Now with fixed = FALSE!!
i = mgsub(paste("\\b",orig,"\\b",sep=""),change,i,fixed=FALSE)
#This is with a for loop. fixed=TRUE in one of lines is for test purposes only. Do not use
for(k in seq_along(orig)) {
i = gsub(paste("\\b",orig[k],"\\b",sep=""),change[k],i)
#i = gsub(orig[k],change[k],i,fixed=TRUE)
}
Here are the times and memory usage (M) for all three cases at different input sizes N:
N    | mgsub, fixed=F     | gsub, fixed=F      | gsub, fixed=T
----------------------------------------------------------------
100k | 41 sec, M > 2.3GB  | 37 sec, M > 0.9GB  | 9 sec, M > 0.8GB
200k | 99 sec, M > 4GB    | 74 sec, M > 1.1GB  | 18 sec, M > 1.3GB
300k | 132 sec, M > 5.6GB | 112 sec, M > 2.6GB | 28 sec, M > 1.6GB
+ disk involved
Thus, I conclude that for my application, where fixed must be FALSE, there's no advantage to using mgsub. In fact, the for loop is faster and does not cause memory overflow!
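For reference, the for loop can be wrapped in a small function and compared against fixed=TRUE on a few strings. This is only a sketch: the function name replace_whole_words and the sample data are assumptions, and the entries of orig are treated as regular expressions, so any regex metacharacters in them would need escaping first.

```r
# Hypothetical helper wrapping the for loop; base R only.
replace_whole_words <- function(x, orig, change) {
  stopifnot(length(orig) == length(change))
  for (k in seq_along(orig)) {
    # \\b on both sides restricts the match to whole words.
    x <- gsub(paste0("\\b", orig[k], "\\b"), change[k], x)
  }
  x
}

x <- c("and a capsule", "But here's a caps key", "four phones", "two one")
orig <- c("one", "two", "caps")
change <- c("ONE", "TWO", "CAPS")

whole <- replace_whole_words(x, orig, change)
print(whole)

# Same loop with fixed = TRUE, for comparison only: substrings inside
# longer words ("capsule", "phones") are replaced as well.
fixed_out <- x
for (k in seq_along(orig)) {
  fixed_out <- gsub(orig[k], change[k], fixed_out, fixed = TRUE)
}
print(fixed_out)
```

Each pass of the loop overwrites x with the result of the previous pass, so each replacement works on the output of the one before it.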
Thanks to all involved. I wish I could give the commenters credit, but I don't know how to do it in "Comments".
Source: https://stackoverflow.com/questions/27367914/r-how-to-prevent-memory-overflow-when-using-mgsub-in-vector-mode