R: How to prevent memory overflow when using mgsub in vector mode?

Posted by 落爺英雄遲暮 on 2019-12-02 03:20:17

Question


I have a long vector of character strings (e.g. "Hello World", etc.), 1.7M rows, and I need to substitute words in them using a map between two vectors, and save the result in the same vector. Here's a simple example:

library(qdap)
line = c("one", "two one", "four phones")
e = c("one", "two")
r = c("ONE", "TWO")
line = mgsub(e,r,line)

Result:

[1] "ONE"  "TWO ONE" "four phONEs"

As you can see, each instance of e[j] in line gets substituted with r[j] and only r[j]. It works fine on a relatively small "line" and e->r vocabulary length, but when I run on length(line) = 1700000 and length(e) = 750, I reach the total allocated memory:

Reached total allocation of 7851Mb: see help(memory.size)

Any ideas how to avoid it?


Answer 1:


I believe you can use fixed = TRUE.

It sounds like your concern is matching whole words only, so just pad all 3 vectors you're working with by adding a space at each end. Running this whole sequence from ## Start to ## Finish on 1.7 million strings (roughly the size of your data) took 2.906395 secs. The majority of the time is spent at the end, stripping off the extra spaces.

## Recreate data
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

line <- rep(line, 1700000/length(line))

## Start    
line2 <- paste0(" ", line, " ")
e2 <-  paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")


for (i in seq_along(e2)) {
    line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}

line2 <- gsub("^\\s|\\s$", "", line2, perl=TRUE)
## Finish
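
For reuse, the padding loop above can be wrapped in a small function. This is only a sketch: `replace_padded` is a made-up name (it is not part of qdap or base R), and it assumes words are separated by single spaces. The last call shows two cases where space padding behaves differently from a true word-boundary match.

```r
# Hypothetical helper wrapping the space-padding loop; not part of any package.
replace_padded <- function(x, from, to) {
  stopifnot(length(from) == length(to))
  x <- paste0(" ", x, " ")
  for (i in seq_along(from)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(paste0(" ", from[i], " "), paste0(" ", to[i], " "), x, fixed = TRUE)
  }
  # Strip the one padding space added at each end
  gsub("^\\s|\\s$", "", x, perl = TRUE)
}

e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

replace_padded(c("two one", "four phones", "a caps key"), e, r)
# [1] "TWO ONE"     "four phones" "a CAPS key"

# Caveats of space padding: a repeated neighbouring word shares its
# separating space with the previous match, and punctuation hides a word.
replace_padded(c("one one", "one, two"), e, r)
# [1] "ONE one"  "one, TWO"
```

If your text contains punctuation or repeated adjacent words, that is a reason to prefer a \\b-based regex even though it is slower.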

qdap's mgsub is not useful here; the package was designed for much smaller data. In mgsub, fixed = TRUE is a sensible default because it is so much faster. The point of an add-on package is to improve workflow (sometimes field- or task-specific) by reconfiguring available tools. The mgsub function also has some error handling and other niceties that are useful when analyzing transcripts, and these make the function hog memory. There is often a trade-off between safety plus syntactic sugar on one side and speed on the other.

Note that two functions having similar names should not imply anything about their behavior, particularly if they come from add-on packages. Even functions within base R have inconsistently named and differently behaving defaults (look at the apply family of functions; this is less than ideal but is part of the historical evolution of R). It is incumbent upon you as a user to read the documentation rather than make assumptions.




Answer 2:


The stringi package provides fast consistent tools for lots of string manipulation stuff:

stri_replace_all_regex(line, paste0("\\b", e, "\\b"), r, vectorize_all = FALSE)

Darn near as fast (fractions of a second different) as the other method and more straightforward.
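
Here is the same call as a self-contained snippet on the small example data from the first answer, assuming the stringi package is installed. The `patterns` and `out` variable names are just for illustration. Unlike fixed matching, the \\b word boundaries leave "phones" and "capsule" untouched without any space padding.

```r
library(stringi)

line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

# Wrap each word in \b so only whole words match
patterns <- paste0("\\b", e, "\\b")

# vectorize_all = FALSE applies every pattern to every string,
# instead of pairing pattern i with string i
out <- stri_replace_all_regex(line, patterns, r, vectorize_all = FALSE)
out
# [1] "ONE"                   "TWO ONE"               "four phones"
# [4] "and a capsule"         "But here's a CAPS key"
```

Note that the words in e are used as regular expressions here, so any entry containing regex metacharacters would need escaping first.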




Answer 3:


Update to the problem (to Admins: if it doesn't deserve a separate answer, please merge it with the original one). The reason mgsub ran so fast compared to a simple for loop was that mgsub sets fixed = TRUE by default, while gsub defaults to fixed = FALSE! I just discovered it. I'd like to clarify again that fixed=TRUE is not appropriate for me, as I do not want to replace caps in capsule, but only the whole word caps. I.e. I am forced to wrap each pattern in \\b word boundaries. Here are the snippets from my code (I tested fixed=TRUE in gsub just to see the time difference; I am not going to use it).

#This is with mgsub. Now with fixed = FALSE!!
i = mgsub(paste("\\b",orig,"\\b",sep=""),change,i,fixed=FALSE)

#This is with a for loop. The commented-out fixed=TRUE line is for timing tests only. Do not use it
for(k in seq_along(orig)) {
  i = gsub(paste("\\b",orig[k],"\\b",sep=""),change[k],i)
  #i = gsub(orig[k],change[k],i,fixed=TRUE)
}

Here are the times and memory usage for all three cases for different input sizes:

N    | mgsub, fixed=F                       | gsub, fixed=F      | gsub, fixed=T
---------------------------------------------------------------------------------
100k | 41 sec, M > 2.3GB                    | 37 sec, M > 0.9GB  | 9 sec, M > 0.8GB
200k | 99 sec, M > 4GB                      | 74 sec, M > 1.1GB  | 18 sec, M > 1.3GB
300k | 132 sec, M > 5.6GB (+ disk involved) | 112 sec, M > 2.6GB | 28 sec, M > 1.6GB

Thus, I conclude that for my application, where fixed must be FALSE, there's no advantage to using mgsub. In fact, the for loop is faster and does not cause memory overflow!
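
If the for loop itself ever gets close to the memory limit, one thing that could be tried is running it over slices of the vector. This is only a sketch and I have not measured it on the 1.7M-row data: `replace_whole_words` and `chunk_size` are made-up names, the idea that chunking lowers peak memory is an assumption, and perl = TRUE is used on the assumption that the PCRE engine handles these \\b patterns at least as fast as the default one.

```r
# Hypothetical helper: whole-word replacement with gsub, one slice at a time.
replace_whole_words <- function(x, from, to, chunk_size = 100000L) {
  stopifnot(length(from) == length(to), chunk_size >= 1L)
  if (length(x) == 0L) return(x)

  patterns <- paste0("\\b", from, "\\b")
  starts <- seq(1L, length(x), by = chunk_size)

  for (s in starts) {
    idx <- s:min(s + chunk_size - 1L, length(x))
    chunk <- x[idx]
    # Same loop as above, but gsub only ever sees one slice
    for (k in seq_along(patterns)) {
      chunk <- gsub(patterns[k], to[k], chunk, perl = TRUE)
    }
    x[idx] <- chunk
  }
  x
}

orig <- c("one", "two", "caps")
change <- c("ONE", "TWO", "CAPS")
i <- c("one", "two one", "four phones", "caps in capsule")

# chunk_size = 2 only to exercise the slicing on this tiny example
replace_whole_words(i, orig, change, chunk_size = 2L)
# [1] "ONE"             "TWO ONE"         "four phones"     "CAPS in capsule"
```

The result is identical to the plain for loop; only the amount of data passed to each gsub call changes.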

Thanks to all involved. I wish I could give the commenters credit, but I don't know how to do that in "Comments".



Source: https://stackoverflow.com/questions/27367914/r-how-to-prevent-memory-overflow-when-using-mgsub-in-vector-mode
