Question
In Python I can do this:
import numpy as np
a = np.arange(100)
print(id(a)) # shows some number
a[:] = np.cumsum(a)
print(id(a)) # shows the same number
What I did here was replace the contents of a with its cumsum. The address before and after is the same.
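As a side note, here is a minimal sketch (assuming NumPy is installed) of what the Python version does with memory. The buffer of a is reused, but np.cumsum(a) on the right-hand side still builds a temporary array before it is copied in. Passing out=a asks NumPy to write directly into the existing buffer instead. The variable names data_ptr, temp_is_separate and same_buffer are only for illustration.

```python
import numpy as np

a = np.arange(100, dtype=np.int64)
data_ptr = a.__array_interface__["data"][0]  # address of the underlying buffer

# a[:] = np.cumsum(a) keeps a's buffer, but np.cumsum(a) first builds
# a temporary array, which is then copied into a.
tmp = np.cumsum(a)
temp_is_separate = not np.shares_memory(tmp, a)

# Passing out=a writes the cumulative sum straight into a's own buffer.
np.cumsum(a, out=a)
same_buffer = a.__array_interface__["data"][0] == data_ptr
```

So the unchanged id(a) only shows that the array object and its buffer were reused, not that no allocation happened along the way.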
Now let's try it in R:
install.packages('pryr')
library(pryr)
a = 0:99
print(address(a)) # shows some number
a[1:length(a)] = cumsum(a)
print(address(a)) # shows a different number!
The question is: how can I overwrite already-allocated memory in R with the results of computations? The lack of this sort of thing seems to be causing significant performance discrepancies when I do vector operations in R vs. Rcpp (writing code in C++ and calling it from R, which lets me avoid unnecessary allocations).
I'm using R 3.1.1 on Ubuntu Linux 10.04 with 24 physical cores and 128 GB of RAM.
Answer 1:
I did this:
> x = 1:5
> .Internal(inspect(x))
@3acfed60 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
> x[] = cumsum(x)
> .Internal(inspect(x))
@3acfed60 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,3,6,10,15
where the @3acfed60 is the (shared) memory address. The key is NAM(1), which says that there's a single reference to x, hence no need to re-allocate on update.
R currently uses (I think this will change in the next release) a version of reference counting in which an R object is recorded as referenced 0, 1, or more than 1 times. Once an object has been referenced more than once, its reference count can't be decremented: 'more than one' could mean 2 or 3, so there is no way to distinguish one less than 2 from one less than 3. Any attempt to modify such an object needs to duplicate it.
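The bookkeeping described above can be illustrated with a toy model. This is only a sketch in Python, not R's actual implementation: ToyVector, bind and assign_all are hypothetical names invented here, and the model assumes nothing more than a counter that saturates at 2 and is never decremented.

```python
from itertools import accumulate


class ToyVector:
    """Hypothetical stand-in for an R vector carrying a NAMED-style field."""

    def __init__(self, values):
        self.values = list(values)
        self.named = 0  # 0, 1, or 2, where 2 means "more than one"

    def bind(self):
        # Record one more reference; the count saturates at 2 and never goes down.
        self.named = min(self.named + 1, 2)
        return self


def assign_all(vec, new_values):
    """Toy version of x[] <- new_values: reuse storage only when NAMED < 2."""
    if vec.named < 2:
        vec.values[:] = new_values  # single reference: overwrite in place
        return vec
    fresh = ToyVector(new_values)   # possibly shared: duplicate instead
    fresh.named = 1
    return fresh


x = ToyVector(range(1, 6)).bind()               # like x = 1:5, NAMED is 1
y = assign_all(x, list(accumulate(x.values)))   # updated in place
updated_in_place = y is x

x.bind()                                        # second reference, e.g. a function argument
z = assign_all(x, list(accumulate(x.values)))   # NAMED is 2: a new object comes back
duplicated = z is not x
```

In this model the first assignment reuses the same object, while the one after the extra bind() returns a fresh object and leaves the original untouched, which mirrors the NAM(1) versus NAM(2) behaviour shown in the transcripts.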
Originally I forgot to load pryr and wrote my own address()
> address = function(x) .Internal(inspect(x))
which reveals an interesting problem
> x = 1:5
> address(x)
@4647128 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
> x[] = cumsum(x)
> address(x)
@4647098 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,3,6,10,15
Notice NAM(2), which says that inside the function there are at least two references to x, i.e., in the global environment and in the function environment. So touching x inside a function triggers future duplication, sort of a Heisenberg uncertainty principle. cumsum (and .Internal, and length) are written in a way that allows reference without increment to NAMED; address() should be revised to have similar behavior (this has now been fixed).
Hmm, when I dig a little deeper I see (I guess it's obvious, in retrospect) that what actually happens is that cumsum(x) does allocate memory via an S-expression:
> x = 1:5
> .Internal(inspect(x))
@3bb1cd0 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
> .Internal(inspect(cumsum(x)))
@43919d0 13 INTSXP g0c3 [] (len=5, tl=0) 1,3,6,10,15
but the assignment x[] <- associates the new memory with the old location (??). (This seems to be 'as efficient' as data.table, which apparently also creates an S-expression for cumsum, presumably because it's calling cumsum itself!) So mostly I've not been helpful in this answer...
It's not likely that the allocation per se causes performance problems, but rather garbage collection (gcinfo(TRUE) to see these) of the no-longer-used memory. I find it useful to launch R with
R --no-save --quiet --min-vsize=2048M --min-nsize=45M
which starts with a larger memory pool and hence fewer (initial) garbage collections. It would be useful to analyze your coding style to understand why you find this to be the performance bottleneck.
Answer 2:
Try the data.table package. It allows for updating values by reference using the := operator (as well as using the function set):
library(data.table)
A <- data.table(a = seq_len(99))
address(A) # [1] "0x108d283f0"
address(A$a) # [1] "0x108e548a0"
options(datatable.verbose=TRUE)
A[, a := cumsum(a)]
# Detected that j uses these columns: a
# Assigning to all 99 rows
# Direct plonk of unnamed RHS, no copy. <~~~ no copy of `A` or `A$a` is made.
address(A) # [1] "0x108d283f0"
address(A$a) # [1] "0x1078f5070"
Note that even though the address of A$a is different after updating by reference, there's no copy being made here. It's different because it's a full column plonk, meaning the vector cumsum(a) replaces the current column a (by reference). (The address you see is basically the address of cumsum(a).)
Source: https://stackoverflow.com/questions/25379761/can-r-do-operations-like-cumsum-in-place