Avoid copying the whole vector when replacing an element (a[1] <- 2)

情深已故 2021-01-04 12:15

When replacing an element in a vector such as

a <- 1:1000000
a[1] <- 2

R copies the whole vector, replaces the element in the new vec

2 answers
  • 2021-01-04 12:41

    You can do this with the ff package, which is on CRAN. With ff, your data is stored on disk, and indexing affects only the specific element you index.

    require(ff)
    a <- ff(1:1000000)
    a[1] <- 2
    

    For reference, here are some timings; the ff update is a lot faster for your toy case.

    require(ff)
    a <- 1:100000000
    b <- ff(a)
    system.time(a[1] <- 2)
     user  system elapsed 
    0.440   0.592   1.056 
    system.time(b[1] <- 2)
     user  system elapsed 
    0.004   0.000   0.001 
    
  • 2021-01-04 12:54

    The tracemem function (available only if R was compiled with memory profiling support) reports when an object is copied. Here's what you do:

    > a <- 1:1000000; tracemem(a)
    [1] "<0x7f791b39e010>"
    > a[1] = 2
    tracemem[0x7f791b39e010 -> 0x7f791a9d4010]: 
    

    and indeed there's a copy. But this is because you're coercing a from an integer vector (1:1000000 creates a sequence of integers) to a numeric vector (because 2 is a numeric value, and R coerces to a common type). If you instead update your integer vector with an integer value, or a numeric vector with a numeric value, there is no copying:

    > a <- 1:1000000; tracemem(a)
    [1] "<0x7f791a4ef010>"
    > a[1] = 2L
    > a = c(1, 2, 3); tracemem(a)
    [1] "<0x5180470>"
    > a[1] = 2
    >
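
The coercion described above can also be observed without tracemem by checking typeof() before and after the assignment. This is only a minimal sketch; the variable names are illustrative and not from the original question.

```r
# Integer vector updated with a double: R coerces the whole vector.
a <- 1:10
type_before <- typeof(a)
a[1] <- 2            # 2 is a double constant
type_after <- typeof(a)

# Integer vector updated with an integer constant: the type is preserved.
b <- 1:10
b[1] <- 2L           # the L suffix makes an integer constant
type_b <- typeof(b)

print(c(type_before, type_after, type_b))
```

If the type changes, a new vector of the new type had to be allocated, so a copy was unavoidable regardless of how many symbols refer to the data.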
    

    A little more insight comes from understanding, at a superficial level, how R's memory management works. Each allocation has a NAMED level associated with it. NAMED=0 or 1 indicates that at most one symbol refers to it; it is therefore safe to update in place. NAMED=2 means that there are, or have been, at least two symbols pointing to the same location, and that any attempt to update the value requires a duplication to preserve R's illusion of 'copy on change'. The following reveals some of the internal structure of a, including that it is of type INTSXP (integer) with NAM(1) (NAMED level 1) and that it is being TRaced. Hence updating (with an integer!) does not require a copy.

    > a = 1:10; tracemem(a); .Internal(inspect(a))
    [1] "<0x5170818>"
    @5170818 13 INTSXP g0c4 [NAM(1),TR] (len=10, tl=0) 1,2,3,4,5,...
    > a[1] = 2L
    > 
    

    On the other hand, here two symbols refer to the same location in memory, hence NAMED is 2 and a copy is required:

    > a = b = 1:10; tracemem(a); .Internal(inspect(a))
    [1] "<0x576d1a0>"
    @576d1a0 13 INTSXP g0c4 [NAM(2),TR] (len=10, tl=0) 1,2,3,4,5,...
    > a[1] = 2L
    tracemem[0x576d1a0 -> 0x576d148]: 
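
The reason the duplication is needed can be illustrated at the R level: after a shared vector is modified through one symbol, the other symbol must still see the old values. A small sketch, reusing the names a and b from the example above:

```r
a <- b <- 1:10       # two symbols refer to the same vector
a[1] <- 2L           # modify through one symbol only

print(a[1])          # the updated element
print(b[1])          # b is expected to be untouched

# b still holds the original sequence, so a must now live in its own memory.
b_unchanged <- identical(b, 1:10)
print(b_unchanged)
```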
    

    NAMED is difficult to reason about, so at some level trying to engineer around copies in this way is an exercise in futility.

    inspect returns other information. Each R type is represented internally as an 'SEXP' (S-expression) type. These are enumerated, and the 13th SEXP type is an integer SEXP -- hence 13 INTSXP. Check out .Internal(inspect(...)) for a numeric vector, a character vector, or even a function: .Internal(inspect(function() {})).
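
As a rough sketch of that suggestion, the following compares the R-level type names reported by typeof() with what inspect prints. The object names are illustrative, and the exact inspect output (addresses, flags) varies with the R version and session, so no output is shown.

```r
x_int <- 1:3
x_num <- c(1.5, 2.5)
x_chr <- c("a", "b")
f <- function() {}

# R-level type names for each object.
types <- c(typeof(x_int), typeof(x_num), typeof(x_chr), typeof(f))
print(types)

# Internal headers: the numeric vector is a REALSXP and the
# character vector is a STRSXP.
.Internal(inspect(x_num))
.Internal(inspect(x_chr))
```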

    R manages memory by periodically running a 'garbage collector' that checks to see if memory is currently referenced; if it is not, then it is reclaimed for use by another symbol. The garbage collector is 'generational', which means that recently allocated memory is checked for reclamation more frequently than older memory (this is because, empirically, variables tend to have a short half-life, e.g., for the duration of a function call, so recently allocated memory is more likely to be available for reclamation than memory that has been in use for a longer time). The g0c4 and similar annotations are providing information about the generation the SEXP belongs to.
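
Reclamation can be observed by calling gc() explicitly, which runs a collection and returns a matrix of memory usage. The sketch below is an illustration under assumptions: the vector size is arbitrary, and the exact counts depend on the session.

```r
# "Vcells" counts vector memory in 8-byte units.
before <- gc()["Vcells", "used"]

x <- numeric(1e6)                  # allocate about 8 MB of doubles
during <- gc()["Vcells", "used"]

rm(x)                              # drop the only reference to the vector
after <- gc()["Vcells", "used"]    # the memory can now be reclaimed

print(c(before = before, during = during, after = after))
```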

    The TR represents a 'bit' set in the SEXP to indicate that the variable is being traced; it was set when we said tracemem(a).

    Some of these topics are discussed in the documentation of R's internal implementation, RShowDoc("R-ints"), and in the C header file Rinternals.h.
