Perhaps a very dumb question.
I am trying to "vectorize" the following loop:
set.seed(0)
x <- round(runif(10), 2)
# [1] 0.90 0.27 0.37 0.57 0.91 0.20 0.90 0.94 0.66 0.63
sig <- sample.int(10)
for (i in seq_along(sig)) x[i] <- x[sig[i]]
As a warm-up, consider two simpler examples.
## example 1
x <- 1:11
for (i in 1:10) x[i] <- x[i + 1]
x
# [1] 2 3 4 5 6 7 8 9 10 11 11
x <- 1:11
x[1:10] <- x[2:11]
x
# [1] 2 3 4 5 6 7 8 9 10 11 11
## example 2
x <- 1:11
for (i in 1:10) x[i + 1] <- x[i]
x
# [1] 1 1 1 1 1 1 1 1 1 1 1
x <- 1:11
x[2:11] <- x[1:10]
x
# [1] 1 1 2 3 4 5 6 7 8 9 10
"Vectorization" is successful in the 1st example but not the 2nd. Why?
Here is a prudent analysis. "Vectorization" starts with loop unrolling, then executes several instructions in parallel. Whether a loop can be "vectorized" depends on the data dependency carried by the loop.
Unrolling the loop in example 1 gives
x[1] <- x[2]
x[2] <- x[3]
x[3] <- x[4]
x[4] <- x[5]
x[5] <- x[6]
x[6] <- x[7]
x[7] <- x[8]
x[8] <- x[9]
x[9] <- x[10]
x[10] <- x[11]
Executing these instructions one by one and executing them simultaneously give identical results. So this loop can be "vectorized".
The loop in example 2 is
x[2] <- x[1]
x[3] <- x[2]
x[4] <- x[3]
x[5] <- x[4]
x[6] <- x[5]
x[7] <- x[6]
x[8] <- x[7]
x[9] <- x[8]
x[10] <- x[9]
x[11] <- x[10]
Unfortunately, executing these instructions one by one and executing them simultaneously would not give identical results. For example, when executing them one by one, x[2] is modified by the 1st instruction, and this modified value is then passed to x[3] by the 2nd instruction, so x[3] ends up with the same value as x[1]. In parallel execution, however, x[3] equals the original x[2]. As a result, this loop can not be "vectorized".
In "vectorization" theory,

- "write-after-read": x[i] is modified after it is read;
- "read-after-write": x[i] is read after it is modified.

A loop with "write-after-read" data dependency can be "vectorized", while a loop with "read-after-write" data dependency can not.
Perhaps many people are confused by now. Is "vectorization" a form of "parallel processing"?
Yes. In the 1960s, when people wondered what kind of parallel-processing computer should be designed for high-performance computing, Flynn classified the design ideas into 4 types. The category "SIMD" (single instruction, multiple data) is what "vectorization" refers to, and a computer with "SIMD" capability is called a "vector processor" or "array processor".
In the 1960s there were not many programming languages. People wrote assembly (then FORTRAN, once a compiler was invented) to program CPU registers directly. A "SIMD" computer is able to load multiple data into a vector register with a single instruction and do the same arithmetic on those data at the same time, so data processing is indeed parallel. Consider our example 1 again. Suppose a vector register can hold two vector elements; then the loop can be executed with 5 iterations using vector processing rather than 10 iterations as in scalar processing.
reg <- x[2:3] ## load vector register
x[1:2] <- reg ## store vector register
-------------
reg <- x[4:5] ## load vector register
x[3:4] <- reg ## store vector register
-------------
reg <- x[6:7] ## load vector register
x[5:6] <- reg ## store vector register
-------------
reg <- x[8:9] ## load vector register
x[7:8] <- reg ## store vector register
-------------
reg <- x[10:11] ## load vector register
x[9:10] <- reg ## store vector register
Today there are many programming languages, like R. "Vectorization" no longer unambiguously refers to "SIMD". R is not a language where we can program CPU registers. The "vectorization" in R is just an analogy to "SIMD". In a previous Q & A, Does the term "vectorization" mean different things in different contexts?, I tried to explain this. The following mapping illustrates how the analogy is made:
single (assembly) instruction -> single R instruction
CPU vector registers -> temporary vectors
parallel processing in registers -> C/C++/FORTRAN loops with temporary vectors
So, the R "vectorization" of the loop in example 1 is something like
## the C-level loop is implemented by function "["
tmp <- x[2:11] ## load data into a temporary vector
x[1:10] <- tmp ## fill temporary vector into x
Most of the time we just do
x[1:10] <- x[2:11]
without explicitly assigning the temporary vector to a variable. The temporary memory block created is not pointed to by any R variable, and is therefore subject to garbage collection.
In the above, "vectorization" is not introduced with the simplest example. Very often, "vectorization" is introduced with something like
a[1] <- b[1] + c[1]
a[2] <- b[2] + c[2]
a[3] <- b[3] + c[3]
a[4] <- b[4] + c[4]
where a, b and c are not aliased in memory, that is, the memory blocks storing the vectors a, b and c do not overlap. This is an ideal case, as no memory aliasing implies no data dependency.
Apart from "data dependency", there is also "control dependency", that is, dealing with "if ... else ..." in "vectorization". However, for reasons of time and space I will not elaborate on this issue.
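Just to give a flavor of it, here is a small sketch of my own (the variables are made up for illustration): in R, a loop whose body branches on a condition can often be "vectorized" by turning the branch into an element-wise mask with ifelse():

x <- c(-2, 5, -1, 3)

## scalar loop with a control dependency: the branch decides what is written
y <- numeric(length(x))
for (i in seq_along(x)) {
  if (x[i] > 0) y[i] <- x[i] else y[i] <- 0
}

## "vectorized" form: the branch becomes an element-wise mask
y2 <- ifelse(x > 0, x, 0)

identical(y, y2)
# [1] TRUE

Compilers do something analogous on "SIMD" hardware, turning branches into masked (predicated) vector instructions.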
Now it is time to investigate the loop in the question.
set.seed(0)
x <- round(runif(10), 2)
sig <- sample.int(10)
# [1] 1 2 9 5 3 4 8 6 7 10
for (i in seq_along(sig)) x[i] <- x[sig[i]]
Unrolling the loop gives
x[1] <- x[1]
x[2] <- x[2]
x[3] <- x[9] ## 3rd instruction
x[4] <- x[5]
x[5] <- x[3] ## 5th instruction
x[6] <- x[4]
x[7] <- x[8]
x[8] <- x[6]
x[9] <- x[7]
x[10] <- x[10]
There is "read-after-write" data dependency between the 3rd and the 5th instruction, so the loop can not be "vectorized" (see Remark 1).
Well then, what does x[] <- x[sig] do? Let's first explicitly write out the temporary vector:
tmp <- x[sig]
x[] <- tmp
Since "[" is called twice, there are actually two C-level loops behind this "vectorized" code:
tmp[1] <- x[1]
tmp[2] <- x[2]
tmp[3] <- x[9]
tmp[4] <- x[5]
tmp[5] <- x[3]
tmp[6] <- x[4]
tmp[7] <- x[8]
tmp[8] <- x[6]
tmp[9] <- x[7]
tmp[10] <- x[10]
x[1] <- tmp[1]
x[2] <- tmp[2]
x[3] <- tmp[3]
x[4] <- tmp[4]
x[5] <- tmp[5]
x[6] <- tmp[6]
x[7] <- tmp[7]
x[8] <- tmp[8]
x[9] <- tmp[9]
x[10] <- tmp[10]
So x[] <- x[sig] is equivalent to
for (i in 1:10) tmp[i] <- x[sig[i]]
for (i in 1:10) x[i] <- tmp[i]
rm(tmp); gc()
which is not at all the original loop given in the question.
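We can confirm the difference concretely. In this check of mine, I fix sig to the permutation obtained in the question's session (rather than sampling it), so that the result does not depend on the RNG or the R version:

set.seed(0)
x <- round(runif(10), 2)
sig <- c(1, 2, 9, 5, 3, 4, 8, 6, 7, 10)  # the permutation from the question

x1 <- x
for (i in seq_along(sig)) x1[i] <- x1[sig[i]]  # the sequential loop

x2 <- x
x2[] <- x2[sig]                                # the "vectorized" version

identical(x1, x2)
# [1] FALSE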
If implementing the loop in Rcpp is seen as a "vectorization", then let it be. But there is no chance to further "vectorize" the C/C++ loop with "SIMD".
This Q & A is motivated by this Q & A. The OP originally presented the loop
for (i in 1:num) {
  for (j in 1:num) {
    mat[i, j] <- mat[i, mat[j, "rm"]]
  }
}
It is tempting to "vectorize" it as
mat[1:num, 1:num] <- mat[1:num, mat[1:num, "rm"]]
but it is potentially wrong. Later OP changed the loop to
for (i in 1:num) {
  for (j in 1:num) {
    mat[i, j] <- mat[i, 1 + num + mat[j, "rm"]]
  }
}
which eliminates the memory aliasing issue, because the columns to be replaced are the first num columns, while the columns to be looked up are after the first num columns.
I got some comments regarding whether the loop in the question is making "in-place" modification of x. Yes, it is. We can use tracemem:
set.seed(0)
x <- round(runif(10), 2)
sig <- sample.int(10)
tracemem(x)
#[1] "<0x28f7340>"
for (i in seq_along(sig)) x[i] <- x[sig[i]]
tracemem(x)
#[1] "<0x28f7340>"
My R session has allocated a memory block pointed to by address <0x28f7340> for x, and you may see a different value when you run the code. However, the output of tracemem will not change after the loop, which means that no copy of x is made. So the loop is indeed doing "in-place" modification without using extra memory.
However, the loop is not doing "in-place" permutation. "In-place" permutation is a more complicated operation: not only do elements of x need to be swapped along the loop, but elements of sig also need to be swapped (and in the end, sig would be 1:10).
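For the curious, here is a sketch of my own (not part of the original discussion) of such an "in-place" permutation, following the cycles of sig. Note that R's copy-on-modify semantics mean the function body works on copies of its arguments, so "in-place" here describes the algorithm, not R's actual memory behavior:

inplace_permute <- function(x, sig) {
  ## achieve x <- x[sig] without a full temporary copy of x
  for (i in seq_along(sig)) {
    if (sig[i] == i) next    # this slot is already done
    tmp <- x[i]              # hold the first element of the cycle
    cur <- i
    while (sig[cur] != i) {  # pull elements along the cycle
      x[cur] <- x[sig[cur]]
      nxt <- sig[cur]
      sig[cur] <- cur        # mark this slot as finished
      cur <- nxt
    }
    x[cur] <- tmp            # close the cycle
    sig[cur] <- cur          # sig ends up as seq_along(x)
  }
  x
}
inplace_permute(c(10, 20, 30), c(2, 3, 1))
# [1] 20 30 10

Only one scalar temporary (tmp) is used per cycle, instead of a whole temporary vector as in x[] <- x[sig].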