R vectorized array data manipulation

Posted by 半城伤御伤魂 on 2021-02-06 02:40:59

Question


I think many more people will be interested in this subject. I have a specific task that I need to do in the most efficient way. My base data are: the time indices of buy and sell signals, and, on the diagonal of the matrix indexed by those times, the ROC (rate of change) between the closest buy-sell pairs:

r <- array(data = NA, 
           dim = c(5, 5), 
           dimnames = list(buy_idx = c(1,5,9,12,16), 
                           sell_idx = c(3,7,10,14,19)))
diag(r) <- c(1.04,0.97,1.07,1.21,1.1)

The task is to generate the moving compound ROC over every possible window (every buy-sell pair). This is how I'm currently solving it:

for(i in 2:5){
  r[1:(i-1),i] <- r[1:(i-1),i-1] * r[i,i]
}

As long as I'm not running this inside an outer loop, the run time of my solution is quite acceptable. Is there a way to turn this loop into a vectorized solution? Are there any good, well-documented tutorials for learning the vectorized way of thinking in R? That would be much more valuable than a one-time solution!

edit 20130709:

The next task is closely related to the previous example: apply a tax on each transaction (tax given as a percentage). Current solution:

diag(r[,]) <- diag(r[,]) * ((1-(tax/100))^2)
for(i in 2:dim(r)[2]){
  r[1:(i-1),i] <- r[1:(i-1),i] * ((1-(tax/100))^(2*(i:2)))
}

Do you know a more efficient way? Or a more correct one, if this doesn't handle everything?


Answer 1:


If d holds your diagonal elements, then wherever j >= i, r[i,j] is prod(d[i:j]), which can also be written as prod(d[1:j]) / prod(d[1:(i-1)]), taking the denominator to be 1 when i = 1. Hence this trick using the outer ratio of the cumulative products:

d <- c(1.04,0.97,1.07,1.21,1.1)
n <- length(d)
p <- cumprod(c(1, d))
r <- t(outer(p, 1/p, "*"))[-n-1, -1]
r[lower.tri(r)] <- NA
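As a quick sanity check (a sketch added here, not part of the original answer), the ratio trick can be compared against the loop from the question on the five example values. The names r_loop and r_vec are just illustrative:

```r
d <- c(1.04, 0.97, 1.07, 1.21, 1.1)
n <- length(d)

# Loop version from the question, starting from an all-NA matrix
r_loop <- matrix(NA_real_, n, n)
diag(r_loop) <- d
for (i in 2:n) {
  r_loop[1:(i - 1), i] <- r_loop[1:(i - 1), i - 1] * d[i]
}

# Ratio of cumulative products: p[j + 1] / p[i] equals prod(d[i:j])
p <- cumprod(c(1, d))
r_vec <- t(outer(p, 1 / p))[-(n + 1), -1]
r_vec[lower.tri(r_vec)] <- NA

all.equal(r_loop, r_vec)
```

Prepending 1 to d before cumprod is what makes the i = 1 row work without a special case.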

Some benchmarks showing that it does better than OP for some (not all) input sizes:

library(microbenchmark)  # provides microbenchmark()

OP <- function(d) {
   r <- diag(d)
   for(i in 2:length(d)){
     r[1:(i-1),i] <- r[1:(i-1),i-1] * r[i,i]
   }
   r
}

flodel <- function(d) {
   n <- length(d)
   p <- cumprod(c(1, d))
   r <- t(outer(p, 1/p, "*"))[-n-1, -1]
   r[lower.tri(r)] <- NA
   r
}

d <- runif(10)
microbenchmark(OP(d), flodel(d))
# Unit: microseconds
#        expr     min       lq   median      uq     max
# 1 flodel(d)  83.028  85.6135  88.4575  90.153 144.111
# 2     OP(d) 115.993 122.0075 123.4730 126.826 206.892

d <- runif(100)
microbenchmark(OP(d), flodel(d))
# Unit: microseconds
#        expr      min       lq    median       uq      max
# 1 flodel(d)  490.819  545.528  549.6095  566.108  684.043
# 2     OP(d) 1227.235 1260.823 1282.9880 1313.264 3913.322

d <- runif(1000)
microbenchmark(OP(d), flodel(d))
# Unit: milliseconds
#        expr      min        lq    median        uq       max
# 1 flodel(d) 97.78687 106.39425 121.13807 133.99502 154.67168
# 2     OP(d) 53.49014  60.10124  72.56427  85.17864  91.89011
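One caveat that the timings do not show (an editorial sketch, not something the answer claims): the ratio trick divides cumulative products, so once the running product underflows to zero, later entries come out as NaN. This is likely for long vectors of values below 1, such as runif(1000). The input below is a made-up illustration, and the log-space variant is my own suggestion that assumes every element of d is strictly positive:

```r
d <- rep(0.1, 400)                # illustrative input: product reaches 1e-400
n <- length(d)

p <- cumprod(c(1, d))             # trailing entries underflow to exactly 0
r <- t(outer(p, 1 / p))[-(n + 1), -1]
r[lower.tri(r)] <- NA

r[1, 1]                           # 0.1, as expected
r[n, n]                           # NaN (0 * Inf), although the true value is 0.1

# Log-space variant (assumes all d > 0): differences of cumulative log sums
L <- cumsum(c(0, log(d)))
r_log <- exp(t(outer(L, -L, "+")))[-(n + 1), -1]
r_log[lower.tri(r_log)] <- NA

r_log[n, n]                       # close to 0.1
```

Whether the extra log and exp calls are worth it depends on the data; for ROC values near 1 and short series, the plain ratio is probably fine.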

edit to answer 20130709 addition:

I'll assume tax is a scalar and let z <- (1 - tax/100)^2. Your final result is r multiplied elementwise by a matrix of z raised to different powers. What you want to avoid is computing these powers over and over. Here is what I would do:

pow <- 1L + col(r) - row(r)
pow[lower.tri(pow)] <- NA
tax.mult <- (z^(1:n))[pow]
r <- r * tax.mult
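To see that the lookup gives the same numbers as the loop in the question, here is a small self-contained check. It is only a sketch: the tax value of 0.5 percent is hypothetical, and r_loop and r_vec are names made up for the comparison:

```r
d <- c(1.04, 0.97, 1.07, 1.21, 1.1)
n <- length(d)
tax <- 0.5                        # hypothetical tax rate, in percent
z <- (1 - tax / 100)^2            # one buy plus one sell

# Untaxed compound ROC, upper triangle only
r <- t(sapply(seq_len(n), function(i) c(rep(NA, i - 1), cumprod(d[i:n]))))

# Loop from the question: row k of column i spans i - k + 1 buy-sell pairs
r_loop <- r
diag(r_loop) <- diag(r_loop) * z
for (i in 2:n) {
  r_loop[1:(i - 1), i] <- r_loop[1:(i - 1), i] * z^(i:2)
}

# Lookup version: compute z^1 ... z^n once, then index by window length
pow <- 1L + col(r) - row(r)
pow[lower.tri(pow)] <- NA
r_vec <- r * (z^(1:n))[pow]

all.equal(r_loop, r_vec)
```

The point of the lookup is that each power of z is computed once, rather than once per column.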



Answer 2:


I have taken a different method which boils down to the use of Reduce. Putting a simple example of Reduce out there for recursive calculations might be worthwhile to someone:

OP's intended result:

> r
       sell_idx
buy_idx    3      7       10       14       19
     1  1.04 1.0088 1.079416 1.306093 1.436703
     5    NA 0.9700 1.037900 1.255859 1.381445
     9    NA     NA 1.070000 1.294700 1.424170
     12   NA     NA       NA 1.210000 1.331000
     16   NA     NA       NA       NA 1.100000

Basic example using the diagonal starting values and Reduce:

x <- c(1.04,0.97,1.07,1.21,1.1)
Reduce(prod, tail(x,-1), x[1], accumulate=TRUE)

## gives the first row of the answer:
## 1.04, 1.04*0.97, 1.04*0.97*1.07, and so on

[1] 1.040000 1.008800 1.079416 1.306093 1.436703

Looping across the length of the starting values and adding some NAs gives the full result:

t(
  sapply(1:length(x),
    function(y) c(rep(NA,y-1),Reduce(prod, tail(x,-y), x[y], accumulate=TRUE))
    )
)

The full result:

     [,1]   [,2]     [,3]     [,4]     [,5]
[1,] 1.04 1.0088 1.079416 1.306093 1.436703
[2,]   NA 0.9700 1.037900 1.255859 1.381445
[3,]   NA     NA 1.070000 1.294700 1.424170
[4,]   NA     NA       NA 1.210000 1.331000
[5,]   NA     NA       NA       NA 1.100000

edit

And since the above Reduce fanciness is just equivalent to cumprod, an alternative simpler solution would just be:

rbind(
  cumprod(x),
  t(sapply(1:(length(x)-1),function(y) c(rep(NA,y),cumprod(tail(x,-y)))))
)
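A short check that the two constructions agree (a sketch; via_reduce and via_cumprod are illustrative names, not from the answer):

```r
x <- c(1.04, 0.97, 1.07, 1.21, 1.1)

# Reduce-based rows, padded on the left with NA
via_reduce <- t(sapply(seq_along(x), function(y)
  c(rep(NA, y - 1), Reduce(prod, tail(x, -y), x[y], accumulate = TRUE))))

# cumprod-based rows
via_cumprod <- rbind(
  cumprod(x),
  t(sapply(1:(length(x) - 1), function(y) c(rep(NA, y), cumprod(tail(x, -y)))))
)

all.equal(via_reduce, via_cumprod, check.attributes = FALSE)
```

Reduce with accumulate = TRUE is the general tool for running calculations with an arbitrary two-argument function; cumprod, cumsum, cummax and cummin cover the common cases directly.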



Answer 3:


Going in a different direction from vectorization, here's an approach that yields speed gains (that are very large for small arrays and get to 2-3x range for large ones):

library(inline)
library(Rcpp)

solver_fn = cxxfunction(signature(x = "numeric"), '
  NumericVector diag(x);

  unsigned n = diag.size();
  std::vector<double> result(n*n);

  result[0] = diag[0];

  unsigned col_shift_old = 0, col_shift = 0;
  for (unsigned col = 1; col < n; ++col) {
    col_shift = col * n;
    for (unsigned row = 0; row <= col; ++row) {
      if (row == col)  // diagonal entry starts a new product (no zero sentinel needed)
        result[row + col_shift] = diag[col];
      else
        result[row + col_shift] = result[row + col_shift_old] * diag[col];
    }
    col_shift_old = col_shift;
  }

  return NumericVector(result.begin(), result.end());
', plugin = "Rcpp")

compute_matrix = function(d) {
  matrix(solver_fn(d), ncol = length(d))
}

And here are some benchmarks:

op = function(d) {
  r = diag(d)
  for (i in 2:length(d)) {
    r[1:(i-1), i] <- r[1:(i-1), i-1] * r[i,i]
  }
  r
}

d = runif(1e4)
system.time(op(d))
# user  system elapsed
#3.456   1.006   4.462
system.time(compute_matrix(d))
# user  system elapsed
#1.001   0.657   1.660

d = runif(1e3)
system.time(op(d))
# user  system elapsed
# 0.04    0.00    0.04
system.time(compute_matrix(d))
# user  system elapsed
#0.008   0.000   0.009

d = runif(1e2)
system.time(for (i in 1:1000) {op(d)})
# user  system elapsed
#1.075   0.000   1.075
system.time(for (i in 1:1000) {compute_matrix(d)})
# user  system elapsed
#0.075   0.000   0.075

Re 20130709 edit:

Just pass the tax to the C++ function and do the multiplications there. If you understand how the above works the change will be trivial.




Answer 4:


Disclaimer: I've used this in another answer. So it is going to be a shameless plug.


To answer what seems to be your generic question instead of the example you cited --- how to convert a for loop into a vectorized solution --- the following may be a few useful pointers:

Consider the structure of the object that you are iterating over. There may be different types, for example:

a) Elements of a vector / matrix.

b) Rows / Columns of a matrix.

c) A dimension of a higher dimensional array.

d) Elements of a list (which within themselves may be one of the objects cited above).

e) Corresponding elements of multiple lists / vectors.

In each case, the function you employ may be slightly different, but the strategy is the same. Moreover, learn the apply family. The various *pply functions are based on a similar abstraction but differ in what they take as input and what they return as output.

For the cases listed above, for example:

a) Elements of a vector: Look for already existing vectorized solutions (as given above) which are a core strength in R. On top of that consider matrix algebra. Most problems that seem to require loops (or nested loops) can be written as equations in matrix algebra.

b) Rows / Columns of a matrix: Use apply with the correct value for the MARGIN argument. Similarly for c), for higher dimensional arrays.

d) Use an lapply. If the output you return is a 'simple' structure (a scalar or a vector), you may consider sapply which is simply simplify2array(lapply(...)) and returns an array in the appropriate dimensions.

e) Use mapply. The 'm' can stand for multivariate apply.

Once you have understood the object you are iterating over and the corresponding tool, simplify your problem. Think not of the overall object you are iterating over but one instance of it. For example when iterating over rows of a matrix, forget about the matrix and remember only the row.

Now, write a function (or a lambda) that operates on only one instance (element) of the object you are iterating over, and simply apply it using the correct member of the *pply family.
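To make the case list concrete, here is a small sketch with one call per case. The toy objects m, arr and lst are invented for illustration:

```r
m   <- matrix(1:6, nrow = 2)           # 2 x 3 matrix
arr <- array(1:8, dim = c(2, 2, 2))    # 2 x 2 x 2 array
lst <- list(a = 1:3, b = 4:6)

# a) elements of a vector: use an already vectorized function
roots <- sqrt(c(1, 4, 9))              # 1 2 3

# b) rows / columns of a matrix: apply with MARGIN = 1 (rows) or 2 (columns)
row_sums <- apply(m, 1, sum)           # 9 12
col_max  <- apply(m, 2, max)           # 2 4 6

# c) a dimension of a higher dimensional array: apply over that margin
slice_sums <- apply(arr, 3, sum)       # 10 26

# d) elements of a list: lapply returns a list, sapply simplifies it
lens  <- lapply(lst, length)           # list(a = 3, b = 3)
means <- sapply(lst, mean)             # a = 2, b = 5

# e) corresponding elements of several vectors: mapply
pw <- mapply(function(base, expo) base^expo, 2:4, 1:3)   # 2 9 64
```

In each call the function argument only ever sees one row, one slice, or one list element, which is the "one instance" idea described above.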


Here is my attempt at the problem using cumprod. This hits a sweet spot at around 1000 x 1000 matrices, but it returns a list and not the matrix you expect. However, I am not providing this as a solution, since I think your solution in base R is best, followed by @eddi's in Rcpp. This is just an example of the process I discussed above:

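# Element k of the list holds row k of the result without the NA padding:
# d[k], d[k]*d[k+1], ..., prod(d[k:length(d)])
asb <- function (d) lapply(X=seq_along(d),
                           FUN=function (k) cumprod(d[k:length(d)]))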


Source: https://stackoverflow.com/questions/17202421/r-vectorized-array-data-manipulation
