Question
I think many people will be interested in this subject. I have a specific task that I need to do as efficiently as possible. My base data are:
- the time indices of buy and sell signals
- on the diagonal of the matrix of time indices, the ROC (rate of change) between the closest buy-sell pairs:
r <- array(data = NA,
dim = c(5, 5),
dimnames = list(buy_idx = c(1,5,9,12,16),
sell_idx = c(3,7,10,14,19)))
diag(r) <- c(1.04,0.97,1.07,1.21,1.1)
The task is to generate the compound ROC over every possible window (buy-sell pair). This is how I currently solve it:
for(i in 2:5){
r[1:(i-1),i] <- r[1:(i-1),i-1] * r[i,i]
}
As long as I'm not running this inside another loop higher up, the running time of my solution is quite acceptable. Is there a way to change this loop into a vectorized solution? Are there any good, well-documented tutorials for learning the vectorized way of thinking in R? That would be much more valuable than a one-time solution!
Edit 20130709:
The next task is closely related to the previous example: apply a tax to each transaction (tax given in percent). Current solution:
diag(r[,]) <- diag(r[,]) * ((1-(tax/100))^2)
for(i in 2:dim(r)[2]){
r[1:(i-1),i] <- r[1:(i-1),i] * ((1-(tax/100))^(2*(i:2)))
}
Do you know a more efficient way? Or a more correct one, if this doesn't handle everything?
Answer 1:
If d are your diagonal elements, then everywhere j >= i, r[i,j] is prod(d[i:j]), which can also be written prod(d[1:j]) / prod(d[1:(i-1)]) (taking the denominator as 1 when i is 1). Hence this trick using the outer ratio of the cumulative product:
d <- c(1.04,0.97,1.07,1.21,1.1)
n <- length(d)
p <- cumprod(c(1, d))
r <- t(outer(p, 1/p, "*"))[-n-1, -1]
r[lower.tri(r)] <- NA
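As a quick sanity check (a sketch, not part of the original answer), the snippet below rebuilds the matrix both ways on the five example values and compares the results. The names r_loop and r_outer are made up for this illustration.

```r
d <- c(1.04, 0.97, 1.07, 1.21, 1.1)
n <- length(d)

# Loop from the question: each column extends the previous one by d[i].
r_loop <- array(NA_real_, dim = c(n, n))
diag(r_loop) <- d
for (i in 2:n) {
  r_loop[1:(i - 1), i] <- r_loop[1:(i - 1), i - 1] * d[i]
}

# Cumulative-product trick: r[i, j] = p[j + 1] / p[i], with p[1] = 1.
p <- cumprod(c(1, d))
r_outer <- t(outer(p, 1 / p, "*"))[-(n + 1), -1]
r_outer[lower.tri(r_outer)] <- NA

# Compare up to a small numerical tolerance.
all.equal(r_loop, r_outer)
```

all.equal is used rather than identical because the second version divides cumulative products instead of multiplying the factors directly, so the two can differ in the last few bits. Note also that the division breaks down if any element of d is exactly zero, since every later entry of p is then zero as well.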
Some benchmarks showing that it does better than OP for some (not all) input sizes:
OP <- function(d) {
r <- diag(d)
for(i in 2:length(d)){
r[1:(i-1),i] <- r[1:(i-1),i-1] * r[i,i]
}
r
}
flodel <- function(d) {
n <- length(d)
p <- cumprod(c(1, d))
r <- t(outer(p, 1/p, "*"))[-n-1, -1]
r[lower.tri(r)] <- NA
r
}
library(microbenchmark)
d <- runif(10)
microbenchmark(OP(d), flodel(d))
# Unit: microseconds
# expr min lq median uq max
# 1 flodel(d) 83.028 85.6135 88.4575 90.153 144.111
# 2 OP(d) 115.993 122.0075 123.4730 126.826 206.892
d <- runif(100)
microbenchmark(OP(d), flodel(d))
# Unit: microseconds
# expr min lq median uq max
# 1 flodel(d) 490.819 545.528 549.6095 566.108 684.043
# 2 OP(d) 1227.235 1260.823 1282.9880 1313.264 3913.322
d <- runif(1000)
microbenchmark(OP(d), flodel(d))
# Unit: milliseconds
# expr min lq median uq max
# 1 flodel(d) 97.78687 106.39425 121.13807 133.99502 154.67168
# 2 OP(d) 53.49014 60.10124 72.56427 85.17864 91.89011
Edit to answer the 20130709 addition:
I'll assume tax is a scalar and let z <- (1 - tax/100)^2. Your final result is r multiplied by a matrix of z raised to different powers. What you want to avoid is computing these powers over and over. Here is what I would do:
pow <- 1L + col(r) - row(r)
pow[lower.tri(pow)] <- NA
tax.mult <- (z^(1:n))[pow]
r <- r * tax.mult
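To see that this lookup gives the same numbers as the loop in the question, here is a small self-contained comparison. It is only a sketch: the tax rate of 0.5 percent and the names r_loop and r_vec are assumptions chosen for illustration.

```r
d <- c(1.04, 0.97, 1.07, 1.21, 1.1)
n <- length(d)
tax <- 0.5                      # hypothetical tax rate, in percent
z <- (1 - tax / 100)^2

# Untaxed compound ROC matrix (upper triangle, NA below the diagonal).
p <- cumprod(c(1, d))
r <- t(outer(p, 1 / p, "*"))[-(n + 1), -1]
r[lower.tri(r)] <- NA

# Loop from the question, applied to a copy of r.
r_loop <- r
diag(r_loop) <- diag(r_loop) * z
for (i in 2:n) {
  r_loop[1:(i - 1), i] <- r_loop[1:(i - 1), i] * (1 - tax / 100)^(2 * (i:2))
}

# Lookup version: z^k is computed once for k = 1..n, then indexed by pow.
pow <- 1L + col(r) - row(r)
pow[lower.tri(pow)] <- NA
r_vec <- r * (z^(1:n))[pow]

all.equal(r_loop, r_vec)
```

The entry in row i and column j spans j - i + 1 buy-sell pairs, which is why the exponent of z is 1 + col - row.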
Answer 2:
I have taken a different approach, which boils down to the use of Reduce. Putting a simple example of Reduce out there for recursive calculations might be worthwhile to someone:
OP's intended result:
> r
sell_idx
buy_idx 3 7 10 14 19
1 1.04 1.0088 1.079416 1.306093 1.436703
5 NA 0.9700 1.037900 1.255859 1.381445
9 NA NA 1.070000 1.294700 1.424170
12 NA NA NA 1.210000 1.331000
16 NA NA NA NA 1.100000
Basic example using the diagonal starting values and Reduce:
x <- c(1.04,0.97,1.07,1.21,1.1)
Reduce(prod, tail(x,-1), x[1], accumulate=TRUE)
## gives first row of the answer
## i.e. 1.04, then 1.04*0.97, then (1.04*0.97)*1.07, and so on
[1] 1.040000 1.008800 1.079416 1.306093 1.436703
Looping across the length of the starting values and adding some NAs gives the full result:
t(
sapply(1:length(x),
function(y) c(rep(NA,y-1),Reduce(prod, tail(x,-y), x[y], accumulate=TRUE))
)
)
The full result:
[,1] [,2] [,3] [,4] [,5]
[1,] 1.04 1.0088 1.079416 1.306093 1.436703
[2,] NA 0.9700 1.037900 1.255859 1.381445
[3,] NA NA 1.070000 1.294700 1.424170
[4,] NA NA NA 1.210000 1.331000
[5,] NA NA NA NA 1.100000
Edit:
And since the above Reduce fanciness is just equivalent to cumprod, a simpler alternative would be:
rbind(
cumprod(x),
t(sapply(1:(length(x)-1),function(y) c(rep(NA,y),cumprod(tail(x,-y)))))
)
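The two constructions in this answer can be checked against each other directly. The following is a minimal sketch; the names via_reduce and via_cumprod are made up for the comparison.

```r
x <- c(1.04, 0.97, 1.07, 1.21, 1.1)
n <- length(x)

# Row y: y - 1 leading NAs, then the running product starting at x[y].
via_reduce <- t(sapply(seq_len(n), function(y)
  c(rep(NA, y - 1), Reduce(prod, tail(x, -y), x[y], accumulate = TRUE))))

# Same rows built with cumprod on the tail of x.
via_cumprod <- rbind(
  cumprod(x),
  t(sapply(seq_len(n - 1), function(y) c(rep(NA, y), cumprod(tail(x, -y)))))
)

# Ignore attributes such as dimnames and compare the values only.
all.equal(via_reduce, via_cumprod, check.attributes = FALSE)
```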
Answer 3:
Going in a different direction from vectorization, here's an approach that yields speed gains (very large for small arrays, and in the 2-3x range for large ones):
library(inline)
library(Rcpp)
solver_fn = cxxfunction(signature(x = "numeric"), '
NumericVector diag(x);
unsigned n = diag.size();
std::vector<double> result(n*n);
result[0] = diag[0];
unsigned col_shift_old = 0, col_shift = 0;
for (unsigned col = 1; col < n; ++col) {
col_shift = col * n;
for (unsigned row = 0; row <= col; ++row) {
if (row == col)  // diagonal entry starts a new window (also safe when a value in x is 0)
result[row + col_shift] = diag[col];
else
result[row + col_shift] = result[row + col_shift_old] * diag[col];
}
col_shift_old = col_shift;
}
return NumericVector(result.begin(), result.end());
', plugin = "Rcpp")
compute_matrix = function(d) {
matrix(solver_fn(d), ncol = length(d))
}
And here are some benchmarks:
op = function(d) {
r = diag(d)
for (i in 2:length(d)) {
r[1:(i-1), i] <- r[1:(i-1), i-1] * r[i,i]
}
r
}
d = runif(1e4)
system.time(op(d))
# user system elapsed
#3.456 1.006 4.462
system.time(compute_matrix(d))
# user system elapsed
#1.001 0.657 1.660
d = runif(1e3)
system.time(op(d))
# user system elapsed
# 0.04 0.00 0.04
system.time(compute_matrix(d))
# user system elapsed
#0.008 0.000 0.009
d = runif(1e2)
system.time(for (i in 1:1000) {op(d)})
# user system elapsed
#1.075 0.000 1.075
system.time(for (i in 1:1000) {compute_matrix(d)})
# user system elapsed
#0.075 0.000 0.075
Re 20130709 edit:
Just pass the tax to the C++ function and do the multiplications there. If you understand how the above works, the change will be trivial.
Answer 4:
Disclaimer: I've used this in another answer. So it is going to be a shameless plug.
To answer what seems to be your generic question instead of the example you cited --- how to convert a for loop into a vectorized solution --- the following may be a few useful pointers:
Consider the structure of the object that you are iterating over. There may be different types, for example:
a) Elements of a vector / matrix.
b) Rows / Columns of a matrix.
c) A dimension of a higher dimensional array.
d) Elements of a list (which within themselves may be one of the objects cited above).
e) Corresponding elements of multiple lists / vectors.
In each case, the function you employ may be slightly different, but the strategy is the same. Moreover, learn the apply family. The various *pply functions are based on a similar abstraction but differ in what they take as input and what they return as output.
For the cases listed above, for example:
a) Elements of a vector: Look for already existing vectorized solutions (as given above) which are a core strength in R. On top of that consider matrix algebra. Most problems that seem to require loops (or nested loops) can be written as equations in matrix algebra.
b) Rows / Columns of a matrix: Use apply. Use the correct value for the MARGIN argument. Similarly for c), higher dimensional arrays.
d) Use an lapply. If the output you return is a 'simple' structure (a scalar or a vector), you may consider sapply which is simply simplify2array(lapply(...)) and returns an array in the appropriate dimensions.
e) Use mapply. The 'm' can stand for multivariate apply.
Once you have understood the object you are iterating over and the corresponding tool, simplify your problem. Think not of the overall object you are iterating over but one instance of it. For example when iterating over rows of a matrix, forget about the matrix and remember only the row.
Now, write a function (or a lambda) that operates on only the one instance (element) of your iterand and simply `apply' it using the correct member of the *pply family.
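The cases above can be illustrated on toy data. This is only a sketch with made-up objects (m, lst, and the result names are not from the question), showing which member of the family fits which kind of object.

```r
m <- matrix(1:6, nrow = 2)          # 2 x 3 matrix, filled column by column

# a) Elements of a vector: arithmetic is already vectorized, no loop needed.
squares <- (1:3)^2

# b) Rows / columns of a matrix: apply with the right MARGIN.
row_sums <- apply(m, 1, sum)        # MARGIN = 1: one result per row
col_maxs <- apply(m, 2, max)        # MARGIN = 2: one result per column

# d) Elements of a list: lapply returns a list, sapply simplifies it.
lst   <- list(a = 1:3, b = 4:6)
lens  <- lapply(lst, length)        # a list with one length per element
means <- sapply(lst, mean)          # a named numeric vector

# e) Corresponding elements of several vectors: mapply.
sums <- mapply(function(u, v) u + v, 1:3, 4:6)
```

In each call the function argument only ever sees one instance (one row, one list element, one pair of values), which is the simplification described above.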
Here is my attempt at the problem using cumprod. This hits a sweet spot at around 1000 x 1000 matrices, but it returns a list and not a matrix as you expect. However, I am not providing this as a solution, since I think your solution in base R is best, followed by @eddi's in Rcpp. This is just an example of the process I discussed above:
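# Element k of the result is the running product of d[k], d[k+1], ..., d[n],
# i.e. the non-NA part of row k of the matrix in the question.
asb <- function (d) lapply(X=seq_along(d),
FUN=function (k) cumprod(d[k:length(d)]))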
Source: https://stackoverflow.com/questions/17202421/r-vectorized-array-data-manipulation