Fastest way to calculate many regressions in python?

Submitted by 孤街醉人 on 2020-08-25 06:00:48

Question


I think I have a pretty reasonable idea of how to go about accomplishing this, but I'm not 100% sure on all of the steps. This question is mostly intended as a sanity check to ensure that I'm doing this in the most efficient way, and that my math is actually sound (since my statistics knowledge is far from perfect).

Anyways, some explanation about what I'm trying to do:

I have a lot of time series data that I would like to perform some linear regressions on. In particular, I have roughly 2000 observations on 500 different variables. For each variable, I need to perform a regression using two explanatory variables (two additional vectors of roughly 2000 observations). So for each of 500 different Y's, I would need to find a and b in the following regression Y = aX_1 + bX_2 + e.

Up until this point, I have been using the OLS function in the statsmodels package to perform my regressions. However, as far as I can tell, if I wanted to use the statsmodels package to accomplish my problem, I would have to call it hundreds of times, which just seems generally inefficient.

So instead, I decided to revisit some statistics that I haven't really touched in a long time. If my knowledge is still correct, I can put all of my observations into one large Y matrix that is roughly 2000 x 500. I can then stick my explanatory variables into an X matrix that is roughly 2000 x 2, and get the results of all 500 of my regressions by calculating (X'X)^-1(X'Y). If I do this using basic numpy stuff (matrix multiplication using * and inverses using matrix.I), I'm guessing it will be much faster than doing hundreds of statsmodels OLS calls.
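The batched formulation above can be sketched as follows. This is a minimal illustration with synthetic data (the array names and sizes are made up to match the question); np.linalg.lstsq solves all 500 regressions in one call without forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_series = 2000, 500

X = rng.normal(size=(n_obs, 2))            # the two explanatory variables
beta_true = rng.normal(size=(2, n_series)) # one (a, b) pair per series
Y = X @ beta_true + 0.1 * rng.normal(size=(n_obs, n_series))

# Solve all 500 regressions at once: beta = (X'X)^-1 (X'Y).
# lstsq is numerically preferable to computing the inverse directly.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta.shape)  # (2, 500)
```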

Here are the questions that I have:

  • Is the numpy stuff that I am doing faster than the earlier method of calling statsmodels many times? If so, is it the fastest/most efficient way to accomplish what I want? I'm assuming that it is, but if you know of a better way then I would be happy to hear it. (Surely I'm not the first person to need to calculate many regressions in this way.)
  • How do I deal with missing data in my matrices? My time series data is not going to be nice and complete, and will be missing values occasionally. If I just try to do regular matrix multiplication in numpy, the NA values will propagate and I'll end up with a matrix of mostly NAs as my end result. If I do each regression independently, I can just drop the rows containing NAs before I perform my regression, but if I do this on the large 2000 x 500 matrix I will end up dropping actual, non-NA data from some of my other variables, and I obviously don't want that to happen.
  • What is the most efficient way to ensure that my time series data actually lines up correctly before I put it into the matrices in the first place? The start and end dates for my observations are not necessarily the same, and some series might have days that others do not. If I were to pick a method for doing this, I would put all the observations into a pandas data frame indexed by the date. Then pandas will end up doing all of the work aligning everything for me and I can extract the underlying ndarray after it is done. Is this the best method, or does pandas have some sort of overhead that I can avoid by doing the matrix construction in a different way?
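The pandas-based alignment described in the last point can be sketched like this (dates and series are hypothetical): concat aligns every series on the union of the date indexes and inserts NaN where a series has no observation, after which the underlying ndarray can be extracted.

```python
import pandas as pd

# Hypothetical: two series whose date ranges only partially overlap.
a = pd.Series(range(5), index=pd.date_range("2020-01-01", periods=5), name="a")
b = pd.Series(range(5), index=pd.date_range("2020-01-03", periods=5), name="b")

# concat aligns on the union of the date indexes; missing dates become NaN.
df = pd.concat([a, b], axis=1)
Y = df.to_numpy()  # underlying ndarray, NaN where data is missing
print(df.isna().sum().to_dict())  # {'a': 2, 'b': 2}
```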

Answer 1:


Some brief answers:

1) Calling statsmodels repeatedly is not the fastest way. If we just need parameters, predictions and residuals, and all regressions share the same explanatory variables, then I usually just use params = pinv(x).dot(y), where y is two-dimensional, and calculate the rest from there. The problem is that inference, confidence intervals and the like require more work, so unless speed is crucial and only a restricted set of results is required, statsmodels OLS is still more convenient.
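The pinv approach mentioned above looks like this (data here is synthetic for illustration); fitted values and residuals for all 500 regressions fall out of the same two matrix products:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(2000, 2))
y = x @ rng.normal(size=(2, 500)) + rng.normal(size=(2000, 500))

params = np.linalg.pinv(x).dot(y)  # (2, 500): coefficients for every y
fitted = x @ params                # predictions for every regression
resid = y - fitted                 # residuals for every regression
print(params.shape, resid.shape)
```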

This only works if all y and x have the same observation indices, no missing values, and no gaps.

Aside: this setup is a multivariate linear model, which statsmodels will hopefully support in the not-too-distant future.

2) and 3) The fast, simple linear algebra of case 1) does not work if there are missing cells or the observation indices do not overlap completely. In the panel-data analogy, the first case requires a "balanced" panel; the other cases imply "unbalanced" data. The standard approach is to stack the data with the explanatory variables in block-diagonal form. Since this increases memory use by a large amount, using sparse matrices and sparse linear algebra is better. Whether building and solving the sparse problem is faster than looping over individual OLS regressions depends on the specific case.
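A hedged sketch of the block-diagonal stacking for the unbalanced case: each series keeps only its own non-missing rows (so the blocks have different heights, chosen arbitrarily here), and one sparse least-squares solve recovers every coefficient at once.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(2)

# Hypothetical unbalanced setup: three series with different row counts.
n_rows = [1800, 1950, 2000]
blocks, ys, betas = [], [], []
for n in n_rows:
    x = rng.normal(size=(n, 2))
    b = rng.normal(size=2)
    blocks.append(sp.csr_matrix(x))
    betas.append(b)
    ys.append(x @ b + 0.1 * rng.normal(size=n))

X_big = sp.block_diag(blocks)    # shape (sum(n_rows), 2 * n_series)
y_big = np.concatenate(ys)

beta_hat = lsqr(X_big, y_big)[0]  # all coefficients, stacked series by series
print(beta_hat.reshape(-1, 2))
```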

Specialized code (just a thought):

In case 2), with not fully overlapping or cellwise missing values, we would still need to calculate the x'x and x'y matrices for every y, i.e. 500 of each. Given that there are only two regressors, 500 x 2 x 2 still does not require much memory. So it might be possible to calculate params, predictions and residuals by using the non-missing mask as weights in the cross-product calculations. numpy's linalg.inv is vectorized over leading axes, as far as I know, so this could probably be done with a few vectorized calculations.
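One way this masked cross-product idea could look (a sketch on synthetic data, not the answerer's actual code): zeroed-out missing cells drop out of the x'y sums, the mask weights the per-series x'x, and a single batched np.linalg.inv call handles all 500 of the 2x2 systems.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 2000, 2, 500
X = rng.normal(size=(n, k))
beta_true = rng.normal(size=(k, m))
Y = X @ beta_true + 0.1 * rng.normal(size=(n, m))

# Knock out ~5% of the cells to simulate cellwise missing data.
mask = rng.random(size=(n, m)) > 0.05   # True where observed
Y = np.where(mask, Y, np.nan)

w = mask.astype(float)                  # non-missing mask as weights
Y0 = np.where(mask, Y, 0.0)             # zeros contribute nothing to the sums

# Per-series cross products: xtx[j] = sum over observed rows of x_i x_i'.
xtx = np.einsum('nj,ni,nk->jik', w, X, X)   # shape (m, k, k)
xty = np.einsum('ni,nj->ji', X, Y0)         # shape (m, k)

# np.linalg.inv is vectorized over the leading axis, so this inverts
# all 500 of the 2x2 matrices at once.
beta = np.einsum('jik,jk->ji', np.linalg.inv(xtx), xty)
print(beta.shape)  # (m, k)
```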



来源:https://stackoverflow.com/questions/40287113/fastest-way-to-calculate-many-regressions-in-python
