So I have these ginormous matrices X and Y. X and Y both have 100 million rows, and X has 10 columns. I'm trying to implement linear regression with these matrices, and I need a way to do it without loading the entire dataset into memory at once.
A neat property of ordinary least squares regression is that if you have two datasets X1, Y1 and X2, Y2 and you have already computed all of

X1' * X1,  X1' * Y1,  X2' * X2,  X2' * Y2

and you now want to do the regression on the combined dataset X = [X1; X2] and Y = [Y1; Y2], you don't actually have to recompute very much. The relationships

X' * X = X1' * X1 + X2' * X2
X' * Y = X1' * Y1 + X2' * Y2

hold, so with these computed you just calculate

beta = (X' * X)^(-1) * (X' * Y)

and you're done. This leads to a simple algorithm for OLS on very large datasets: read the data in chunks X1, Y1, X2, Y2, ..., accumulate the running sums of Xi' * Xi (a small 10 x 10 matrix) and Xi' * Yi, and solve for beta at the end.
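A minimal sketch of this chunked approach in numpy, with synthetic data standing in for a real on-disk dataset (the chunk count, chunk size, and noise level here are illustrative, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
n_chunks, chunk_rows, p = 20, 1000, 10
true_beta = rng.standard_normal((p, 1))

# Accumulators for X' * X (p x p) and X' * Y (p x 1).
A = np.zeros((p, p))
b = np.zeros((p, 1))

# Stream the data in chunks; only one chunk is in memory at a time.
# In practice each (Xi, Yi) would be read from disk instead of generated.
for _ in range(n_chunks):
    Xi = rng.standard_normal((chunk_rows, p))
    Yi = Xi @ true_beta + 0.01 * rng.standard_normal((chunk_rows, 1))
    A += Xi.T @ Xi
    b += Xi.T @ Yi

# Solve (X' * X) * beta = (X' * Y) without ever forming an inverse.
beta = np.linalg.solve(A, b)
```

Note that the accumulators are tiny (10 x 10 and 10 x 1) no matter how many rows the full dataset has, which is the whole point of the trick.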
That is not appreciably slower than loading in the entire dataset at once, and it uses far less memory.
Final note: you should never compute beta by explicitly forming the inverse of (X' * X), for two reasons: it is slower, and it is more prone to numerical error than solving the linear system directly.
Instead, you should solve the linear system

(X' * X) * beta = (X' * Y)
In MATLAB this is a simple one-liner
beta = (X' * X) \ (X' * Y);
and numpy has an analogous routine, np.linalg.solve, for solving linear systems without inverting a matrix.
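The numpy equivalent of the MATLAB one-liner looks like this (X and Y here are small synthetic arrays just to make the snippet self-contained):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
Y = X @ np.arange(1.0, 11.0).reshape(10, 1)  # known coefficients 1..10

# Solve the normal equations (X' * X) * beta = (X' * Y) directly,
# rather than computing np.linalg.inv(X.T @ X).
beta = np.linalg.solve(X.T @ X, X.T @ Y)
```

numpy also offers np.linalg.lstsq, which solves the least-squares problem from X and Y directly via an SVD and is more numerically robust, but it requires the full X in memory, so the chunked normal-equations route above is the one that scales to datasets that don't fit.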