So I have these ginormous matrices X and Y. X and Y both have 100 million rows, and X has 10 columns. I'm trying to implement linear regression with these matrices, and I need a way to do it without loading the entire dataset into memory at once.
A neat property of ordinary least squares regression is that if you have two datasets X1, Y1 and X2, Y2 and you have already computed all of

X1' * X1,  X1' * Y1,  X2' * X2,  X2' * Y2

and you now want to do the regression on the combined dataset X = [X1; X2] and Y = [Y1; Y2], you don't actually have to recompute very much. The relationships

X' * X = X1' * X1 + X2' * X2
X' * Y = X1' * Y1 + X2' * Y2

hold, so with these computed you just calculate

beta = (X' * X)^(-1) * (X' * Y)

and you're done. This leads to a simple algorithm for OLS on very large datasets: read the data in chunks X1, Y1, X2, Y2, ..., accumulate the running sums of Xi' * Xi (a small 10 x 10 matrix) and Xi' * Yi, and solve for beta at the end.
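A minimal sketch of this chunked approach in numpy, with synthetic data standing in for a real on-disk dataset (the chunk count, chunk size, and noise level here are illustrative, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
n_chunks, chunk_rows, p = 20, 1000, 10
true_beta = rng.standard_normal((p, 1))

# Accumulators for X' * X (p x p) and X' * Y (p x 1).
A = np.zeros((p, p))
b = np.zeros((p, 1))

# Stream the data in chunks; only one chunk is in memory at a time.
# In practice each (Xi, Yi) would be read from disk instead of generated.
for _ in range(n_chunks):
    Xi = rng.standard_normal((chunk_rows, p))
    Yi = Xi @ true_beta + 0.01 * rng.standard_normal((chunk_rows, 1))
    A += Xi.T @ Xi
    b += Xi.T @ Yi

# Solve (X' * X) * beta = (X' * Y) without ever forming an inverse.
beta = np.linalg.solve(A, b)
```

Note that the accumulators are tiny (10 x 10 and 10 x 1) no matter how many rows the full dataset has, which is the whole point of the trick.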
That is not appreciably slower than loading in the entire dataset at once, and it uses far less memory.
Final note: you should never compute beta by explicitly forming the inverse of (X' * X), for two reasons: it is slower, and it is more prone to numerical error than solving the linear system directly.
Instead, you should solve the linear system

(X' * X) * beta = (X' * Y)
In MATLAB this is a simple one-liner
beta = (X' * X) \ (X' * Y);
and numpy has an analogous routine, np.linalg.solve, for solving linear systems without inverting a matrix.
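The numpy equivalent of the MATLAB one-liner looks like this (X and Y here are small synthetic arrays just to make the snippet self-contained):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
Y = X @ np.arange(1.0, 11.0).reshape(10, 1)  # known coefficients 1..10

# Solve the normal equations (X' * X) * beta = (X' * Y) directly,
# rather than computing np.linalg.inv(X.T @ X).
beta = np.linalg.solve(X.T @ X, X.T @ Y)
```

numpy also offers np.linalg.lstsq, which solves the least-squares problem from X and Y directly via an SVD and is more numerically robust, but it requires the full X in memory, so the chunked normal-equations route above is the one that scales to datasets that don't fit.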