So I have these ginormous matrices X and Y. X and Y both have 100 million rows, and X has 10 columns. I'm trying to implement linear regression with these matrices, and I need to compute (X^T*X)^-1 * X^T * Y, but the matrices are far too big to hold in memory at once.
RAM's pretty cheap - you should consider investing. A system with 24 Gig of RAM doesn't necessarily cost an arm and a leg anymore - one of Dell's lower-end servers can pack in that much.
If the matrices are sparse (lots of zeros), use a sparse matrix class to save a lot of RAM.
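For example, here is a minimal sketch with scipy.sparse (the library choice is an assumption; the answer just says "a sparse matrix class"):

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)

    # Demo: a 1,000,000 x 10 matrix where roughly 99% of entries are zero
    dense = rng.random((1_000_000, 10))
    dense[dense < 0.99] = 0.0

    # CSR format stores only the nonzero entries
    X = sparse.csr_matrix(dense)
    print(dense.nbytes)                                        # ~80 MB dense
    print(X.data.nbytes + X.indices.nbytes + X.indptr.nbytes)  # far smaller

    # The regression building blocks work directly on the sparse matrix
    y = rng.random(1_000_000)
    xtx = (X.T @ X).toarray()  # 10 x 10, small and dense
    xty = X.T @ y              # length-10 vector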
If the matrices aren't sparse, you'll either want more RAM (or at least more Virtual Memory), or to do your matrix operations using disk files.
Disk files are of course an order of magnitude slower than RAM, and thrashing your virtual memory system could actually be worse than that, depending on your access patterns.
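If you go the disk-file route, one option in numpy (an assumption; the answer doesn't name a mechanism) is np.memmap, which maps a file on disk as an array and only pulls in the pieces you touch:

    import numpy as np

    # Disk-backed array: the data lives in 'X.dat', not in RAM
    # (demo shape; the real X would be (100_000_000, 10), ~8 GB on disk)
    X = np.memmap('X.dat', dtype=np.float64, mode='w+', shape=(1_000_000, 10))

    # Process in row blocks so only one block is resident at a time
    block = 100_000
    for i in range(0, X.shape[0], block):
        X[i:i + block] = np.random.default_rng(i).random((block, 10))
    X.flush()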
The size of X is 100e6 x 10 and the size of Y is 100e6 x 1, so the final size of (X^T*X)^-1 * X^T * Y is 10 x 1.
You can calculate it in the following steps:

1. a = X^T*X   -> 10 x 10
2. b = X^T*Y   -> 10 x 1
3. a^-1 * b    -> 10 x 1

The matrices in step 3 are very small, so you just need to break steps 1 and 2 into intermediate pieces that fit in memory.
For example, you can read column 0 of X together with Y, and calculate that entry with numpy.dot(X0, Y). For float64 dtype, X0 and Y together take about 1600 MB; if that can't fit in memory, you can call numpy.dot twice, on the first half and the second half of X0 and Y separately.
So to calculate X^T*Y you need to call numpy.dot 20 times, and to calculate X^T*X you need to call numpy.dot 200 times.
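A minimal sketch of that column-by-column approach (the demo sizes are scaled down; in the real problem the columns would be read from disk rather than sliced from an in-memory array):

    import numpy as np

    # Demo sizes: in the real problem n would be 100e6
    n, k = 1_000_000, 10
    rng = np.random.default_rng(0)
    X = rng.random((n, k))
    Y = rng.random(n)

    half = n // 2

    # b = X^T*Y: one dot per column, split into two halves -> 2*k = 20 calls
    b = np.empty(k)
    for j in range(k):
        Xj = X[:, j]
        b[j] = np.dot(Xj[:half], Y[:half]) + np.dot(Xj[half:], Y[half:])

    # a = X^T*X: one dot per (i, j) pair, again halved -> 2*k*k = 200 calls
    a = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            Xi, Xj = X[:, i], X[:, j]
            a[i, j] = np.dot(Xi[:half], Xj[:half]) + np.dot(Xi[half:], Xj[half:])

    # Check against the direct computation
    assert np.allclose(b, X.T @ Y)
    assert np.allclose(a, X.T @ X)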
A neat property of ordinary least squares regression is that if you have two datasets X1, Y1 and X2, Y2 and you have already computed all of

    X1' * X1,   X1' * Y1,   X2' * X2,   X2' * Y2

and you now want to do the regression on the combined dataset X = [X1; X2] and Y = [Y1; Y2], you don't actually have to recompute very much. The relationships

    X' * X = X1' * X1 + X2' * X2
    X' * Y = X1' * Y1 + X2' * Y2

hold, so with these computed you just calculate

    beta = (X' * X)^-1 * (X' * Y)

and you're done. This leads to a simple algorithm for OLS on very large datasets:

1. Load in a chunk of rows X1, Y1 that fits in memory and compute A = X1' * X1 and b = X1' * Y1.
2. Load in the next chunk X2, Y2 and update A = A + X2' * X2 and b = b + X2' * Y2.
3. Repeat step 2 until the data is exhausted, then solve A * beta = b.
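In numpy, that streaming accumulation might look like this (the chunk generator and sizes are just for the demo; any source of row blocks works):

    import numpy as np

    def ols_streaming(chunks):
        """Accumulate X' * X and X' * Y over row chunks, then solve for beta."""
        A = None  # running X' * X
        b = None  # running X' * Y
        for Xi, Yi in chunks:
            if A is None:
                A = np.zeros((Xi.shape[1], Xi.shape[1]))
                b = np.zeros(Xi.shape[1])
            A += Xi.T @ Xi
            b += Xi.T @ Yi
        # Solve A * beta = b rather than inverting A
        return np.linalg.solve(A, b)

    # Demo: split one dataset into chunks, check against the direct solution
    rng = np.random.default_rng(0)
    X = rng.random((100_000, 10))
    Y = rng.random(100_000)
    chunks = ((X[i:i + 10_000], Y[i:i + 10_000]) for i in range(0, 100_000, 10_000))
    beta = ols_streaming(chunks)
    assert np.allclose(beta, np.linalg.solve(X.T @ X, X.T @ Y))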
That is not appreciably slower than loading in the entire dataset at once, and it uses far less memory.
Final note: you should never compute beta by first computing (X' * X) and finding its inverse (for two reasons - 1. it is slow, and 2. it is prone to numerical error).
Instead, you should solve the linear system

    (X' * X) * beta = X' * Y

directly.
In MATLAB this is a simple one-liner:
beta = (X' * X) \ (X' * Y);
and I expect that numpy has a similar way of solving linear systems without needing to invert a matrix.
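It does: np.linalg.solve solves the system without forming an explicit inverse, and np.linalg.lstsq solves the least-squares problem without forming X' * X at all. A minimal sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((1_000, 10))
    Y = rng.random(1_000)

    # numpy equivalent of MATLAB's (X' * X) \ (X' * Y): no explicit inverse
    beta = np.linalg.solve(X.T @ X, X.T @ Y)

    # Alternative that works on X directly, avoiding the normal equations
    beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
    assert np.allclose(beta, beta_lstsq)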