Linear regression in NumPy with very large matrices - how to save memory?

时光取名叫无心 2021-02-06 16:42

So I have these ginormous matrices X and Y. Both have 100 million rows, and X has 10 columns. I'm trying to implement linear regression with these matrices, and I need to compute (X^T * X)^-1 * X^T * Y without running out of memory.

3 Answers
  • 2021-02-06 17:17

    RAM's pretty cheap - you should consider investing. A system with 24 Gig of RAM doesn't necessarily cost an arm and a leg anymore - one of Dell's lower-end servers can pack in that much.

    If the matrices are sparse (lots of zeros), use a sparse matrix class to save a lot of RAM.
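
    To illustrate the saving, here is a small sketch using SciPy's sparse classes (this assumes SciPy is available and that your data really is mostly zeros; the matrix here is synthetic):

        import numpy as np
        from scipy import sparse

        rng = np.random.default_rng(0)
        n_rows, n_cols, n_nonzero = 1_000_000, 10, 10_000

        # Build the matrix directly from (row, col, value) triplets,
        # never allocating the dense array at all.
        rows = rng.integers(0, n_rows, size=n_nonzero)
        cols = rng.integers(0, n_cols, size=n_nonzero)
        vals = rng.standard_normal(n_nonzero)
        X_sparse = sparse.csc_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))

        # Dense float64 storage would need n_rows * n_cols * 8 bytes (~80 MB);
        # the sparse version stores only the non-zero entries plus their indices.
        print(X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes)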

    If the matrices aren't sparse, you'll either want more RAM (or at least more Virtual Memory), or to do your matrix operations using disk files.

    Disk files are of course an order of magnitude slower than RAM, and thrashing your virtual memory system could actually be worse than that, depending on your access patterns.
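
    If you do go the disk-file route, numpy.memmap is one way to treat a file on disk as an array without reading it all in (a minimal sketch; the file name and shape are assumptions):

        import numpy as np

        n_rows, n_cols = 100_000_000, 10

        # Map the raw float64 file as an array; the OS pages in only the
        # parts you actually touch, so RAM usage stays bounded.
        X = np.memmap("X.dat", dtype=np.float64, mode="r", shape=(n_rows, n_cols))

        # Work on one slice at a time rather than on the whole array.
        first_chunk = np.asarray(X[:1_000_000])
        print(first_chunk.mean(axis=0))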

  • 2021-02-06 17:26

    X is 100e6 x 10 and Y is 100e6 x 1,

    so the final result (X^T * X)^-1 * X^T * Y is only 10 x 1.

    You can calculate it in the following steps:

    1. calculate a = X^T * X -> 10 x 10
    2. calculate b = X^T * Y -> 10 x 1
    3. calculate a^-1 * b

    The matrices in step 3 are tiny, so the only real work is breaking steps 1 and 2 into pieces that fit in memory.

    For example, you can read column 0 of X together with Y and compute that entry of X^T * Y with numpy.dot(X0, Y).

    With float64 dtype, X0 and Y are about 800 MB each (roughly 1600 MB together). If that doesn't fit in memory, call numpy.dot twice, once on the first halves and once on the second halves of X0 and Y, and add the results.

    So calculating X^T * Y takes 20 calls to numpy.dot, and calculating X^T * X takes 200.
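
    A rough sketch of that bookkeeping, assuming X and Y sit on disk as raw float64 files read through numpy.memmap (the file names are assumptions; swap the loading for however your data is actually stored):

        import numpy as np

        n_rows, n_cols = 100_000_000, 10
        half = n_rows // 2

        X = np.memmap("X.dat", dtype=np.float64, mode="r", shape=(n_rows, n_cols))
        Y = np.memmap("Y.dat", dtype=np.float64, mode="r", shape=(n_rows,))

        def halved_dot(u, v):
            # Two numpy.dot calls, one per half, so each call touches a smaller array.
            return np.dot(u[:half], v[:half]) + np.dot(u[half:], v[half:])

        # X^T * Y: 10 columns x 2 halves = 20 numpy.dot calls
        XtY = np.array([halved_dot(X[:, i], Y) for i in range(n_cols)])

        # X^T * X: 10 x 10 entries x 2 halves = 200 numpy.dot calls
        XtX = np.array([[halved_dot(X[:, i], X[:, j]) for j in range(n_cols)]
                        for i in range(n_cols)])

        # Step 3: the 10 x 10 system is tiny
        beta = np.linalg.solve(XtX, XtY)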

  • 2021-02-06 17:29

    A neat property of ordinary least squares regression is that if you have two datasets X1, Y1 and X2, Y2 and you have already computed all of

    • X1' * X1
    • X1' * Y1
    • X2' * X2
    • X2' * Y2

    And you now want to do the regression on the combined dataset X = [X1; X2] and Y = [Y1; Y2], you don't actually have to recompute very much. The relationships

    • X' * X = X1' * X1 + X2' * X2
    • X' * Y = X1' * Y1 + X2' * Y2

    hold, so with these computed you just calculate

    • beta = inv(X' * X) * (X' * Y)

    and you're done. This leads to a simple algorithm for OLS on very large datasets:

    • Load in part of the dataset (say, the first million rows) and compute X' * X and X' * Y (which are quite small matrices) and store them.
    • Keep doing this for the next million rows, until you have processed the whole dataset.
    • Add together all of the X' * Xs and X' * Ys that you have stored
    • Compute beta = (X' * X) \ (X' * Y)

    That is not appreciably slower than loading in the entire dataset at once, and it uses far less memory.

    Final note: you should never compute beta by first computing (X' * X) and finding its inverse (for two reasons - 1. it is slow, and 2. it is prone to numerical error).

    Instead, you should solve the linear system -

    • (X' * X) * beta = X' * Y

    In MATLAB this is a simple one-liner

    beta = (X' * X) \ (X' * Y);
    

    and I expect that numpy has a similar way of solving linear systems without needing to invert a matrix.
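
    It does: numpy.linalg.solve (or numpy.linalg.lstsq) plays the role of MATLAB's backslash. Here is a sketch of the whole chunked procedure, assuming the data lives on disk as raw float64 files read through numpy.memmap (the file names are assumptions):

        import numpy as np

        n_rows, n_cols, chunk = 100_000_000, 10, 1_000_000

        X = np.memmap("X.dat", dtype=np.float64, mode="r", shape=(n_rows, n_cols))
        Y = np.memmap("Y.dat", dtype=np.float64, mode="r", shape=(n_rows,))

        XtX = np.zeros((n_cols, n_cols))
        XtY = np.zeros(n_cols)

        # Accumulate X' * X and X' * Y a million rows at a time.
        for start in range(0, n_rows, chunk):
            Xc = np.asarray(X[start:start + chunk])   # one chunk in RAM
            Yc = np.asarray(Y[start:start + chunk])
            XtX += Xc.T @ Xc
            XtY += Xc.T @ Yc

        # Solve (X' * X) * beta = X' * Y rather than forming the inverse.
        beta = np.linalg.solve(XtX, XtY)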
