Linear regression in NumPy with very large matrices - how to save memory?

时光取名叫无心 2021-02-06 16:42

So I have these ginormous matrices X and Y. X and Y both have 100 million rows, and X has 10 columns. I'm trying to implement linear regression with these matrices, and I need a way to do it without running out of memory.

3 Answers
  •  余生分开走
    2021-02-06 17:29

    A neat property of ordinary least squares regression is that if you have two datasets X1, Y1 and X2, Y2 and you have already computed all of

    • X1' * X1
    • X1' * Y1
    • X2' * X2
    • X2' * Y2

    And you now want to do the regression on the combined dataset X = [X1; X2] and Y = [Y1; Y2], you don't actually have to recompute very much. The relationships

    • X' * X = X1' * X1 + X2' * X2
    • X' * Y = X1' * Y1 + X2' * Y2

    hold, so with these computed you just calculate

    • beta = inv(X' * X) * (X' * Y)

    and you're done. This leads to a simple algorithm for OLS on very large datasets:

    • Load in part of the dataset (say, the first million rows) and compute X' * X and X' * Y (which are quite small matrices) and store them.
    • Keep doing this for the next million rows, until you have processed the whole dataset.
    • Add together all of the X' * Xs and X' * Ys that you have stored.
    • Compute beta = (X' * X) \ (X' * Y).

    That is not appreciably slower than loading in the entire dataset at once, and it uses far less memory.
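
    As a concrete illustration, here is a minimal NumPy sketch of that loop. The memory-mapped files X.dat and Y.dat (and their dtypes and shapes) are placeholder assumptions; any source that yields the data a chunk of rows at a time works the same way:

    import numpy as np

    n_rows, n_cols = 100_000_000, 10
    chunk = 1_000_000  # rows per block

    # Hypothetical on-disk arrays; substitute your own chunk-wise data source.
    X = np.memmap("X.dat", dtype=np.float64, mode="r", shape=(n_rows, n_cols))
    Y = np.memmap("Y.dat", dtype=np.float64, mode="r", shape=(n_rows, 1))

    XtX = np.zeros((n_cols, n_cols))
    XtY = np.zeros((n_cols, 1))

    for start in range(0, n_rows, chunk):
        Xb = X[start:start + chunk]  # only this block is read from disk
        Yb = Y[start:start + chunk]
        XtX += Xb.T @ Xb  # accumulate the small 10-by-10 matrix
        XtY += Xb.T @ Yb  # accumulate the small 10-by-1 vector

    # Solve the normal equations rather than inverting X' * X.
    beta = np.linalg.solve(XtX, XtY)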

    Final note: you should never compute beta by explicitly inverting (X' * X), for two reasons: it is slow, and it is prone to numerical error.

    Instead, you should solve the linear system -

    • (X' * X) * beta = X' * Y

    In MATLAB this is a simple one-liner:

    beta = (X' * X) \ (X' * Y);
    

    and I expect that numpy has a similar way of solving linear systems without needing to invert a matrix.
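
    It does: numpy.linalg.solve is the direct counterpart of MATLAB's backslash for a square system like this one. A short sketch, reusing the XtX and XtY accumulated above; the lstsq fallback is worth knowing if X' * X might be singular or badly conditioned:

    import numpy as np

    # Counterpart of MATLAB's (X' * X) \ (X' * Y); no explicit inverse is formed.
    beta = np.linalg.solve(XtX, XtY)

    # Fallback for a singular or ill-conditioned X' * X: a least-squares solve.
    beta, residuals, rank, sv = np.linalg.lstsq(XtX, XtY, rcond=None)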
