How to compute minimal but fast linear regressions on each column of a response matrix?

Dirk Eddelbuettel

Have a look at the fastLm() function in the RcppArmadillo package on CRAN. There is also a similar fastLm() in RcppGSL which preceded this -- but you probably want the Armadillo-based solution. I have some slides in older presentations (on HPC with R) that show the speed gains.
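
For reference, here is a minimal sketch of that Armadillo-based route applied column by column to a response matrix; the data, dimensions, and variable names are made up for illustration and are not taken from the question:

## Illustrative data: one predictor and a 100 x 5 response matrix (sizes assumed)
library(RcppArmadillo)
set.seed(1)
x <- rnorm(100)
y <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)

X <- cbind(1, x)                      # model matrix with an intercept column

## One fit per column of y; fastLmPure() takes the model matrix directly
fits <- lapply(seq_len(ncol(y)), function(j) fastLmPure(X, y[, j]))
fits[[1]]$coefficients                # intercept and slope for the first column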

Also note the hint in the help page about better 'pivoted' approaches than the straight inverse of X'X, which can matter with degenerate (rank-deficient) model matrices.
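
A tiny illustration of why that hint matters (a made-up, nearly collinear example; the actual discussion is in the fastLm help page):

## Nearly collinear model matrix: the straight normal-equations solve is fragile here
set.seed(2)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 1e-9)       # effectively a duplicated column
X  <- cbind(1, x1, x2)
y  <- 1 + 2 * x1 + rnorm(50)

## Inverse of X'X: may error out ("computationally singular") or lose precision
beta_ne <- try(solve(crossprod(X), crossprod(X, y)), silent = TRUE)

## Pivoted QR, as used by lm.fit(): detects the rank deficiency instead
beta_qr <- lm.fit(X, y)$coefficients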

Following Marek's comment, below are the results of comparing the built-in functions "lm" and "lm.fit", my own function, and "fastLm" and "fastLmPure" from the RcppArmadillo package:

> system.time( res1 <- apply(x, 1, function(xi){lm(y ~ xi)}) )
   user  system elapsed
  2.859   0.005   2.865
> system.time( res2 <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)}) )
   user  system elapsed
  4.620   0.004   4.626
> system.time( res3 <- apply(x, 1, function(xi){lm.fit(x=cbind(1,xi), y=y)}) )
   user  system elapsed
  0.454   0.004   0.458
> system.time( res4 <- apply(x, 1, function(xi){apply(y, 2, fastLm, x=cbind(1,xi))}) )
   user  system elapsed
  2.279   0.005   2.283
> system.time( res5 <- apply(x, 1, function(xi){apply(y, 2, fastLmPure, cbind(1,xi))}) )
   user  system elapsed
  1.053   0.003   1.056
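
The data behind these timings are not shown here; from the way the results are accessed below, the rows of x appear to be named "p1", "p2", ... and the columns of y "out1", "out2", .... Purely to illustrate the shape involved, a comparable setup could be built as follows, with all dimensions and values invented:

## Invented setup with the same shape as the calls above: each row of x is one
## predictor vector, each column of y one response (actual sizes/values unknown)
set.seed(42)
x <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)
y <- matrix(rnorm(50 * 20),  nrow = 50,  ncol = 20)
rownames(x) <- paste0("p",   seq_len(nrow(x)))   # results then accessible as res1$p1, ...
colnames(y) <- paste0("out", seq_len(ncol(y)))   # inner results appear as ...$out1, ...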

However, be careful when comparing these numbers. The differences are due not only to the different implementations, but also to which results each function actually computes:

> names(res1$p1)
 [1] "coefficients"  "residuals"     "effects"       "rank"        
 [5] "fitted.values" "assign"        "qr"            "df.residual" 
 [9] "xlevels"       "call"          "terms"         "model"       
> # res2 (from my own custom function) returns the estimate of beta, its standard error, the estimate of sigma and the R^2
> names(res3$p1)
[1] "coefficients"  "residuals"     "effects"       "rank"        
[5] "fitted.values" "assign"        "qr"            "df.residual" 
> names(res4$p1$out1)
[1] "coefficients"  "stderr"        "df"            "fitted.values"
[5] "residuals"     "call"        
> names(res5$p1$out1)
[1] "coefficients" "stderr"       "df"         

For instance, we may prefer "lm.fit" over "lm", but if we need the R^2 we will have to compute it ourselves. Likewise, we may want "fastLm" instead of "lm", but if we want the estimate of sigma we will have to compute it ourselves. And computing such things with a custom R function may not be very efficient (compared to what "lm" does internally).
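
As a rough sketch of what computing these quantities ourselves looks like (illustrative data; the R^2 formula assumes the model contains an intercept):

## Invented single-regression data
set.seed(3)
xi <- rnorm(50)
yi <- 1 + 2 * xi + rnorm(50)

## R^2 from an lm.fit() result
fit <- lm.fit(x = cbind(1, xi), y = yi)
r2  <- 1 - sum(fit$residuals^2) / sum((yi - mean(yi))^2)

## Estimate of sigma (residual standard error) from a fastLm() result,
## using the RSS and the residual degrees of freedom n - p
flm   <- RcppArmadillo::fastLm(cbind(1, xi), yi)
sigma <- sqrt(sum(flm$residuals^2) / (length(yi) - length(flm$coefficients)))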

In light of all this, I will keep using "lm" for the moment, but Dirk's point about "fastLm" is indeed good advice (that's why I accepted his answer, as it should be of interest to other people).
