This question relates to a machine learning feature selection procedure.
I have a large matrix of features - columns are the features of the subjects (rows):
set.seed(1)
features.mat <- matrix(rnorm(10*100),ncol=100)
colnames(features.mat) <- paste("F",1:100,sep="")
rownames(features.mat) <- paste("S",1:10,sep="")
The response was measured for each subject (S
) under different conditions (C
) and therefore looks like this:
response.df <-
data.frame(S = c(sapply(1:10, function(x) rep(paste("S", x, sep = ""),100))),
C = rep(paste("C", 1:100, sep = ""), 10),
response = rnorm(1000), stringsAsFactors = F)
So I match the subjects in response.df
:
match.idx <- match(response.df$S, rownames(features.mat))
I'm looking for a fast way to compute the univariate regression of each feature and the response.
Anything faster than this?:
fun <- function(f){
fit <- lm(response.df$response ~ features.mat[match.idx,f])
beta <- coef(summary(fit))
data.frame(feature = colnames(features.mat)[f], effect = beta[2,1],
p.val = beta[2,4], stringsAsFactors = F))
}
res <- do.call(rbind, lapply(1:ncol(features.mat), fun))
I am interested in marginal boost, i.e., methods other than using parallel computing via mclapply
or mclapply2
.
I would provide a light-weighed toy routine for estimation of a simple regression model: y ~ x
, i.e., a regression line with only an intercept and slope. As will be seen, this is 36 times faster than lm
+ summary.lm
.
## toy data
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)
## fast estimation of simple linear regression: y ~ x
simplelm <- function (x, y) {
## number of data
n <- length(x)
## centring
y0 <- sum(y) / length(y); yc <- y - y0
x0 <- sum(x) / length(x); xc <- x - x0
## fitting an intercept-free model: yc ~ xc + 0
xty <- c(crossprod(xc, yc))
xtx <- c(crossprod(xc))
slope <- xty / xtx
rc <- yc - xc * slope
## Pearson estimate of residual standard error
sigma2 <- c(crossprod(rc)) / (n - 2)
## standard error for slope
slope_se <- sqrt(sigma2 / xtx)
## t-score and p-value for slope
tscore <- slope / slope_se
pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
## return estimation summary for slope
c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
}
Let's have a test:
simplelm(x, y)
# Estimate Std. Error t value Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15
On the other hand, lm
+ summary.lm
gives:
coef(summary(lm(y ~ x)))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1154549 0.01373051 8.408633 5.350248e-11
#x 0.2656737 0.02279663 11.654079 1.337380e-15
So the result matches. If you require R-squared and adjusted R-squared, it can be easily computed, too.
Let's have a benchmark:
set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)
library(microbenchmark)
microbenchmark(coef(summary(lm(y ~ x))), simplelm(x, y))
#Unit: microseconds
# expr min lq mean median uq
# coef(summary(lm(y ~ x))) 14158.28 14305.28 17545.1544 14444.34 17089.00
# simplelm(x, y) 235.08 265.72 485.4076 288.20 319.46
# max neval cld
# 114662.2 100 b
# 3409.6 100 a
Holy!!! We have 36 times boost!
Remark-1 (solving normal equation)
The simplelm
is based on solving normal equation via Cholesky factorization. But since it is simple, no actual matrix computation is involved. If we need regression with multiple covariates, we can use the lm.chol
defined in my this answer.
Normal equation can also be solved by using LU factorization. I will not touch on this, but if you feel interested, here is it: Solving normal equation gives different coefficients from using lm
?.
Remark-2 (alternative via cor.test
)
The simplelm
is an extension to the fastsim
in my answer Monte Carlo simulation of correlation between two Brownian motion (continuous random walk). An alternative way is based on cor.test
. It is also much faster than lm
+ summary.lm
, but as shown in that answer, it is yet slower than my proposal above.
Remark-3 (alternative via QR method)
QR based method is also possible, in which case we want to use .lm.fit
, a light-weighed wrapper for qr.default
, qr.coef
, qr.fitted
and qr.resid
at C-level. Here is how we can add this option to our simplelm
:
## fast estimation of simple linear regression: y ~ x
simplelm <- function (x, y, QR = FALSE) {
## number of data
n <- length(x)
## centring
y0 <- sum(y) / length(y); yc <- y - y0
x0 <- sum(x) / length(x); xc <- x - x0
## fitting intercept free model: yc ~ xc + 0
if (QR) {
fit <- .lm.fit(matrix(xc), yc)
slope <- fit$coefficients
rc <- fit$residuals
} else {
xty <- c(crossprod(xc, yc))
xtx <- c(crossprod(xc))
slope <- xty / xtx
rc <- yc - xc * slope
}
## Pearson estimate of residual standard error
sigma2 <- c(crossprod(rc)) / (n - 2)
## standard error for slope
if (QR) {
slope_se <- sqrt(sigma2) / abs(fit$qr[1])
} else {
slope_se <- sqrt(sigma2 / xtx)
}
## t-score and p-value for slope
tscore <- slope / slope_se
pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
## return estimation summary for slope
c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
}
For our toy data, both QR method and Cholesky method give the same result:
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)
simplelm(x, y, TRUE)
# Estimate Std. Error t value Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15
simplelm(x, y, FALSE)
# Estimate Std. Error t value Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15
QR methods is known to be 2 ~ 3 times slower than Cholesky method (Read my answer Why the built-in lm function is so slow in R? for detailed explanation). Here is a quick check:
set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)
library(microbenchmark)
microbenchmark(simplelm(x, y, TRUE), simplelm(x, y))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# simplelm(x, y, TRUE) 776.88 873.26 1073.1944 908.72 933.82 3420.92 100 b
# simplelm(x, y) 238.32 292.02 441.9292 310.44 319.32 3515.08 100 a
So indeed, 908 / 310 = 2.93
.
Remark-4 (simple regression for GLM)
If we move on to GLM, there is also a fast, light-weighed version based on glm.fit
. You can read my answer R loop help: leave out one observation and run glm one variable at a time and use function f
defined there. At the moment f
is customized to logistic regression, but we can generalize it to other response easily.
来源:https://stackoverflow.com/questions/40141738/is-there-a-fast-estimation-of-simple-regression-a-regression-line-with-only-int