问题
This Q & A arises from How to make group_by and lm fast? where OP was trying to do a simple linear regression per group for a large data frame.
In theory, a series of group-by regression y ~ x | g
is equivalent to a single pooled regression y ~ x * g
. The latter is very appealing because statistical test between different groups is straightforward. But in practice doing this larger regression is not computationally easy. My answer on the linked Q & A reviews packages speedlm
and glm4
, but pointed out that they can't well address this problem.
Large regression problem is difficult, particularly when there are factor variables. This may explain why many people abandon this idea and prefer to splitting data by group and fitting models by group. There is no point for me to enumerate methods on group-by regression (see Linear Regression and group by in R). What I care is speed.
For simple linear regression as y ~ x | g
, splitting data by group then relying on standard model fitting routines like lm
is a performance killer. First of all, subsetting a large data frame is inefficient. Secondly, standard model fitting routines follow the procedure below, which are sheer overhead to the useful regression computation.
- parse model formula to "terms" object (using
terms.formula
); - construct model frame (using
model.frame.default
); - build model matrix (using
model.matrix.default
).
There are clever computational tricky for simple linear regression. As I demonstrated in Fast pairwise simple linear regression between variables in a data frame, the covariance method is extremely fast. Can we adapt it to group-by simple linear regression via a group_by_simpleLM
function?
回答1:
We have to do this by writing compiled code. I would present this with Rcpp. Be aware that I am a C programmer and have been using R's conventional C interface. Rcpp is used just to ease handling of lists, strings and attributes, as well as to facilitate immediate testing in R. The code is largely written in C-style. Macros from R's conventional C interface like REAL
and INTEGER
are still used. See the bottom of this answer for "group_by_simpleLM.cpp".
The R wrapper function group_by_simpleLM
has four arguments:
group_by_simpleLM <- function (dat, LHS, RHS, group) {
##.... [TRUNCATED]
dat
is a data frame. If you feed a matrix or a list, it will stop and complain.LHS
is a character vector giving the names for the variables on the left hand side of~
. Multiple LHS variables are supported.RHS
is a character vector giving the names for the variables on the right hand side of~
. Only a single, non-factor RHS variable is allowed in simple linear regression. You can provide a vector of variables toRHS
, but the function will only retain the first element (with a warning). If that variable is not found indat
(maybe because you've mistyped the name) or it is not a numerical variable, it will give you an informative error message.group
is a character vector giving the name of the grouping variable. It should best be readily a factor indat
, otherwise the function usesmatch(group, unique(group))
for a fast coercion and a warning is issued. A factor with unused levels is no harm.group_by_simpleLM_cpp
sees this and returns you allNaN
for such levels.group
can beNULL
so that a single regression is done for all data.
The workhorse function group_by_simpleLM_cpp
returns a named list of matrices to hold regression results for each response. Each matrix is "wide" with nlevels(group)
columns and 5 rows:
- "alpha" (for intercept);
- "beta" (for slope);
- "beta.se" (for standard error of slope);
- "r2" (for R-squared);
- "df.resid" (for residual degree of freedom);
For a simple linear regression, these five statistics are sufficient to obtain other statistics.
The function watches out for rank-deficient case where there is only one datum in a group. Slope can not be estimated then and NaN
is returned. Another special case is when a group only has two data. The fit is then perfect and you get 0 standard error for the slope.
The function is a fast method for nlme::lmList(RHS ~ LHS | group, dat, pool = FALSE)
when group != NULL
, and a fast method for lm(RHS ~ LHS, dat)
when group = NULL
(may even be faster than general_paired_simpleLM because it is written in C).
Caution:
- weighted regression is not handled, because the covariance method is invalid in that case.
- no check for
NA
/NaN
/Inf
/-Inf
indat
is made and functions break in their presence.
Examples
library(Rcpp)
sourceCpp("group_by_simpleLM.cpp")
## a toy dataset
set.seed(0)
dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10), x = 1:5,
f = gl(2, 5, labels = letters[1:2]),
g = sample(gl(2, 5, labels = LETTERS[1:2])))
group-by regression: a fast method of nlme::lmList
group_by_simpleLM(dat, c("y1", "y2"), "x", "f")
#$y1
# a b
#alpha 0.820107094 -2.7164723
#beta -0.009796302 0.8812007
#beta.se 0.266690568 0.2090644
#r2 0.000449565 0.8555330
#df.resid 3.000000000 3.0000000
#
#$y2
# a b
#alpha 0.1304709 0.06996587
#beta -0.1616069 -0.14685953
#beta.se 0.2465047 0.24815024
#r2 0.1253142 0.10454374
#df.resid 3.0000000 3.00000000
fit <- nlme::lmList(cbind(y1, y2) ~ x | f, data = dat, pool = FALSE)
## results for level "a"; use `fit[[2]]` to see results for level "b"
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.820107094 0.8845125 0.92718537 0.4222195
#x -0.009796302 0.2666906 -0.03673284 0.9730056
#
#$`Response y1`$r.squared
#[1] 0.000449565
#
#
#$`Response y2`
#$`Response y2`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1304709 0.8175638 0.1595850 0.8833471
#x -0.1616069 0.2465047 -0.6555936 0.5588755
#
#$`Response y2`$r.squared
#[1] 0.1253142
dealing with rank-deficiency without break-down
## with unused level "b"
group_by_simpleLM(dat[1:5, ], "y1", "x", "f")
#$y1
# a b
#alpha 0.820107094 NaN
#beta -0.009796302 NaN
#beta.se 0.266690568 NaN
#r2 0.000449565 NaN
#df.resid 3.000000000 NaN
## rank-deficient case for level "b"
group_by_simpleLM(dat[1:6, ], "y1", "x", "f")
#$y1
# a b
#alpha 0.820107094 -1.53995
#beta -0.009796302 NaN
#beta.se 0.266690568 NaN
#r2 0.000449565 NaN
#df.resid 3.000000000 0.00000
more than one grouping variables
When we have more than one grouping variables, group_by_simpleLM
can not handle them directly. But you can use interaction
to first create a single factor variable.
dat$fg <- with(dat, interaction(f, g, drop = TRUE, sep = ":"))
group_by_simpleLM(dat, c("y1", "y2"), "x", "fg")
#$y1
# a:A b:A a:B b:B
#alpha 1.4750325 -2.7684583 -1.6393289 -1.8513669
#beta -0.2120782 0.9861509 0.7993313 0.4613999
#beta.se 0.0000000 0.2098876 0.4946167 0.0000000
#r2 1.0000000 0.9566642 0.7231188 1.0000000
#df.resid 0.0000000 1.0000000 1.0000000 0.0000000
#
#$y2
# a:A b:A a:B b:B
#alpha 1.0292956 -0.22746944 -1.5096975 0.06876360
#beta -0.2657021 -0.20650690 0.2547738 0.09172993
#beta.se 0.0000000 0.01945569 0.3483856 0.00000000
#r2 1.0000000 0.99120195 0.3484482 1.00000000
#df.resid 0.0000000 1.00000000 1.0000000 0.00000000
fit <- nlme::lmList(cbind(y1, y2) ~ x | fg, data = dat, pool = FALSE)
## note that the first group a:A only has two values, so df.resid = 0
## my method returns 0 standard error for the slope
## but lm or lmList would return NaN
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.4750325 NaN NaN NaN
#x -0.2120782 NaN NaN NaN
#
#$`Response y1`$r.squared
#[1] 1
#
#
#$`Response y2`
#$`Response y2`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.0292956 NaN NaN NaN
#x -0.2657021 NaN NaN NaN
#
#$`Response y2`$r.squared
#[1] 1
no grouping: a fast method of lm
group_by_simpleLM(dat, c("y1", "y2"), "x", NULL)
#$y1
# ALL
#alpha -0.9481826
#beta 0.4357022
#beta.se 0.2408162
#r2 0.2903691
#df.resid 8.0000000
#
#$y2
# ALL
#alpha 0.1002184
#beta -0.1542332
#beta.se 0.1514935
#r2 0.1147012
#df.resid 8.0000000
fast large simple linear regression
set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
day = 3:7,
y1 = runif(nr),
y2 = rpois(nr, 3),
y3 = rnorm(nr),
y4 = rnorm(nr, 1, 5))
system.time(group_by_simpleLM(DF, paste0("y", 1:4), "day", "subject"))
# user system elapsed
# 0.200 0.016 0.219
library(MatrixModels)
system.time(glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE))
# user system elapsed
# 9.012 0.172 9.266
group_by_simpleLM
does all 4 responses almost instantly, while glm4
needs 9s for one response alone!
Note that glm4
can break down in rank-deficient case, while group_by_simpleLM
would not.
Appendix: "group_by_simpleLM.cpp"
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List group_by_simpleLM_cpp (List Y, NumericVector x, IntegerVector group, CharacterVector group_levels, bool group_unsorted) {
/* number of data and number of responses */
int n = x.size(), k = Y.size(), n_groups = group_levels.size();
/* set up result list */
List result(k);
List dimnames = List::create(CharacterVector::create("alpha", "beta", "beta.se", "r2", "df.resid"), group_levels);
int j; for (j = 0; j < k; j++) {
NumericMatrix mat(5, n_groups);
mat.attr("dimnames") = dimnames;
result[j] = mat;
}
result.attr("names") = Y.attr("names");
/* set up a vector to hold sample size for each group */
size_t *group_offset = (size_t *)calloc(n_groups + 1, sizeof(size_t));
/*
compute group offset: cumsum(group_offset)
The offset is used in a different way when group is sorted or unsorted
In the former case, it is the offset to real x, y values;
In the latter case, it is the offset to ordering index indx
*/
int *u = INTEGER(group), *u_end = u + n, i;
if (n_groups > 1) {
while (u < u_end) group_offset[*u++]++;
for (i = 0; i < n_groups; i++) group_offset[i + 1] += group_offset[i];
} else {
group_offset[1] = n;
group_unsorted = 0;
}
/* local variables & pointers */
double *xi, *xi_end; /* pointer to the 1st and the last x value */
double *yi; /* pointer to the first y value */
int gi; double inv_gi; /* sample size of the i-th group */
double xi_mean, xi_var; /* mean & variance of x values in the i-th group */
double yi_mean, yi_var; /* mean & variance of y values in the i-th group */
double xiyi_cov; /* covariance between x and y values in the i-th group */
double beta, r2; int df_resi;
double *matij;
/* additional storage and variables when group is unsorted */
int *indx; double *xb, *xbi, dtmp;
if (group_unsorted) {
indx = (int *)malloc(n * sizeof(int));
xb = (double *)malloc(n * sizeof(double)); // buffer x for caching
R_orderVector1(indx, n, group, TRUE, FALSE); // Er, how is TRUE & FALSE recogonized as Rboolean?
}
/* loop through groups */
for (i = 0; i < n_groups; i++) {
/* set group size gi */
gi = group_offset[i + 1] - group_offset[i];
/* special case for a factor level with no data */
if (gi == 0) {
for (j = 0; j < k; j++) {
/* matrix column for write-back */
matij = REAL(result[j]) + i * 5;
matij[0] = R_NaN; matij[1] = R_NaN; matij[2] = R_NaN;
matij[3] = R_NaN; matij[4] = R_NaN;
}
continue;
}
/* rank-deficient case */
if (gi == 1) {
gi = group_offset[i];
if (group_unsorted) gi = indx[gi];
for (j = 0; j < k; j++) {
/* matrix column for write-back */
matij = REAL(result[j]) + i * 5;
matij[0] = REAL(Y[j])[gi];
matij[1] = R_NaN; matij[2] = R_NaN;
matij[3] = R_NaN; matij[4] = 0.0;
}
continue;
}
/* general case where a regression line can be estimated */
inv_gi = 1 / (double)gi;
/* compute mean & variance of x values in this group */
xi_mean = 0.0; xi_var = 0.0;
if (group_unsorted) {
/* use u, u_end and xbi */
xi = REAL(x);
u = indx + group_offset[i]; /* offset acts on index */
u_end = u + gi;
xbi = xb + group_offset[i];
for (; u < u_end; xbi++, u++) {
dtmp = xi[*u];
xi_mean += dtmp;
xi_var += dtmp * dtmp;
*xbi = dtmp;
}
} else {
/* use xi and xi_end */
xi = REAL(x) + group_offset[i]; /* offset acts on values */
xi_end = xi + gi;
for (; xi < xi_end; xi++) {
xi_mean += *xi;
xi_var += (*xi) * (*xi);
}
}
xi_mean = xi_mean * inv_gi;
xi_var = xi_var * inv_gi - xi_mean * xi_mean;
/* loop through responses doing simple linear regression */
for (j = 0; j < k; j++) {
/* compute mean & variance of y values, as well its covariance with x values */
yi_mean = 0.0; yi_var = 0.0; xiyi_cov = 0.0;
if (group_unsorted) {
xbi = xb + group_offset[i]; /* use buffered x values */
yi = REAL(Y[j]);
u = indx + group_offset[i]; /* offset acts on index */
for (; u < u_end; u++, xbi++) {
dtmp = yi[*u];
yi_mean += dtmp;
yi_var += dtmp * dtmp;
xiyi_cov += dtmp * (*xbi);
}
} else {
/* set xi and yi */
xi = REAL(x) + group_offset[i]; /* offset acts on values */
yi = REAL(Y[j]) + group_offset[i]; /* offset acts on values */
for (; xi < xi_end; xi++, yi++) {
yi_mean += *yi;
yi_var += (*yi) * (*yi);
xiyi_cov += (*yi) * (*xi);
}
}
yi_mean = yi_mean * inv_gi;
yi_var = yi_var * inv_gi - yi_mean * yi_mean;
xiyi_cov = xiyi_cov * inv_gi - xi_mean * yi_mean;
/* locate the right place to write back regression result */
matij = REAL(result[j]) + i * 5 + 4;
/* residual degree of freedom */
df_resi = gi - 2; *matij-- = (double)df_resi;
/* R-squared = squared correlation */
r2 = (xiyi_cov * xiyi_cov) / (xi_var * yi_var); *matij-- = r2;
/* standard error of regression slope */
if (df_resi == 0) *matij-- = 0.0;
else *matij-- = sqrt((1 - r2) * yi_var / (df_resi * xi_var));
/* regression slope */
beta = xiyi_cov / xi_var; *matij-- = beta;
/* regression intercept */
*matij = yi_mean - beta * xi_mean;
}
}
if (group_unsorted) {
free(indx);
free(xb);
}
free(group_offset);
return result;
}
/*** R
group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {
## basic input validation
if (missing(dat)) stop("no data provided to 'dat'!")
if (!is.data.frame(dat)) stop("'dat' must be a data frame!")
if (missing(LHS)) stop("no 'LHS' provided!")
if (!is.character(LHS)) stop("'LHS' must be provided as a character vector of variable names!")
if (missing(RHS)) stop("no 'RHS' provided!")
if (!is.character(RHS)) stop("'RHS' must be provided as a character vector of variable names!")
if (!is.null(group)) {
## grouping variable provided: a fast method of `nlme::lmList`
if (!is.character(group)) stop("'group' must be provided as a character vector of variable names!")
## ensure that group has length 1, is available in the data frame and is a factor
if (length(group) > 1L) {
warning("only one grouping variable allowed for group-by simple linear regression; ignoring all but the 1st variable provided!")
group <- group[1L]
}
grp <- dat[[group]]
if (is.null(grp)) stop(sprintf("grouping variable '%s' not found in 'dat'!", group))
if (is.factor(grp)) {
grp_levels <- levels(grp)
} else {
warning("grouping variable is not provided as a factor; fast coercion is made!")
grp_levels <- unique(grp)
grp <- match(grp, grp_levels)
grp_levels <- as.character(grp_levels)
}
grp_unsorted <- .Internal(is.unsorted(grp, FALSE))
} else {
## no grouping; a fast method of `lm`
grp <- 1L; grp_levels <- "ALL"; grp_unsorted <- FALSE
}
## the RHS must has length 1, is available in the data frame and is numeric
if (length(RHS) > 1L) {
warning("only one RHS variable allowed for simple linear regression; ignoring all but the 1st variable provided!")
RHS <- RHS[1L]
}
x <- dat[[RHS]]
if (is.null(x)) stop(sprintf("RHS variable '%s' not found in 'dat'!", RHS))
if (!is.numeric(x) || is.factor(x)) {
stop("RHS variable must be 'numeric' for simple linear regression!")
}
x < as.numeric(x) ## just in case that `x` is an integer
## check LHS variables
nested <- match(RHS, LHS, nomatch = 0L)
if (nested > 0L) {
warning(sprintf("RHS variable '%s' found in LHS variables; removing it from LHS", RHS))
LHS <- LHS[-nested]
}
if (length(LHS) == 0L) stop("no usable LHS variables found!")
missed <- !(LHS %in% names(dat))
if (any(missed)) {
warning(sprintf("LHS variables '%s' not found in 'dat'; removing them from LHS", toString(LHS[missed])))
LHS <- LHS[!missed]
}
if (length(LHS) == 0L) stop("no usable LHS variables found!")
Y <- dat[LHS]
invalid_LHS <- vapply(Y, is.factor, FALSE) | (!vapply(Y, is.numeric, FALSE))
if (any(invalid_LHS)) {
warning(sprintf("LHS variables '%s' are non-numeric or factors; removing them from LHS", toString(LHS[invalid_LHS])))
Y <- Y[!invalid_LHS]
}
if (length(Y) == 0L) stop("no usable LHS variables found!")
Y <- lapply(Y, as.numeric) ## just in case that we have integer variables in Y
## check for exsitence of NA, NaN, Inf and -Inf and drop them?
## use Rcpp
group_by_simpleLM_cpp(Y, x, grp, grp_levels, grp_unsorted)
}
*/
来源:https://stackoverflow.com/questions/51996021/fast-group-by-simple-linear-regression