Fast linear regression by group

I have 500K users and I need to compute a linear regression (with intercept) for each of them.

Each user has around 30 records.

I tried with dplyr and lm and this is way too slow. Around 2 sec by user.

  df%>%                       
      group_by(user_id, add =  FALSE) %>%
      do(lm = lm(Y ~ x, data = .)) %>%
      mutate(lm_b0 = summary(lm)$coeff[1],
             lm_b1 = summary(lm)$coeff[2]) %>%
      select(user_id, lm_b0, lm_b1) %>%
      ungroup()
    )

I tried to use lm.fit which is known to be faster but it doesn't seem to be compatible with dplyr.

Is there a fast way to do a linear regression by group?

You can just use the basic formulas for calculating slope and regression. lm does a lot of unnecessary things if all you care about are those two numbers. Here I use data.table for the aggregation, but you could do it in base R as well (or dplyr):

system.time(
  res <- DT[, 
    {
      ux <- mean(x)
      uy <- mean(y)
      slope <- sum((x - ux) * (y - uy)) / sum((x - ux) ^ 2)
      list(slope=slope, intercept=uy - slope * ux)
    }, by=user.id
  ]
)

Produces for 500K users ~30 obs each (in seconds):

 user  system elapsed 
 7.35    0.00    7.36

Or about 15 microseconds per user. And to confirm this is working as expected:

> summary(DT[user.id==89663, lm(y ~ x)])$coefficients
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.1965844  0.2927617 0.6714826 0.5065868
x           0.2021210  0.5429594 0.3722580 0.7120808
> res[user.id == 89663]
   user.id    slope intercept
1:   89663 0.202121 0.1965844

Data:

set.seed(1)
users <- 5e5
records <- 30
x <- runif(users * records)
DT <- data.table(
  x=x, y=x + runif(users * records) * 4 - 2, 
  user.id=sample(users, users * records, replace=T)
)

If all you want is coefficients, I'd just use user_id as a factor in the regression. Using @miles2know's simulated data code (though renaming since an object other than exp() sharing that name looks weird to me)

dat <- data.frame(id = rep(c("a","b","c"), each = 20),
                  x = rnorm(60,5,1.5),
                  y = rnorm(60,2,.2))

mod = lm(y ~ x:id + id + 0, data = dat)

We fit no global intercept (+ 0) so that the intercept for each id is the id coefficient, and no x by itself, so that the x:id interactions are the slopes for each id:

coef(mod)
#      ida      idb      idc    x:ida    x:idb    x:idc 
# 1.779686 1.893582 1.946069 0.039625 0.033318 0.000353

So, for the a level of id, the ida coefficient, 1.78, is the intercept and the x:ida coefficient, 0.0396, is the slope.

I'll leave the gathering of these coefficients into appropriate columns of a data frame to you...

This solution should be very fast because you're not having to deal with subsets of data frames. It could probably be sped up even more with fastLm or such.

Note on scalability:

I did just try this on @nrussell's simulated full-size data and ran into memory allocation issues. Depending on how much memory you have it may not work in one go, but you could probably do it in batches of user ids. Some combination of his answer and my answer might be the fastest overall---or nrussell's might just be faster---expanding the user id factor into thousands of dummy variables might not be computationally efficient, as I've been waiting more than a couple minutes now for a run on just 5000 user ids.

Update: As pointed out by Dirk, my original approach can be greatly improved upon by specifying x and Y directly rather than using the formula-based interface of fastLm, which incurs (a fairly significant) processing overhead. For comparison, using the original full size data set,

R> system.time({
  dt[,c("lm_b0", "lm_b1") := as.list(
    unname(fastLm(x, Y)$coefficients))
    ,by = "user_id"]
})
#  user  system elapsed 
#55.364   0.014  55.401 
##
R> system.time({
  dt[,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients))
    ,by = "user_id"]
})
#   user  system elapsed 
#356.604   0.047 356.820

this simple change yields roughly a 6.5x speedup.

[Original approach]

There is probably some room for improvement, but the following took about 25 minutes on a Linux VM (2.6 GHz processor), running 64-bit R:

library(data.table)
library(RcppArmadillo)
##
dt[
  ,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients)),
  by=user_id]
##
R> dt[c(1:2, 31:32, 61:62),]
   user_id   x         Y     lm_b0    lm_b1
1:       1 1.0 1674.8316 -202.0066 744.6252
2:       1 1.5  369.8608 -202.0066 744.6252
3:       2 1.0  463.7460 -144.2961 374.1995
4:       2 1.5  412.7422 -144.2961 374.1995
5:       3 1.0  513.0996  217.6442 261.0022
6:       3 1.5 1140.2766  217.6442 261.0022

Data:

dt <- data.table(
  user_id = rep(1:500000,each=30))
##
dt[, x := seq(1, by=.5, length.out=30), by = user_id]
dt[, Y := 1000*runif(1)*x, by = user_id]
dt[, Y := Y + rnorm(
  30, 
  mean = sample(c(-.05,0,0.5)*mean(Y),1), 
  sd = mean(Y)*.25), 
  by = user_id]

miles2know

You might give this a try using data.table like this. I've just created some toy data but I'd imagine data.table would give some improvement. It's quite speedy. But that is quite a large data-set so perhaps benchmark this approach on a smaller sample to see if the speed is a lot better. good luck.


    library(data.table)

    exp <- data.table(id = rep(c("a","b","c"), each = 20), x = rnorm(60,5,1.5), y = rnorm(60,2,.2))
    # edit: it might also help to set a key on id with such a large data-set
    # with the toy example it would make no diff of course
    exp <- setkey(exp,id)
    # the nuts and bolts of the data.table part of the answer
    result <- exp[, as.list(coef(lm(y ~ x))), by=id]
    result
       id (Intercept)            x
    1:  a    2.013548 -0.008175644
    2:  b    2.084167 -0.010023549
    3:  c    1.907410  0.015823088

An example using Rfast.

Assuming a single response and 500K predictor variables.

y <- rnorm(30)
x <- matrnorm(500*1000,30)
system.time( Rfast::univglms(y, x,"normal") )  ## 0.70 seconds

Assuming 500K response variables and a singl predictor variable.

system.time( Rfast::mvbetas(x,y) )  ## 0.60 seconds

Note: The above times will decrease in the nearby future.

来源：https://stackoverflow.com/questions/29803993/fast-linear-regression-by-group

标签

dplyr