I am currenlty computing glm
models off a huge data data set. Both glm
and even speedglm
take days to compute.
I currently have around 3M observations and altogether 400 variables, only some of which are used for the regression. In my regression I use 4 integer independent variables (iv1
, iv2
, iv3
, iv4
), 1 binary independent variable as factor (iv5
), the interaction term (x * y
, where x
is an integer and y
is a binary dummy variable as factor). Finally, I have fixed effects along years ff1
and company ids ff2
. I have 15 years and 3000 conmpanies. I have introduced the fixed effects by adding them as factors. I observe that especially the 3000 company fixed effects make the computation very slow in stats
glm
and also speedglm
.
I therefore decided to try Microsoft R's rxGlm
(RevoScaleR), as this can address more threads and processor cores. Indeed, the speed of analysis is a lot faster. Also, I compared the results for a sub-sample to the one of standard glm
and they matched.
I used the following function:
mod1 <- rxGlm(formula = dv ~
iv1 + iv2 + iv3+
iv4 + iv5 +
x * y +
ff1 + ff2,
family = binomial(link = "probit"), data = dat,
dropFirst = TRUE, dropMain = FALSE, covCoef = TRUE, cube = FALSE)
However, I am facing a problem when trying to plot the interaction term using the effects
package. Upon calling the following function, I am receiving the following error:
> plot(effect("x*y", mod1))
Error in terms.default(model) : no terms component nor attribute
I assume the problem is that rxGlm
does not store the data needed to plot the interaction. I believe so because the rxGlm
object is a lot smaller than the glm
oject, hence likely containing less data (80 MB vs several GB).
I now tried to convert the rxGlm
object to glm
via as.glm()
. Still, the effects()
call does not yield a result and results in the following error messages:
Error in dnorm(eta) :
Non-numerical argument for mathematical function
In addition: Warning messages:
1: In model.matrix.default(mod, data = list(dv = c(1L, 2L, :
variable 'x for y' is absent, its contrast will be ignored
If I now compare an original glm to the "converted glm", I find that the converted glm contains a lot less items. E.g., it does not contain effects
and for contrasts it states only contr.treatment
for each variable.
I am now looking primarily for a way to transpose the rxGlm
output object in a format so I can use if with the effect()
function. If there is no way to do so, how can I get an interaction plot using functions within the RevoScaleR
package, e.g., rxLinePlot()
? rxLinePlot()
also plots reasonably quick, however, I have not yet found a way how to get typical interaction effect plots out of it. I want to avoid first calculating the full glm
model and then plot because this takes very long.
If you can get the coefficients can't you just roll your own? This would not be a dataset size issue
# ex. data
n = 2000
dat <- data.frame( dv = sample(0:1, size = n, rep = TRUE),
iv1 = sample(1:10, size = n, rep = TRUE),
iv2 = sample(1:10, size = n, rep = TRUE),
iv3 = sample(1:10, size = n, rep = TRUE),
iv4 = sample(0:10, size = n, rep = TRUE),
iv5 = as.factor(sample(0:1, size = n, rep = TRUE)),
x = sample(1:100, size = n, rep = TRUE),
y = as.factor(sample(0:1, size = n, rep = TRUE)),
ff1 = as.factor(sample(1:15, size = n, rep = TRUE)),
ff2 = as.factor(sample(1:100, size = n, rep = TRUE))
)
mod1 <- glm(formula = dv ~
iv1 + iv2 + iv3+
iv4 + iv5 +
x * y +
ff1 + ff2,
family = binomial(link = "probit"), data = dat)
# coefficients for x, y and their interaction
x1 <- coef(mod1)['x']
y1 <- coef(mod1)['y1']
xy <- coef(mod1)['x:y1']
x <- 1:100
a <- x1*x
b <- x1*x + y1 + xy*x
plot(a~x, type= 'line', col = 'red', xlim = c(0,max(x)), ylim = range(c(a, b)))
lines(b~x, col = 'blue')
legend('topright', c('y = 0', 'y = 1'), col = c('red', 'blue'))
来源:https://stackoverflow.com/questions/47080343/how-to-plot-interaction-effects-from-extremely-large-data-sets-esp-from-rxglm