Question
I am currently computing glm models on a huge data set. Both glm and even speedglm take days to compute.
I currently have around 3M observations and 400 variables in total, only some of which are used in the regression. The regression uses 4 integer independent variables (iv1, iv2, iv3, iv4), 1 binary independent variable as a factor (iv5), and the interaction term (x * y, where x is an integer and y is a binary dummy variable as a factor). Finally, I have fixed effects for years (ff1) and company ids (ff2). I have 15 years and 3000 companies. I have introduced the fixed effects by adding them as factors. I observe that the 3000 company fixed effects in particular make the computation very slow in stats glm and also in speedglm.
I therefore decided to try Microsoft R's rxGlm (RevoScaleR), as it can use more threads and processor cores. Indeed, the analysis is a lot faster. I also compared the results for a sub-sample to those of standard glm, and they matched.
I used the following function:
mod1 <- rxGlm(formula = dv ~
                iv1 + iv2 + iv3 +
                iv4 + iv5 +
                x * y +
                ff1 + ff2,
              family = binomial(link = "probit"), data = dat,
              dropFirst = TRUE, dropMain = FALSE, covCoef = TRUE, cube = FALSE)
However, I am facing a problem when trying to plot the interaction term using the effects package. When I call the following function, I receive this error:
> plot(effect("x*y", mod1))
Error in terms.default(model) : no terms component nor attribute
I assume the problem is that rxGlm does not store the data needed to plot the interaction. I believe so because the rxGlm object is a lot smaller than the glm object (80 MB vs. several GB), so it likely contains less data.
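One way to check this guess on an ordinary glm fit is to look at which components of the fitted object hold per-observation data, and whether the terms component that the error message mentions is present. The following is only a sketch on simulated data (the data frame d and its columns are made up for illustration); it uses base R and stats only and does not touch RevoScaleR.

```r
# Simulated data, purely illustrative (not the data from the question)
set.seed(1)
n <- 5000
d <- data.frame(dv = rbinom(n, 1, 0.5),
                x  = sample(1:100, n, replace = TRUE),
                y  = factor(sample(0:1, n, replace = TRUE)))

fit <- glm(dv ~ x * y, family = binomial(link = "probit"), data = d)

# Size of each component in bytes, largest first; the per-observation
# pieces (model frame, QR decomposition, residuals, ...) dominate
sizes <- sort(sapply(fit, object.size), decreasing = TRUE)
print(head(sizes))

# A regular glm fit carries a terms component, which terms() looks for
has_terms <- !is.null(fit$terms)
print(has_terms)
```

If a fitted object lacks components such as terms or the model frame, functions that rebuild the design matrix from the fit have nothing to work with, which would be consistent with the error above.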
I then tried to convert the rxGlm object to glm via as.glm(). Still, the effect() call does not yield a result and produces the following error messages:
Error in dnorm(eta) :
Non-numerical argument for mathematical function
In addition: Warning messages:
1: In model.matrix.default(mod, data = list(dv = c(1L, 2L, :
variable 'x for y' is absent, its contrast will be ignored
If I now compare an original glm to the "converted glm", I find that the converted glm contains far fewer items. E.g., it does not contain effects, and for contrasts it states only contr.treatment for each variable.
I am now looking primarily for a way to convert the rxGlm output object into a format that I can use with the effect() function. If there is no way to do so, how can I get an interaction plot using functions within the RevoScaleR package, e.g., rxLinePlot()? rxLinePlot() also plots reasonably quickly; however, I have not yet found a way to get typical interaction effect plots out of it. I want to avoid first fitting the full glm model and then plotting, because this takes very long.
Answer 1:
If you can get the coefficients, can't you just roll your own plot? That would not depend on the size of the data set.
# ex. data
n <- 2000
dat <- data.frame(dv  = sample(0:1, size = n, replace = TRUE),
                  iv1 = sample(1:10, size = n, replace = TRUE),
                  iv2 = sample(1:10, size = n, replace = TRUE),
                  iv3 = sample(1:10, size = n, replace = TRUE),
                  iv4 = sample(0:10, size = n, replace = TRUE),
                  iv5 = as.factor(sample(0:1, size = n, replace = TRUE)),
                  x   = sample(1:100, size = n, replace = TRUE),
                  y   = as.factor(sample(0:1, size = n, replace = TRUE)),
                  ff1 = as.factor(sample(1:15, size = n, replace = TRUE)),
                  ff2 = as.factor(sample(1:100, size = n, replace = TRUE))
)
mod1 <- glm(formula = dv ~
              iv1 + iv2 + iv3 +
              iv4 + iv5 +
              x * y +
              ff1 + ff2,
            family = binomial(link = "probit"), data = dat)
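# coefficients for x, y and their interaction
x1 <- coef(mod1)['x']
y1 <- coef(mod1)['y1']
xy <- coef(mod1)['x:y1']
x <- 1:100
# contribution to the linear predictor (probit scale) for each level of y;
# the intercept and the other covariates are left out
a <- x1*x                # y = 0
b <- x1*x + y1 + xy*x    # y = 1
plot(a~x, type = 'l', col = 'red', xlim = c(0,max(x)), ylim = range(c(a, b)))
lines(b~x, col = 'blue')
# lty is needed so the legend draws line samples
legend('topright', c('y = 0', 'y = 1'), col = c('red', 'blue'), lty = 1)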
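The two lines in the plot are on the probit (linear predictor) scale. If predicted probabilities are preferred, the same idea can be carried through pnorm(). The following is a minimal sketch, assuming only a named coefficient vector is available. The helper interaction_probs and the coefficient values are hypothetical and made up for illustration; the coefficient names default to the ones glm produces and are passed as arguments because another fitting function may label them differently. All other covariates are assumed to sit at 0 or their reference level, and the offset argument can shift the baseline.

```r
# Hypothetical helper: predicted probabilities for a probit model with an
# x * y interaction, built from a named coefficient vector only.
# Assumption: all other covariates sit at 0 / their reference level;
# pass `offset` to shift the linear predictor to another baseline.
interaction_probs <- function(cf, x, x_name = "x", y_name = "y1",
                              xy_name = "x:y1", offset = 0) {
  base <- cf[["(Intercept)"]] + offset
  eta0 <- base + cf[[x_name]] * x                                   # y = 0
  eta1 <- base + cf[[y_name]] + (cf[[x_name]] + cf[[xy_name]]) * x  # y = 1
  data.frame(x = x, p_y0 = pnorm(eta0), p_y1 = pnorm(eta1))
}

# Illustrative coefficients (made up, not estimates from any model)
cf <- c("(Intercept)" = -0.5, x = 0.01, y1 = 0.2, "x:y1" = -0.005)
pr <- interaction_probs(cf, x = 1:100)

plot(p_y0 ~ x, data = pr, type = "l", col = "red",
     ylim = range(pr$p_y0, pr$p_y1), ylab = "Pr(dv = 1)")
lines(p_y1 ~ x, data = pr, col = "blue")
legend("topleft", c("y = 0", "y = 1"), col = c("red", "blue"), lty = 1)
```

With a fitted model, cf would be replaced by coef(mod1). Because the probit link is nonlinear, the gap between the two curves depends on the chosen baseline, so the offset should reflect covariate values that are meaningful for the data.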
Source: https://stackoverflow.com/questions/47080343/how-to-plot-interaction-effects-from-extremely-large-data-sets-esp-from-rxglm