问题
I want to persist a lm
object to a file and reload it into another program. I know I can do this by writing/reading a binary file via saveRDS
/readRDS
, but I'd like to have an ASCII file instead of a binary file. At a more general level, I'd like to know why my idioms for reading in dput
output in general is not behaving as I'd expect.
Below are examples of making a simple fit, and successful and unsuccessful recreations of the model:
dat_train <- data.frame(x=1:4, z=c(1, 2.1, 2.9, 4))
fit <- lm(z ~ x, dat_train)
rm(dat_train) # Just to make sure fit is not dependent upon `dat_train existence`
dat_score <- data.frame(x=c(1.5, 3.5))
## This works (of course)
predict(fit, dat_score)
# 1 2
# 1.52 3.48
Saving to binary file works:
## http://stackoverflow.com/questions/5118074/reusing-a-model-built-in-r
saveRDS(fit, "model.RDS")
fit2 <- readRDS("model.RDS")
predict(fit2, dat_score)
# 1 2
# 1.52 3.48
So does this (dput
it in the R session not to a file):
fit2 <- eval(dput(fit))
predict(fit2, dat_score)
# 1 2
# 1.52 3.48
But if I persist file to disk, I cannot figure out how to get back into normal shape:
dput(fit, file = "model.R")
fit3 <- source("model.R")$value
# Error in is.data.frame(data): object 'dat_train' not found
predict(fit3, dat_score)
# Error in predict(fit3, dat_score): object 'fit3' not found
Trying to be explicit with the eval
does not work either:
## http://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string
dput(fit, file="model.R")
fit4 <- eval(parse(text=paste(readLines("model.R"), collapse=" ")))
# Error in is.data.frame(data): object 'dat_train' not found
predict(fit4, dat_score)
# Error in predict(fit4, dat_score): object 'fit4' not found
In both cases above, I expect fit3
and fit4
to both work, but they don't recompile into a lm
object that I can use with predict()
.
Can anyone advise me on how I can persist a model to a file with a structure(...)
ASCII-like structure, and then re-read it back in as a lm
object I can use in predict()
? And why my current methods are not working?
回答1:
Step 1:
You need to control de-parsing options:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
You can read more on all possible options in ?.deparseOpts
.
The "quoteExpressions" wraps all calls / expressions / languages with quote
, so that they are not evaluated when you later re-parse it. Note:
source
is doing parsing;call
field in your fitted "lm" object is a call:fit$call # lm(formula = z ~ x, data = dat_train)
So, without "quoteExpressions", R will try to evaluate lm
call during parsing. And if we evaluate it, it is fitting a linear model, and R will aim to find dat_train
, which will not exist in your new R session.
The "showAttributes" is another mandatory option, as "lm" object has class attributes. You certainly don't want to discard all class attributes and only export a plain "list" object, right? Moreover, many elements in a "lm" object, like model
(the model frame), qr
(the compact QR matrix) and terms
(terms info), etc all have attributes. You want to keep them all.
If you don't set control
, the default setting with:
control = c("keepNA", "keepInteger", "showAttributes")
will be used. As you can see, there is no "quoteExpressions", so you will get into trouble.
You can also specify "keepInteger" and "keepNA", but I don't see the need for "lm" object.
------
Step 2:
The above step will get source
working correctly. You can recover your model:
fit1 <- source("model.R")$value
However, it is not yet ready for generic functions like summary
and predict
to work. Why?
The critical issue is the terms
object in fit1
is not really a "terms" object, but only a formula (it is even not a formula, but only a "language" object without "formula" class!). Just compare fit$terms
and fit1$terms
, and you will see the difference. Don't be surprised; we've set "quoteExpressions" earlier. While that is definitely helpful to prevent evaluation of call
, it has side-effect for terms
. So we need to reconstruct terms
as best as we can.
Fortunately, it is sufficient to do:
fit1$terms <- terms.formula(fit1$terms)
Though this still does not recover all information in fit$terms
(like variable classes are missing), it is readily a valid "terms" object.
Why is a "terms" object critical? Because all generic functions rely on it. You may not need to know more on this, as it is really technical, so I will stop here.
Once this is done, we can successfully use predict
(and summary
, too):
predict(fit1) ## no `newdata` given, using model frame `fit1$model`
# 1 2 3 4
#1.03 2.01 2.99 3.97
predict(fit1, dat_score) ## with `newdata`
# 1 2
#1.52 3.48
-------
Conclusion remark:
Although I have shown you how to get things work, I don't really recommend you doing this in general. An "lm" object will be pretty large when you fit a model to a large dataset, for example, residuals
, fitted.values
are long vectors, and qr
and model
are huge matrices / data frames. So think about this.
回答2:
This is an important update!
As mentioned in the previous answer, the most challenging bit is to recover $terms
as best as we can. The suggested method using terms.formula
works for OP's example, but not for the following with bs()
and poly()
:
dat <- data.frame(x1 = runif(20), x2 = runif(20), x3 = runif(20), y = rnorm(20))
library(splines)
fit <- lm(y ~ bs(x1, df = 3) + poly(x2, degree = 3) + x3, data = dat)
rm(dat)
If we follow the previous answer:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
fit1 <- source("model.R")$value
fit1$terms <- terms.formula(fit1$terms)
We will see that summary.lm
and anova.lm
work correctly, but not predict.lm
:
predict(fit1, newdata = data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5))
Error in bs(x1, df = 3) : could not find function "bs"
This is because ".Environment"
attribute of $terms
is missing. We need
environment(fit1$terms) <- .GlobalEnv
Now run above predict
again we see a different error:
Error in poly(x2, degree = 3) :
'degree' must be less than number of unique points
This is because we are missing "predvars"
attributes for safe / correct prediction of bs()
and poly()
.
A remedy is that we need to dput
such special attribute additionally:
dput(attr(fit$terms, "predvars"), control = "quoteExpressions", file = "predvars.R")
then read and add it
attr(fit1$terms, "predvars") <- source("predvars.R")$value
Now running predict
works correctly.
Note that "dataClass"
attribute of $terms
is also missing, but this does not seem to cause any problem for any generic functions.
来源:https://stackoverflow.com/questions/41645120/how-to-correctly-dput-a-fitted-linear-model-by-lm-to-an-ascii-file-and-rec