How to correctly `dput` a fitted linear model (by `lm`) to an ASCII file and recreate it later?

Submitted by 梦想与她 on 2019-12-08 22:29:39

Question


I want to persist an lm object to a file and reload it in another program. I know I can do this by writing/reading a binary file via saveRDS/readRDS, but I'd like an ASCII file instead of a binary one. More generally, I'd like to know why my idioms for reading dput output back in are not behaving as I expect.

Below are examples of making a simple fit, and successful and unsuccessful recreations of the model:

dat_train <- data.frame(x=1:4, z=c(1, 2.1, 2.9, 4))
fit <- lm(z ~ x, dat_train)
rm(dat_train) # Just to make sure fit does not depend on `dat_train` existing

dat_score <- data.frame(x=c(1.5, 3.5))

## This works (of course)
predict(fit, dat_score)
#    1    2 
# 1.52 3.48

Saving to binary file works:

## http://stackoverflow.com/questions/5118074/reusing-a-model-built-in-r
saveRDS(fit, "model.RDS")
fit2 <- readRDS("model.RDS")
predict(fit2, dat_score)
#    1    2 
# 1.52 3.48

So does this (dput within the R session rather than to a file):

fit2 <- eval(dput(fit))
predict(fit2, dat_score)
#    1    2 
# 1.52 3.48

But if I persist the model to a file on disk, I cannot figure out how to get it back into a usable shape:

dput(fit, file = "model.R")
fit3 <- source("model.R")$value

# Error in is.data.frame(data): object 'dat_train' not found

predict(fit3, dat_score)
# Error in predict(fit3, dat_score): object 'fit3' not found

Trying to be explicit with the eval does not work either:

## http://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string
dput(fit, file="model.R")
fit4 <- eval(parse(text=paste(readLines("model.R"), collapse=" ")))

# Error in is.data.frame(data): object 'dat_train' not found

predict(fit4, dat_score)
# Error in predict(fit4, dat_score): object 'fit4' not found

In both cases above, I expect fit3 and fit4 to work, but neither is rebuilt into an lm object that I can use with predict().

Can anyone advise me on how to persist a model to a file in an ASCII structure(...) form, and then read it back in as an lm object I can use in predict()? And why are my current methods not working?


Answer 1:


Step 1:

You need to control the deparsing options:

dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R") 

You can read more on all possible options in ?.deparseOpts.


The "quoteExpressions" option wraps all calls / expressions / language objects in quote(), so that they are not evaluated when you later read the file back in. Note:

  • source parses the file and then evaluates it;
  • the call field in your fitted "lm" object is a call:

    fit$call
    # lm(formula = z ~ x, data = dat_train)
    

So, without "quoteExpressions", R will try to evaluate the lm call when the file is sourced. Evaluating it means fitting a linear model again, so R will look for dat_train, which does not exist in your new R session.
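The effect is easy to reproduce on a toy object. The following is a minimal sketch, not part of the original answer: `obj` is a made-up list holding an unevaluated call, standing in for `fit$call`, and `deparse` is used in place of `dput` so that no file is needed.

```r
# Hypothetical stand-in for an "lm" object: a list holding an unevaluated call.
obj <- list(call = quote(sum(1, 2)))

# Default deparse options: the call is written out bare, as sum(1, 2) ...
plain_text <- paste(deparse(obj), collapse = "\n")
# ... so evaluating the text runs the call and stores its result instead.
plain <- eval(parse(text = plain_text))

# With "quoteExpressions" the call is written as quote(sum(1, 2)) ...
quoted_text <- paste(
  deparse(obj, control = c("quoteExpressions", "showAttributes")),
  collapse = "\n"
)
# ... so evaluating the text gives back the call itself.
quoted <- eval(parse(text = quoted_text))

is.call(plain$call)   # FALSE: the call was evaluated to a number
is.call(quoted$call)  # TRUE: the call survived the round trip
```

With a real "lm" object the bare call would be the lm(...) call itself, which is why sourcing the file tries to refit the model.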


The "showAttributes" option is also mandatory, because an "lm" object has a class attribute. You certainly don't want to discard it and export only a plain "list" object, right? Moreover, many elements of an "lm" object, such as model (the model frame), qr (the compact QR matrix) and terms (terms info), have attributes of their own. You want to keep them all.


If you don't set control, the default setting (at the time this answer was written):

control = c("keepNA", "keepInteger", "showAttributes")

will be used. Newer versions of R also include "niceNames" in the default. Either way, there is no "quoteExpressions", so you will get into trouble.

You can also specify "keepInteger" and "keepNA", but I don't see the need for them with an "lm" object.
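Since the default differs between R versions, it may be worth checking it in your own session. A small sketch, assuming only that `dput` is an ordinary R function whose `control` default can be read from its formals:

```r
# The default is stored in dput's formals; evaluating it yields a character vector.
default_control <- eval(formals(dput)$control)
print(default_control)

# Whatever else the default contains, "quoteExpressions" is not part of it.
"quoteExpressions" %in% default_control  # FALSE
```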

------

Step 2:

The above step will get source working correctly. You can recover your model:

fit1 <- source("model.R")$value

However, it is not yet ready for generic functions like summary and predict to work. Why?

The critical issue is that the terms object in fit1 is not really a "terms" object, but only a formula (in fact it is not even a formula, just a "language" object without the "formula" class!). Compare fit$terms and fit1$terms and you will see the difference. Don't be surprised; we set "quoteExpressions" earlier. While that is definitely helpful for preventing evaluation of call, it has a side effect on terms. So we need to reconstruct terms as best we can.

Fortunately, it is sufficient to do:

fit1$terms <- terms.formula(fit1$terms)

Though this still does not recover all the information in fit$terms (variable classes, for example, are missing), the result is already a valid "terms" object.

Why is a "terms" object critical? Because the model-based generic functions, such as summary and predict, rely on it. The details are quite technical, so I will stop here.
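To make the difference concrete, here is a minimal sketch on a bare formula rather than on a sourced model. `raw` is a hypothetical stand-in for what `fit1$terms` looks like right after `source`, under the assumption (stated above) that it is a plain language object.

```r
# What a quoted formula looks like after a round trip: a bare call, no class.
raw <- quote(z ~ x)
class(raw)              # "call"
inherits(raw, "terms")  # FALSE

# terms.formula() rebuilds the bookkeeping attributes from the formula alone.
tt <- terms.formula(raw)
inherits(tt, "terms")   # TRUE
attr(tt, "term.labels") # "x"

# Attributes that depend on the training data are not recreated.
attr(tt, "dataClasses") # NULL
```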

Once this is done, we can successfully use predict (and summary, too):

predict(fit1)  ## no `newdata` given, using model frame `fit1$model`
#   1    2    3    4 
#1.03 2.01 2.99 3.97 

predict(fit1, dat_score)  ## with `newdata`
#   1    2 
#1.52 3.48 

-------

Concluding remark:

Although I have shown you how to get things to work, I don't recommend doing this in general. An "lm" object is pretty large when the model is fitted to a large dataset: residuals and fitted.values are long vectors, and qr and model are huge matrices / data frames. Keep this in mind before dumping it all to a text file.
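If the requirement is only a text file, and not specifically the structure(...) form, one alternative that this answer does not cover is the `ascii` argument of `saveRDS`. The sketch below is an assumption about what might suit the use case, not something the answer recommends; `model_path` is a hypothetical temporary file.

```r
# Same toy model as in the question.
dat_train <- data.frame(x = 1:4, z = c(1, 2.1, 2.9, 4))
fit <- lm(z ~ x, dat_train)

# ascii = TRUE selects the text serialization format; compress = FALSE keeps
# the file as plain text rather than gzip-compressing it.
model_path <- tempfile(fileext = ".rds")  # hypothetical location
saveRDS(fit, model_path, ascii = TRUE, compress = FALSE)

# The file can be inspected with any text editor or with readLines().
head(readLines(model_path), 3)

# Reading it back needs no repair of $terms or $call.
fit_ascii <- readRDS(model_path)
predict(fit_ascii, data.frame(x = c(1.5, 3.5)))
```

The resulting file is text, but it is R's serialization format, not R code, so it is not meant to be edited by hand.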




Answer 2:


This is an important update!

As mentioned in the previous answer, the most challenging bit is to recover $terms as best we can. The suggested method using terms.formula works for the OP's example, but not for the following model with bs() and poly():

dat <- data.frame(x1 = runif(20), x2 = runif(20), x3 = runif(20), y = rnorm(20))
library(splines)
fit <- lm(y ~ bs(x1, df = 3) + poly(x2, degree = 3) + x3, data = dat)
rm(dat)

If we follow the previous answer:

dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R") 
fit1 <- source("model.R")$value
fit1$terms <- terms.formula(fit1$terms)

We will see that summary.lm and anova.lm work correctly, but not predict.lm:

predict(fit1, newdata = data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5))

Error in bs(x1, df = 3) : could not find function "bs"

This is because the ".Environment" attribute of $terms is missing. We need to restore it:

environment(fit1$terms) <- .GlobalEnv

Running the predict call above again now gives a different error:

Error in poly(x2, degree = 3) : 'degree' must be less than number of unique points

This is because we are missing the "predvars" attribute, which is needed for safe / correct prediction with bs() and poly().
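To see what "predvars" carries, the following sketch fits a smaller, hypothetical model with only a poly() term and compares the terms of the fitted model with terms rebuilt from the formula alone. The data and the names `fit_poly`, `pv` and `rebuilt` are made up for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the toy data reproducible
dat <- data.frame(x2 = runif(20), y = rnorm(20))
fit_poly <- lm(y ~ poly(x2, degree = 3), data = dat)

# In the fitted model, "predvars" records the poly() call together with the
# coefficients of the orthogonal basis computed from the training data.
pv <- attr(fit_poly$terms, "predvars")
names(pv[[3]])  # includes "coefs"

# A terms object rebuilt from the formula alone has no such attribute, so the
# basis would have to be recomputed from whatever `newdata` is supplied.
rebuilt <- terms.formula(formula(fit_poly))
is.null(attr(rebuilt, "predvars"))  # TRUE
```

This is consistent with the error above: a single row of new data does not have enough unique points to build a degree-3 basis from scratch.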

A remedy is to dput this special attribute as well:

dput(attr(fit$terms, "predvars"), control = "quoteExpressions", file = "predvars.R")

then read it back and attach it:

attr(fit1$terms, "predvars") <- source("predvars.R")$value

Now running predict works correctly.

Note that the "dataClasses" attribute of $terms is also missing, but this does not seem to cause problems for any of the generic functions.



Source: https://stackoverflow.com/questions/41645120/how-to-correctly-dput-a-fitted-linear-model-by-lm-to-an-ascii-file-and-rec
