Cubic spline method for longitudinal series data?

前端 未结 1 781
忘了有多久
忘了有多久 2021-02-15 23:28

I have a serial data formatted as follows:

time    milk    Animal_ID
30      25.6    1
31      27.2    1
32      24.4    1
33      17.4    1
34      33.6    1
35         


        
相关标签:
1条回答
  • 2021-02-16 00:16

    I would suggest you use mgcv package. This is one of the recommended R packages, performing a class of models called generalized additive mixed models. You can simply load it by library(mgcv). This is a very powerful library, which can handle from the simplest linear regression model, to generalized linear models, to additive models, to generalized additive models, as well as models with mixed effects (fixed effects + random effects). You can list all (exported) functions of mgcv via

    ls("package:mgcv")
    

    And you can see there are many of them.

    For your specific data and problem, you may use a model with formula:

    model <- milk ~ s(time, bs = 'cr', k = 100) + s(Animal_ID, bs = 're')
    

    In mgcv, s() is a setup for smooth functions, represented by spline basis implied by bs. "cr" is the cubic spline basis, which is exactly what you want. k is the number of knots. It should be chosen depending on the number of unique values of variable time in your data set. If you set k to exactly this number, you end up with a smoothing spline; while any value smaller than that means a regression spline. However, both will be penalized (if you know what penalization mean). I read your data in:

    dat <- na.omit(read.csv("data.txt", header = TRUE))  ## I saved you data into file "data.txt"
    dat$Animal_ID <- factor(dat$Animal_ID)
    nrow(dat)  ## 12624 observations
    length(unique(dat$time))  ## 157 unique time points
    length(ID <- levels(dat$Animal_ID))  ## 355 cows
    

    There are 157 unique values, so I reckon k = 100 is possibly appropriate.

    For Animal_ID (coerced as a factor), we need a model term for random effect. "re" is a special class for i.i.d random effect. It is passed to bs for some internal matrix construction reason (so this is not a smooth function!).

    Now to fit a GAM model, you can call the legacy gam or the constantly developing bam (gam for big data). I think you will use the latter. They have the same calling convention similar to lm and glm. For example, you can do:

    fit <- bam(model, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
    

    As you can see, bam allows multi-core parallel computation via nthreads. While discrete is a newly developed feature which can speed up matrix formation.

    Since you are dealing with time series data, finally you might consider some temporal autocorrelation. mgcv allows configuration of AR1 correlation, whose correlation coefficient is passed by bam argument rho. However, you need an extra index AR_start to tell mgcv how the time series breaks up into pieces. For example, when reaching a different Animal_ID, AR_start get a TRUE to indicate a new segment of time series. See ?bam for details.

    mgcv also provides

    1. summary.gam function for model summary
    2. gam.check for basic model checking
    3. plot.gam function for plotting individual terms
    4. predict.gam (or predict.bam) for prediction on new data.

    For example, the summary of the above suggested model is:

    > summary(fit)
    
    Family: gaussian 
    Link function: identity 
    
    Formula:
    milk ~ s(time, bs = "cr", k = 100) + s(Animal_ID, bs = "re")
    
    Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  26.1950     0.2704   96.89   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Approximate significance of smooth terms:
                    edf Ref.df       F  p-value    
    s(time)       10.81  13.67   5.908 1.99e-11 ***
    s(Animal_ID) 351.43 354.00 136.449  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    R-sq.(adj) =  0.805   Deviance explained = 81.1%
    fREML =  29643  Scale est. = 5.5681    n = 12624
    

    The edf (effective degree of freedom) may be thought of as a measure of the degree of non-linearity. So we put in k = 100, while ending up with edf = 10.81. This suggest that the spline s(time) has been heavily penalized. You can view the what s(time) looks like by:

    plot.gam(fit, page = 1)
    

    Note that the random effect s(Animal_ID) also has a "smooth", that is an cow-specific constant. For random effects, a Gaussian QQ plot will be returned.

    The diagnostic figures returned by

    invisible(gam.check(fit))
    

    looks OK, so I think the model is acceptable (I am not offering you model selection, so think up a better model if you think there is).

    If you want to make prediction for Animal_ID = 26, you may do

    newd <- data.frame(time = 1:150, Animal_ID = 26)
    oo <- predict.gam(fit, newd, type = `link`, se.fit = TRUE)
    

    Note that

    • You need to include both variables in newd (otherwise mgcv complains missing variable)
    • since you have only one spline smooth s(time), and the random effect term s(Animal_ID) is a constant per Animal_ID. so it is OK to use type = 'link' for individual prediction. By the way, type = 'terms' is slower than type = 'link'.

    If you want to make prediction for more than one cows, try something like this:

    pred.ID <- ID[1:10]  ## predict first 10 cows
    newd <- data.frame (time = rep (1:150, times = n), Animal_ID = factor (rep (pred.ID, each = 150)))
    oo <- predict.bam (fit, newd, type = "link", se.fit = TRUE)
    

    Note that

    • I have used predict.bam here, as now we have 150 * 10 = 1500 data points to predict. Plus: we require se.fit = TRUE. This is rather expensive, so use predict.bam is faster than predict.gam. Particularly, if you have fitted your model using bam(..., discrete = TRUE), you can have predict.bam(..., discrete = TRUE). Prediction process goes through the same matrix formation steps as in model fitting (see ?smoothCon used in model fitting and ?PredictMat used in prediction, if you are keen to know more internal structure of mgcv.)
    • I specified Animal_ID as factors, because this is a random effect.

    For more on mgcv, you can refer to library manual. Check specially ?mgcv, ?gam, ?bam ?s.


    Final update

    Though I said that I will not help you with model section, but I think this model is better (it gives higher adj-Rsquared) and is also more reasonable in sense:

    model <- milk ~ s(time, bs = 'cr', k = 20) + s(Animal_ID, bs = 're') + s(Animal_ID, time, bs = 're')
    

    The last term is imposing a random slop. This implies that we are assuming that each individual cow has different growing/reducing pattern of milk production. This is a more sensible assumption in your problem. The earlier model with only random intercept is not sufficient. After adding this random slop, the smooth term s(time) looks smoother. This is a good sign not a bad sign, because we want some simple explanation for s(time), don't we? Compare the s(time) you get from both models, and see what you discover.

    I have also reduced k = 100 to k = 20. As we saw in previous fit, the edf for this term is about 10, so k = 20 is pretty sufficient.

    0 讨论(0)
提交回复
热议问题