R data.table loop subset by factor and do lm()

前端 未结 3 444
猫巷女王i
猫巷女王i 2021-01-13 12:03

I am trying to create a function or even just work out how to run a loop using data.table syntax where I can subset the table by factor, in this case the id variable, then r

3条回答
  •  不知归路
    2021-01-13 12:46

    I know you want to do this with data tables, and if you want some specific aspect of the fit, like the coefficients, then @MartinBel's approach is a good one.

    On the other hand, if you want to store the fits themselves, lapply(...) might be a better option:

    set.seed(1)
    df <- data.frame(id = letters[1:3], 
                     cyl = sample(c("a","b","c"), 30, replace = TRUE),
                     factor = sample(c(TRUE, FALSE), 30, replace = TRUE),   
                     hp = sample(c(20:50), 30, replace = TRUE))
    dt <- data.table(df,key="id")
    
    fits <- lapply(unique(df$id),
                   function(z)lm(hp~cyl+factor, data=dt[J(z),], y=T))
    # coefficients
    sapply(fits,coef)
    #                   [,1]      [,2]          [,3]
    # (Intercept)  44.117647 35.000000  3.933333e+01
    # cylb         -6.117647 -6.321429 -1.266667e+01
    # cylc        -13.176471  3.821429 -7.833333e+00
    # factorTRUE    1.176471  5.535714  2.325797e-15
    
    # predicted values
    sapply(fits,predict)
    #        [,1]     [,2]     [,3]
    # 1  45.29412 28.67857 26.66667
    # 2  32.11765 35.00000 31.50000
    # 3  30.94118 34.21429 26.66667
    # ...
    
    # residuals
    sapply(fits,residuals)
    #           [,1]        [,2]      [,3]
    # 1    2.7058824   0.3214286  7.333333
    # 2   -2.1176471   5.0000000 -4.500000
    # 3    3.0588235   8.7857143 -4.666667
    # ...
    
    # se and r-sq
    sapply(fits, function(x)c(se=summary(x)$sigma, rsq=summary(x)$r.squared))
    #         [,1]      [,2]      [,3]
    # se  7.923655 8.6358196 6.4592741
    # rsq 0.463076 0.3069017 0.4957024
    
    # Q-Q plots
    par(mfrow=c(1,length(fits)))
    lapply(fits,plot,2)
    

    Note the use of key="id" in the call to data.table(...), and the use if dt[J(z)] to subset the data table. This really isn't necessary unless dt is enormous.

提交回复
热议问题