rpart: Computational time for categorical vs continuous regressors


You need to profile the code to be sure, but I would be surprised if the timing difference did not come from R having to turn each factor variable into two binary variables as it prepares the model matrix.
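As a toy illustration of that expansion (made-up data, not the question's x.fac), a single factor column becomes a set of indicator columns in the model matrix:

df <- data.frame(f = factor(c("a", "b", "c", "a")))
model.matrix(~ f, df)
# the one factor column f becomes indicator columns fb and fc (plus the intercept)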

Try

Rprof("rpartProfile.Rprof")
rpart( y ~ ., data = x.fac, method = 'anova')
Rprof()

summaryRprof("rpartProfile.Rprof")

and look to see where the time is being spent, which I have now done:

> summaryRprof("rpartProfile.Rprof")
$by.self
                          self.time self.pct total.time total.pct
"[[<-.data.frame"            786.46    72.45     786.56     72.46
"rpart.matrix"               294.26    27.11    1081.78     99.66
"model.frame.default"          1.04     0.10       3.00      0.28
"terms.formula"                0.96     0.09       0.96      0.09
"as.list.data.frame"           0.46     0.04       0.46      0.04
"makepredictcall.default"      0.46     0.04       0.46      0.04
"rpart"                        0.44     0.04    1085.38     99.99
"[[.data.frame"                0.16     0.01       0.42      0.04
"<Anonymous>"                  0.16     0.01       0.18      0.02
"match"                        0.14     0.01       0.22      0.02
"print"                        0.12     0.01       0.12      0.01
"model.matrix.default"         0.10     0.01       0.44      0.04
....

$by.total
                          total.time total.pct self.time self.pct
"rpart"                      1085.38     99.99      0.44     0.04
"rpart.matrix"               1081.78     99.66    294.26    27.11
"[[<-"                        786.62     72.47      0.06     0.01
"[[<-.data.frame"             786.56     72.46    786.46    72.45
"model.frame.default"           3.00      0.28      1.04     0.10
"eval"                          3.00      0.28      0.04     0.00
"eval.parent"                   3.00      0.28      0.00     0.00
"model.frame"                   3.00      0.28      0.00     0.00
"terms.formula"                 0.96      0.09      0.96     0.09
"terms"                         0.96      0.09      0.00     0.00
"makepredictcall"               0.50      0.05      0.04     0.00
"as.list.data.frame"            0.46      0.04      0.46     0.04
"makepredictcall.default"       0.46      0.04      0.46     0.04
"as.list"                       0.46      0.04      0.00     0.00
"vapply"                        0.46      0.04      0.00     0.00
"model.matrix.default"          0.44      0.04      0.10     0.01
"[["                            0.44      0.04      0.02     0.00
"model.matrix"                  0.44      0.04      0.00     0.00
....

$sample.interval
[1] 0.02

$sampling.time
[1] 1085.5

Note from the above that a big chunk of time is spent in function rpart.matrix:

> rpart:::rpart.matrix
function (frame) 
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, 
        "terms"))) 
        return(as.matrix(frame))
    for (i in 1:ncol(frame)) {
        if (is.character(frame[[i]])) 
            frame[[i]] <- as.numeric(factor(frame[[i]]))
        else if (!is.numeric(frame[[i]])) 
            frame[[i]] <- as.numeric(frame[[i]])
    }
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}

But it is the for loop in that function where most of the time is spent: each column is converted and then assigned back into the data frame one at a time, and those repeated [[<- assignments are exactly the [[<-.data.frame calls that dominate the profile above.
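To see that effect in isolation, here is a minimal sketch on synthetic data (the row and column counts are invented for illustration, and the size of the gap will depend on your R version):

# synthetic character data frame, 100 rows x 5000 columns
d <- as.data.frame(matrix(sample(letters[1:4], 5e5, replace = TRUE), ncol = 5000),
                   stringsAsFactors = FALSE)
d1 <- d
d2 <- d

# column-by-column conversion and reassignment, as in rpart.matrix's for loop
system.time(for (i in seq_along(d1)) d1[[i]] <- as.numeric(factor(d1[[i]])))

# the same conversion, replacing all columns in a single assignment
system.time(d2[] <- lapply(d2, function(x) as.numeric(factor(x))))

identical(d1, d2)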

Just building on @gavin simpson's discovery above... I decided to hack around with rpart.matrix, to see if I could do something about that excessive execution time.

The problem boils down to the use of a for loop. Normally I'm agnostic about for compared to [sl]apply; the latter is generally considered more elegant, but I'm not going to replace a for when it's working fine, just for that. In particular I think the performance benefits of *apply are sometimes overhyped; for has been improved significantly in terms of speed and memory use compared to the old S-Plus days.

Not in this case, though. Simply replacing the for loop with lapply cuts the run time for this example by more than two orders of magnitude. It would be good to see if others can confirm this.

m <- model.frame(x.fac)


# call rpart.matrix
system.time(mm <- rpart:::rpart.matrix(m))
   user  system elapsed 
 208.25   88.03  296.99 


# exactly the same as rpart.matrix, but with for replaced by lapply
f <- function(frame)
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, 
        "terms"))) 
        return(as.matrix(frame))
    frame[] <- lapply(frame, function(x) {
        if (is.character(x))
            as.numeric(factor(x))
        else if(!is.numeric(x))
            as.numeric(x)
        else x
    })
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}

system.time(mm2 <- f(m))
   user  system elapsed 
   0.65    0.04    0.70 


identical(mm, mm2)
[1] TRUE
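
If you want rpart() itself to pick up this speedup without editing the package source, one possibility (a sketch for interactive testing only, assuming the patched function f above has been defined) is to swap the internal function in the loaded namespace:

# development/testing only: replace rpart's internal rpart.matrix with the
# lapply version defined above; the patch lasts until the package is reloaded
assignInNamespace("rpart.matrix", f, ns = "rpart")
system.time(rpart(y ~ ., data = x.fac, method = 'anova'))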