问题
I would like to inspect all the observations that reached some node in an rpart decision tree. For example, in the following code:
fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
fit
n= 81
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 81 17 absent (0.79012346 0.20987654)
2) Start>=8.5 62 6 absent (0.90322581 0.09677419)
4) Start>=14.5 29 0 absent (1.00000000 0.00000000) *
5) Start< 14.5 33 6 absent (0.81818182 0.18181818)
10) Age< 55 12 0 absent (1.00000000 0.00000000) *
11) Age>=55 21 6 absent (0.71428571 0.28571429)
22) Age>=111 14 2 absent (0.85714286 0.14285714) *
23) Age< 111 7 3 present (0.42857143 0.57142857) *
3) Start< 8.5 19 8 present (0.42105263 0.57894737) *
I would like to see all the observations in node (5) (i.e.: the 33 observations for which Start>=8.5 & Start< 14.5). Obviously I could manually get to them. But I would like to have some function like (say) "get_node_date". For which I could just run get_node_date(5) - and get the relevant observations.
Any suggestions on how to go about this?
回答1:
There seems to be no such function which enables an extraction of the observations from a specific node. I would solve it as follows: first determine which rule/s is/are used for the node you are insterested in. You can use path.rpart
for it. Then you could apply the rule/s one after the other to extract the observations.
This approach as a function:
get_node_date <- function(tree = fit, node = 5){
rule <- path.rpart(tree, node)
rule_2 <- sapply(rule[[1]][-1], function(x) strsplit(x, '(?<=[><=])(?=[^><=])|(?<=[^><=])(?=[><=])', perl = TRUE))
ind <- apply(do.call(cbind, lapply(rule_2, function(x) eval(call(x[2], kyphosis[,x[1]], as.numeric(x[3]))))), 1, all)
kyphosis[ind,]
}
For node 5 you get:
get_node_date()
node number: 5
root
Start>=8.5
Start< 14.5
Kyphosis Age Number Start
2 absent 158 3 14
10 present 59 6 12
11 present 82 5 14
14 absent 1 4 12
18 absent 175 5 13
20 absent 27 4 9
23 present 96 3 12
26 absent 9 5 13
28 absent 100 3 14
32 absent 125 2 11
33 absent 130 5 13
35 absent 140 5 11
37 absent 1 3 9
39 absent 20 6 9
40 present 91 5 12
42 absent 35 3 13
46 present 139 3 10
48 absent 131 5 13
50 absent 177 2 14
51 absent 68 5 10
57 absent 2 3 13
59 absent 51 7 9
60 absent 102 3 13
66 absent 17 4 10
68 absent 159 4 13
69 absent 18 4 11
71 absent 158 5 14
72 absent 127 4 12
74 absent 206 4 10
77 present 157 3 13
78 absent 26 7 13
79 absent 120 2 13
81 absent 36 4 13
回答2:
rpart returns rpart.object element which contains the information you need:
require(rpart)
fit2 <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
fit2
get_node_date <-function(nodeId,fit)
{
fit$frame[toString(nodeId),"n"]
}
for (i in c(1,2,4,5,10,11,22,23,3) )
cat(get_node_date(i,fit2),"\n")
回答3:
The partykit
package also provides a canned solution for this. You just need to convert the rpart
object to the party
class in order to use its unified interface for dealing with trees. And then you can use the data_party()
function.
Using the fit
from the question and having loaded library("partykit")
you can first coerce the rpart
tree to party
:
pfit <- as.party(fit)
plot(pfit)
There are only two small nuisances for extracting the data in the way you want: (1) The model.frame()
from the original fit is always dropped in the coercion and needs to be reattached manually. (2) A different numbering scheme is used for the nodes. You want node 4 (rather than 5) now.
pfit$data <- model.frame(fit)
data4 <- data_party(pfit, 4)
dim(data4)
## [1] 33 5
head(data4)
## Kyphosis Age Start (fitted) (response)
## 2 absent 158 14 7 absent
## 10 present 59 12 8 present
## 11 present 82 14 8 present
## 14 absent 1 12 5 absent
## 18 absent 175 13 7 absent
## 20 absent 27 9 5 absent
Another route is to subset the subtree starting from node 4 and then taking the data from that:
pfit4 <- pfit[4]
plot(pfit4)
Then data_party(pfit4)
gives you the same as data4
above. And pfit4$data
gives you the data without the (fitted)
node and the predicted (response)
.
回答4:
Yet another way, this works by finding all of the terminal nodes of any particular node and returning the subset of data used in the call.
fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
head(subset.rpart(fit, 5))
# Kyphosis Age Number Start
# 2 absent 158 3 14
# 10 present 59 6 12
# 11 present 82 5 14
# 14 absent 1 4 12
# 18 absent 175 5 13
# 20 absent 27 4 9
subset.rpart <- function(tree, node = 1L) {
data <- eval(tree$call$data, parent.frame(1L))
wh <- sapply(as.integer(rownames(tree$frame)), parent)
wh <- unique(unlist(wh[sapply(wh, function(x) node %in% x)]))
data[rownames(tree$frame)[tree$where] %in% wh[wh >= node], ]
}
parent <- function(x) {
if (x[1] != 1)
c(Recall(if (x %% 2 == 0L) x / 2 else (x - 1) / 2), x) else x
}
回答5:
Two years after original post, but may be of use to others. Node assignments for training observations in rpart can be obtained from $where
:
fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
fit$where
As a function:
get_node <- function(rpart.object=fit, data=kyphosis, node.number=5) {
data[which(fit$where == node.number),]
}
get_node()
This works for training observations only though, not for new observations.
来源:https://stackoverflow.com/questions/36748531/getting-the-observations-in-a-rparts-node-i-e-cart