问题
I want to measure the features importance with the cforest function from the party library.
My output variable has something like 2000 samples in class 0 and 100 samples in class 1.
I think a good way to avoid bias due to class unbalance is to train each tree of the forest using a subsample such that the number of elements of class 1 is the same of the number of element in class 0.
Is there anyway to do that? I am thinking to an option like n_samples = c(20, 20)
EDIT: An example of code
> iris.cf <- cforest(Species ~ ., data = iris,
+ control = cforest_unbiased(mtry = 2)) #<--- Here I would like to train the forest using a balanced subsample of the data
> varimp(object = iris.cf)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.048981818 0.002254545 0.305818182 0.271163636
>
EDIT: Maybe my question is not clear enough. Random forest is a set of decision trees. In general the decision trees are constructed using only a random subsample of the data. I would like that the used subsample has the same numbers of element in the class 1 and in the class 0.
EDIT: The function that I am looking for is for sure available in the randomForest package
sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
I need the same for the party package. Is there any way to get it?
回答1:
I will assume you know what you want to accomplish, but don't know enough R to do that.
Not sure if the function provides balancing of data as an argument, but you can do it manually. Below is the code I quickly threw together. More elegant solution might exist.
# just in case
myData <- iris
# replicate everything *10* times. Replicate is just a "loop 10 times".
replicate(10,
{
# split dataset by class and add separate classes to list
splitList <- split(myData, myData$Species)
# sample *20* random rows from each matrix in a list
sampledList <- lapply(splitList, function(dat) { dat[sample(20),] })
# combine sampled rows to a data.frame
sampledData <- do.call(rbind, sampledList)
# your code below
res.cf <- cforest(Species ~ ., data = sampledData,
control = cforest_unbiased(mtry = 2)
)
varimp(object = res.cf)
}
)
Hope you can take it from here.
来源:https://stackoverflow.com/questions/26393675/cforest-party-unbalanced-classes