random-forest

All probability values are less than 0.5 on unseen data

泄露秘密 Submitted on 2021-01-28 23:25:13
Question: I have 15 features with a binary response variable, and I am interested in predicting probabilities rather than 0 or 1 class labels. When I trained and tested the RF model with 500 trees, CV, balanced class weights, and balanced samples in the data frame, I achieved good accuracy and a good Brier score. As you can see in the image, the predicted probability values of class 1 on the test data are between 0 and 1. Here is the histogram of predicted probabilities on the test data: with …
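A minimal sklearn sketch of the setup described above, on synthetic data with illustrative parameter values only (the asker's exact features, CV scheme, and balancing are not reproduced here):

```python
# Minimal sketch: fit a class-weighted random forest on synthetic data and
# inspect the spread of predicted probabilities and the Brier score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                            random_state=0)
rf.fit(X_train, y_train)

proba = rf.predict_proba(X_test)[:, 1]   # predicted probability of class 1
print("min/max predicted probability:", proba.min(), proba.max())
print("Brier score:", brier_score_loss(y_test, proba))
```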

Why do RandomForestClassifier on CPU (using SKLearn) and on GPU (using RAPIDs) get very different scores?

拟墨画扇 Submitted on 2021-01-28 18:42:18
Question: I am using RandomForestClassifier on CPU with SKLearn and on GPU using RAPIDs. I am benchmarking these two libraries for speed-up and scoring on the Iris dataset (this is a first try; in the future I will change the dataset for better benchmarking, but I am starting with these two libraries). The problem is that when I measure the score on CPU I always get a value of 1.0, but when I measure the score on GPU I get a value that varies between 0.2 and 1.0, and I do not understand why that could be.
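A hedged sketch of the kind of side-by-side run described, assuming a RAPIDS/cuML environment with a GPU is available; hyperparameter values are illustrative, and the two libraries use different internal defaults, so exact score agreement should not be expected:

```python
# CPU (sklearn) vs. GPU (cuML) comparison on Iris; the cuML part requires a
# RAPIDS install with a GPU. Parameter values are illustrative only.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier as skRF
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

cpu_rf = skRF(n_estimators=100, random_state=0).fit(X_train, y_train)
print("sklearn accuracy:", cpu_rf.score(X_test, y_test))

# cuML expects float32 features and integer labels, and its split-finding
# defaults differ from sklearn's, which can move the score around.
from cuml.ensemble import RandomForestClassifier as cuRF

gpu_rf = cuRF(n_estimators=100)
gpu_rf.fit(X_train.astype(np.float32), y_train.astype(np.int32))
print("cuML accuracy:", gpu_rf.score(X_test.astype(np.float32),
                                     y_test.astype(np.int32)))
```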

R: variable exclusion from formula not working in presence of missing data

狂风中的少年 Submitted on 2021-01-28 14:12:49
Question: I'm building a model in R while excluding the 'office' column from the formula (it sometimes contains hints of the class I predict). I'm training on 'train' and predicting on 'test': > model <- randomForest::randomForest(tc ~ . - office, data=train, importance=TRUE, proximity=TRUE) > prediction <- predict(model, test, type = "class") The prediction resulted in all NAs: > head(prediction) [1] <NA> <NA> <NA> <NA> <NA> <NA> Levels: 2668 2752 2921 3005 The reason is that test$office contains NAs: > …
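For comparison, a rough pandas/sklearn analogue (not the R randomForest formula interface used above) of excluding the leaking column before modelling, so that missing values in it cannot affect predictions; the tiny data frame is invented for illustration:

```python
# Pandas/sklearn analogue: drop the leaking 'office' column from both splits so
# that NAs in it can never turn predictions into missing values.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.DataFrame({"office": ["a", None, "b", "a"],
                      "x1": [1.0, 2.0, 3.0, 4.0],
                      "tc": [0, 1, 0, 1]})
test = pd.DataFrame({"office": [None, "b"], "x1": [1.5, 3.5]})

features = [c for c in train.columns if c not in ("tc", "office")]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["tc"])
print(model.predict(test[features]))   # 'office' (and its NAs) never enters the model
```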

sklearn: use RandomizedSearchCV with custom metrics and catch exceptions

為{幸葍}努か Submitted on 2021-01-28 12:35:42
Question: I am using the RandomizedSearchCV function in sklearn with a Random Forest Classifier. To see different metrics, I am using custom scoring: from sklearn.metrics import make_scorer, roc_auc_score, recall_score, matthews_corrcoef, balanced_accuracy_score, accuracy_score acc = make_scorer(accuracy_score) auc_score = make_scorer(roc_auc_score) recall = make_scorer(recall_score) mcc = make_scorer(matthews_corrcoef) bal_acc = make_scorer(balanced_accuracy_score) scoring = {"roc_auc_score": auc …
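A hedged sketch of a multi-metric RandomizedSearchCV of the kind described; the parameter grid, the refit choice, and the error_score handling are illustrative assumptions, not the asker's exact configuration:

```python
# Multi-metric randomized search over a random forest; error_score makes
# failing parameter combinations score NaN instead of raising.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef, balanced_accuracy_score
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=15, random_state=0)

scoring = {
    "roc_auc_score": "roc_auc",                      # built-in scorer string
    "mcc": make_scorer(matthews_corrcoef),           # custom scorers via make_scorer
    "bal_acc": make_scorer(balanced_accuracy_score),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [100, 300, 500],
                         "max_depth": [None, 5, 10]},
    scoring=scoring,
    refit="mcc",            # with multiple metrics, refit must name one of them
    n_iter=5,
    error_score=np.nan,     # failed fits are recorded as NaN, not exceptions
    random_state=0,
)
search.fit(X, y)
print(search.best_params_,
      search.cv_results_["mean_test_mcc"][search.best_index_])
```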

Random forest tree growing algorithm

为君一笑 Submitted on 2021-01-28 04:05:05
Question: I'm doing a Random Forest implementation (for classification), and I have some questions regarding the tree-growing algorithm mentioned in the literature. When training a decision tree, there are two criteria for stopping tree growth: a. Stop when there are no more features left to split a node on. b. Stop when all samples in a node belong to the same class. Based on that: 1. Consider growing one tree in the forest. When splitting a node of the tree, I randomly select m of the M total …
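A minimal sketch of the per-node feature sub-sampling step referred to above (choosing m of the M features at random for each split); the helper name and the default m = sqrt(M) are assumptions for illustration:

```python
# At every node split, only a random subset of m out of M features is
# considered; a fresh subset is drawn for each node.
import numpy as np

def candidate_features(n_total_features, m=None, rng=None):
    """Pick the m feature indices evaluated at one node split (m defaults to sqrt(M))."""
    rng = np.random.default_rng() if rng is None else rng
    if m is None:
        m = max(1, int(np.sqrt(n_total_features)))
    return rng.choice(n_total_features, size=m, replace=False)

rng = np.random.default_rng(0)
M = 15
print(candidate_features(M, rng=rng))   # e.g. 3 feature indices for this node
print(candidate_features(M, rng=rng))   # a different draw for the next node
```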

cforest party unbalanced classes

十年热恋 Submitted on 2021-01-27 08:59:27
Question: I want to measure feature importance with the cforest function from the party library. My output variable has roughly 2000 samples in class 0 and 100 samples in class 1. I think a good way to avoid bias due to the class imbalance is to train each tree of the forest on a subsample in which the number of elements of class 1 equals the number of elements of class 0. Is there any way to do that? I am thinking of an option like n_samples = c(20, 20). EDIT: An example of code > …
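A Python/sklearn analogue of the balanced-subsample idea, shown instead of party::cforest for consistency with the other sketches; class_weight="balanced_subsample" reweights classes inside every bootstrap sample, which approximates training each tree on balanced data:

```python
# Imbalanced synthetic data (~95% class 0, ~5% class 1) with per-tree
# balanced class weights; feature importances come out as usual.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2100, weights=[0.95, 0.05], random_state=0)
print("class counts:", Counter(y))

rf = RandomForestClassifier(n_estimators=300,
                            class_weight="balanced_subsample",
                            random_state=0)
rf.fit(X, y)
print("feature importances:", rf.feature_importances_[:5])
```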

How to run a random classifier in the following case

感情迁移 Submitted on 2021-01-05 08:54:33
Question: I am experimenting with a sentiment analysis case and I am trying to run a random classifier on the following:

|Topic             |value|label|
|Apples are great  |-0.99|0    |
|Balloon is red    |-0.98|1    |
|cars are running  |-0.93|0    |
|dear diary        |0.8  |1    |
|elephant is huge  |0.91 |1    |
|facebook is great |0.97 |0    |

After splitting it into train and test with the sklearn library, I am doing the following to the Topic column so the count vectorizer can work on it: x = train.iloc[:,0:2] #except for alphabets …
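One hedged way to sketch this in sklearn, combining the text column and the numeric column with a ColumnTransformer before a random forest; this is my own assumption about the intended pipeline, not necessarily the asker's code:

```python
# Combine a raw text column (via CountVectorizer) with a numeric column
# before fitting a random forest on the data shown above.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    "Topic": ["Apples are great", "Balloon is red", "cars are running",
              "dear diary", "elephant is huge", "facebook is great"],
    "value": [-0.99, -0.98, -0.93, 0.8, 0.91, 0.97],
    "label": [0, 1, 0, 1, 1, 0],
})

pre = ColumnTransformer([
    ("text", CountVectorizer(), "Topic"),   # vectorize the raw text column
    ("num", "passthrough", ["value"]),      # keep the numeric score as-is
])
clf = Pipeline([("pre", pre),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=0))])
clf.fit(df[["Topic", "value"]], df["label"])
print(clf.predict(df[["Topic", "value"]]))
```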

How can I get the OOB samples used for each tree in a random forest model in R?

不羁岁月 Submitted on 2021-01-04 05:55:46
Question: Is it possible to get the OOB samples used by the random forest algorithm for each tree? I'm using the R language. I know that the random forest algorithm uses roughly 66% of the data (selected randomly) to grow each tree and the remaining ~34% as OOB samples to measure the OOB error, but I don't know how to get those OOB samples for each tree. Any idea? Answer 1: Assuming you are using the randomForest package, you just need to set the keep.inbag argument to TRUE. library(randomForest) set.seed(1) rf <- …
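A rough Python analogue, shown instead of the R keep.inbag solution quoted above: sklearn's BaggingClassifier exposes estimators_samples_, from which each tree's OOB indices can be derived.

```python
# Per-tree OOB samples from a bagged ensemble: everything not drawn into a
# tree's bootstrap sample is out-of-bag for that tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                        bootstrap=True, random_state=0).fit(X, y)

all_idx = np.arange(len(X))
for i, in_bag in enumerate(bag.estimators_samples_):
    oob_idx = np.setdiff1d(all_idx, in_bag)   # indices never drawn for tree i
    print(f"tree {i}: {len(oob_idx)} OOB samples")
```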
