subsampling

How can the scikit-learn Random Forest sub-sample size be equal to the original training data size?

我们两清 submitted on 2019-12-03 15:03:38
In the documentation of the scikit-learn Random Forest classifier, it is stated that "The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default)." What I don't understand is this: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There seems to be no selection, because we use all of the (and naturally the same) samples for each training run. Am I missing something here? Lol4t0: I believe this part of the docs answers your question: In random forests (see RandomForestClassifier
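The key phrase in the quoted documentation is "with replacement": a bootstrap sample has as many rows as the input, but some rows appear several times and others not at all. The following is a minimal sketch of that idea using only the Python standard library. It is not scikit-learn's internal code, and the helper name `bootstrap_indices` is hypothetical.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices from range(n_samples) WITH replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed, only for repeatability of the sketch
n_samples = 10_000

indices = bootstrap_indices(n_samples, rng)

# The sample has the same size as the input, but duplicates are allowed,
# so only part of the original rows actually appear in it.
unique_fraction = len(set(indices)) / n_samples

# Probability that a given row is drawn at least once: 1 - (1 - 1/n)^n,
# which tends to 1 - 1/e (about 0.632) as n grows.
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples

print(f"distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"expected fraction for this n: {expected_fraction:.3f}")
```

Each tree therefore sees a different random subset of roughly 63% of the distinct rows, even though its sample has the same length as the training set.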

How can I subsample an array according to its density? (Remove frequent values, keep rare ones)

◇◆丶佛笑我妖孽 submitted on 2019-11-28 10:37:35
I want to plot a data distribution in which some values occur frequently while others are quite rare. The total number of points is around 30,000. Rendering such a plot as a PNG or (god forbid) a PDF takes forever, and the PDF is much too large to display. So I want to subsample the data just for the plots. What I would like to achieve is to remove many of the points where they overlap (where the density is high), but keep the ones where the density is low with a probability close to 1. Now, numpy.random.choice allows one to specify a vector of probabilities, which I've computed
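One way to get the behaviour described above (thin out dense regions, keep sparse points almost surely) is to decide per point, with a keep probability that shrinks as the local density grows. The sketch below is an assumption about how this could be done, not the approach from the truncated post: it estimates density with fixed-width bins on 1-D data and uses only the standard library. The names `thin_by_density`, `bin_width` and `target_per_bin` are hypothetical.

```python
import random
from collections import Counter


def thin_by_density(values, bin_width, target_per_bin, rng):
    """Keep each point with probability min(1, target_per_bin / bin_count).

    Points in bins with at most target_per_bin members are always kept;
    crowded bins are thinned to roughly target_per_bin points on average.
    """
    bins = [int(v // bin_width) for v in values]
    counts = Counter(bins)
    kept = []
    for v, b in zip(values, bins):
        keep_prob = min(1.0, target_per_bin / counts[b])
        if rng.random() < keep_prob:
            kept.append(v)
    return kept


rng = random.Random(0)  # fixed seed, only for repeatability of the sketch

# Made-up data: one dense cluster plus a few rare values far away from it.
dense = [rng.gauss(0.0, 1.0) for _ in range(30_000)]
sparse = [rng.uniform(10.0, 20.0) for _ in range(30)]
data = dense + sparse

kept = thin_by_density(data, bin_width=0.5, target_per_bin=50, rng=rng)
print(f"kept {len(kept)} of {len(data)} points")
```

If a fixed output size is needed instead, the same inverse-density weights could be normalized to sum to 1 and passed as the `p` argument of numpy.random.choice with `replace=False`.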

Scikit-learn balanced subsampling

回眸只為那壹抹淺笑 submitted on 2019-11-27 17:49:55
I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it myself? Any pointers to code that does this? These subsamples should be random and can overlap, as I feed each one to a separate classifier in a very large ensemble of classifiers. In Weka there is a tool called SpreadSubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample (I know about weighting, but that's not what I'm looking for.) mikkom: Here is my first version that seems to be
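For illustration, here is a minimal standard-library sketch of balanced subsampling as described in the question: every class contributes the same number of rows, drawn without replacement inside one subsample, while separate subsamples may overlap. This is not mikkom's truncated version and not a scikit-learn API; `balanced_subsample` and `per_class` are hypothetical names.

```python
import random
from collections import defaultdict


def balanced_subsample(labels, rng, per_class=None):
    """Return row indices with an equal number of rows from each class.

    per_class defaults to the size of the smallest class and is capped
    at that size, so sampling without replacement is always possible.
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    smallest = min(len(idx) for idx in by_class.values())
    k = smallest if per_class is None else min(per_class, smallest)

    chosen = []
    for idx in by_class.values():
        chosen.extend(rng.sample(idx, k))  # no repeats within one class
    rng.shuffle(chosen)
    return chosen


rng = random.Random(0)  # fixed seed, only for repeatability of the sketch
labels = [0] * 900 + [1] * 100  # made-up unbalanced labels

# N = 5 independent subsamples; rows may recur across subsamples.
subsamples = [balanced_subsample(labels, rng) for _ in range(5)]
for s in subsamples:
    print(len(s), sum(labels[i] for i in s))
```

The returned indices can be used to select rows from a feature array or DataFrame before fitting each classifier in the ensemble.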

Scikit-learn balanced subsampling

筅森魡賤 submitted on 2019-11-26 19:14:39
Question: I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it myself? Any pointers to code that does this? These subsamples should be random and can overlap, as I feed each one to a separate classifier in a very large ensemble of classifiers. In Weka there is a tool called SpreadSubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample (I know