R - caret createDataPartition returns more samples than expected

Posted by 我们两清 on 2020-07-18 20:18:55

Question


I'm trying to split the iris dataset into a training set and a test set. I used createDataPartition() like this:

library(caret)
createDataPartition(iris$Species, p=0.1)
# [1]  12  22  26  41  42  57  63  79  89  93 114 117 134 137 142

createDataPartition(iris$Sepal.Length, p=0.1)
# [1]   1  27  44  46  54  68  72  77  83  84  93  99 104 109 117 132 134

I understand the first result: the vector has 0.1 * 150 = 15 elements (150 is the number of samples in the dataset). I expected the second call to return a vector of the same length, but it returns 17 elements instead of 15.

Any ideas as to why I get these results?


Answer 1:


Sepal.Length is a numeric feature; from the online documentation:

For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument.

groups: for numeric y, the number of breaks in the quantiles

with default value:

groups = min(5, length(y))

Here is what happens in your case:

Since you do not specify groups, it defaults to min(5, 150) = 5 breaks. In this case the breaks coincide with the quartile boundaries, i.e. the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum, which you can see from the summary:

> summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900 
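The same breaks can be reproduced directly with quantile(). This is a minimal sketch in base R, assuming that 5 breaks correspond to 5 equally spaced probabilities between 0 and 1; the variable names are illustrative and not taken from caret:

```r
# Illustrative sketch: reproduce the 5 breaks for iris$Sepal.Length.
# Assumption: 'groups' breaks map to equally spaced probabilities in [0, 1].
y <- iris$Sepal.Length
groups <- min(5, length(y))                        # default from the documentation
probs <- seq(0, 1, length.out = groups)            # 0, 0.25, 0.5, 0.75, 1
breaks <- quantile(y, probs = probs)
print(breaks)
```

The printed values should match the Min., 1st Qu., Median, 3rd Qu. and Max. entries of the summary above.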

For numeric features, the function samples a fraction p = 0.1 from each of the 4 intervals defined by the above breaks; let's see how many samples fall into each interval:

l1 = length(which(iris$Sepal.Length >= 4.3 & iris$Sepal.Length <= 5.1)) # 41
l2 = length(which(iris$Sepal.Length > 5.1 & iris$Sepal.Length <= 5.8))  # 39
l3 = length(which(iris$Sepal.Length > 5.8 & iris$Sepal.Length <= 6.4))  # 35
l4 = length(which(iris$Sepal.Length > 6.4 & iris$Sepal.Length <= 7.9))  # 35
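The four counts can also be obtained in one step with cut() and table(). This is a sketch under the assumption that the intervals are closed on the right and that the lowest value is included in the first interval, which is what the manual comparisons above encode:

```r
# Illustrative sketch: bin Sepal.Length at the quartile breaks and count per bin.
# Assumption: right-closed intervals, with the minimum included in the first bin.
y <- iris$Sepal.Length
breaks <- quantile(y, probs = seq(0, 1, length.out = 5))
bins <- cut(y, breaks = unique(breaks), include.lowest = TRUE)
counts <- table(bins)
print(counts)
```

The counts come out as 41, 39, 35 and 35, which add up to the 150 rows of the dataset.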

Exactly how many samples will be returned from each interval? Here is the catch: according to the source code (line 140), it is the ceiling of the product of the number of samples in the interval and your p. Let's see what this gives in your case for p = 0.1:

p = 0.1  # the proportion passed to createDataPartition()
ceiling(l1*p) + ceiling(l2*p) + ceiling(l3*p) + ceiling(l4*p)
# 17

Bingo! :)
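To see both results from the question side by side, the per-group ceiling rule can be wrapped in a small function. Note that expected_size() is a hypothetical helper written for this illustration, not a caret function, and it only mirrors the rounding described above rather than the actual sampling:

```r
# Hypothetical helper (not part of caret): partition size when each group
# contributes ceiling(n_group * p) samples.
expected_size <- function(group, p) {
  sum(ceiling(table(group) * p))
}

p <- 0.1
y <- iris$Sepal.Length

# Factor outcome: the groups are the 3 species, 50 rows each.
n_species <- expected_size(iris$Species, p)

# Numeric outcome: the groups are the 4 quartile intervals (41, 39, 35, 35 rows).
bins <- cut(y, unique(quantile(y, probs = seq(0, 1, length.out = 5))),
            include.lowest = TRUE)
n_length <- expected_size(bins, p)

print(c(Species = n_species, Sepal.Length = n_length))
```

For Species every group gives ceiling(5) = 5, so no rounding up occurs and the total is 15. For Sepal.Length the groups give 5 + 4 + 4 + 4 = 17, so the two extra samples come entirely from rounding up within each interval.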



Source: https://stackoverflow.com/questions/46581379/r-caret-createdatapartition-returns-more-samples-than-expected
