Question
I need to categorize a continuous variable into 4 classes, each with the same number of observations. I have used the function
cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)), include.lowest = TRUE, right = FALSE)
My problem is that the number of observations in each category is not exactly the same, because several observations have exactly the same value as one of the quantiles. How can I get groups of exactly equal size?
My variable is waiting
[1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78 69 74
[26] 83 55 76 78 79 73 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83 64 53 82 59
[51] 75 90 54 80 54 83 71 64 77 81 59 84 48 82 60 92 78 78 65 73 82 56 79 71 62
[76] 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90 50 78 63 72 84 75 51 82
[101] 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59 81 50 85 59 87 53 69 77 56 88
[126] 81 45 82 55 90 45 83 56 89 46 82 51 86 53 79 81 60 82 77 76 59 80 49 96 53
[151] 77 77 65 81 71 70 81 93 53 89 45 86 58 78 66 76 63 88 52 93 49 57 77 68 81
[176] 81 73 50 85 74 55 77 83 83 51 78 84 46 83 55 81 57 76 84 77 81 87 77 51 78
[201] 60 82 91 53 78 46 77 84 49 83 71 80 49 75 64 76 53 94 55 76 50 82 54 75 78
[226] 79 78 78 70 79 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82 67 74
[251] 54 83 73 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74
This variable is in the faithful dataset in R. It has 272 observations, which is divisible by 4, so each category should contain 68 observations.
I have used
newwait <- cut(waiting, breaks = quantile(waiting, probs = seq(0, 1, 0.25)), include.lowest = TRUE, right = FALSE)
table(newwait)
newwait
[43,58) [58,76) [76,82) [82,96]
     66      68      67      71
As you can see, the number of observations in each group is similar but not exactly the same.
Answer 1:
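Basically, it sounds like you need to deal with ties: when several observations share the value of a breakpoint, cut puts all of them in the same interval, so the group sizes drift apart. You also need a vector whose length is divisible by 4, but I'll assume you know that.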
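To see where the ties bite, a quick check along these lines may help. It is only a sketch: it assumes the variable is faithful$waiting from the question, and tie_counts is a name introduced here for illustration.

```r
# faithful ships with base R (package 'datasets')
waiting <- faithful$waiting

# Four equal groups can only exist if the length is divisible by 4
length(waiting) %% 4 == 0

# Quartile breakpoints, as passed to cut() in the question
q <- quantile(waiting, probs = seq(0, 1, 0.25))

# Number of observations sitting exactly on each interior breakpoint
# (tie_counts is an illustrative name, not from the original post)
tie_counts <- sapply(q[2:4], function(b) sum(waiting == b))
tie_counts
```

Any breakpoint with a count above 1 is a place where cut has to push a whole block of equal values into a single interval.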
Here's a solution using the tie-breaking options of rank:
set.seed(1)
x <- round(runif(1000, 0, 1), 1)   # rounding creates many ties
table(x)
## x
##   0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1
##  43 106  95 103 112 109  82 102  95 100  53
y <- rank(x, ties.method = 'first')   # <- this forces tie breaks
cuts <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE,
            right = FALSE)
# check that cuts are all the same length:
lapply(split(x, cuts), length)
## $`[1,251)`
## [1] 250
## $`[251,500)`
## [1] 250
## $`[500,750)`
## [1] 250
## $`[750,1e+03]`
## [1] 250
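Applied to the waiting variable from the question, the same idea might look like the sketch below. It assumes waiting is faithful$waiting; the names r and grp are illustrative and do not come from the original post. The integer-division line is an alternative way to label the groups, not something from the answer above.

```r
# faithful ships with base R (package 'datasets')
waiting <- faithful$waiting

# Rank with ties broken by order of appearance: a permutation of 1..272
r <- rank(waiting, ties.method = "first")

# Cut the ranks (not the raw values) at their quartiles
newwait <- cut(r, breaks = quantile(r, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE, right = FALSE)
table(newwait)   # each of the four groups holds 68 observations

# Alternative labelling (an assumption of this sketch):
# plain group ids 1..4 via integer division on the ranks
grp <- (r - 1L) %/% (length(r) %/% 4L) + 1L
table(grp)

# Range of the original waiting values that fall in each group
tapply(waiting, grp, range)
```

Note that ties.method = "first" breaks ties by order of appearance, so observations with the same waiting value can end up in different groups. That is unavoidable if the groups must be exactly equal in size. ties.method = "random" is another option if the order-of-appearance rule is undesirable.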
Source: https://stackoverflow.com/questions/19883443/how-to-categorize-a-continuous-variable-in-4-groups-of-the-same-size-in-r