Question
I want to draw a random sample from my dataset, using a different proportion for each value of a factor variable, as well as weights stored in another column. A dplyr solution using pipes is preferred, as it can be inserted easily into a long pipeline.
Let's take the iris dataset as an example. The Species column has three values with 50 rows each. Let's also assume the sample weights are stored in the Sepal.Length column. If I have to sample equal proportions (or equal numbers of rows) per species, the problem is easy to solve:
library(tidyverse)
iris %>% group_by(Species) %>% slice_sample(prop = 0.1, weight_by = Sepal.Length)
# A tibble: 15 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.4 3.7 1.5 0.2 setosa
2 5.3 3.7 1.5 0.2 setosa
3 5.7 4.4 1.5 0.4 setosa
4 5 3.5 1.6 0.6 setosa
5 4.8 3.1 1.6 0.2 setosa
6 6.1 2.9 4.7 1.4 versicolor
7 6.7 3.1 4.7 1.5 versicolor
8 5 2 3.5 1 versicolor
9 7 3.2 4.7 1.4 versicolor
10 5.7 2.9 4.2 1.3 versicolor
11 7.2 3.2 6 1.8 virginica
12 6.7 2.5 5.8 1.8 virginica
13 6.4 2.8 5.6 2.1 virginica
14 6.3 3.3 6 2.5 virginica
15 7.2 3 5.8 1.6 virginica
But I get stuck when I need to sample a different proportion for each species, say 10%, 20%, and 25% respectively.
iris %>% group_by(Species) %>% slice_sample(prop = c(0.1, 0.2, 0.25), weight_by = Sepal.Length)
#Error: `prop` must be a single number
OR
iris %>% group_split(Species) %>% map_df(c(0.1, 0.2, 0.25), ~ slice_sample(prop = ., weight_by = Sepal.Length))
# A tibble: 0 x 0
Please help
Answer 1:
If I understand you right:
iris %>%
group_split(Species) %>%
map2(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))
[[1]]
# A tibble: 5 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 4.9 3 1.4 0.2 setosa
2 4.8 3 1.4 0.1 setosa
3 5.2 4.1 1.5 0.1 setosa
4 5 3.5 1.6 0.6 setosa
5 5.2 3.5 1.5 0.2 setosa
[[2]]
# A tibble: 10 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 6.3 2.5 4.9 1.5 versicolor
2 5.5 2.6 4.4 1.2 versicolor
3 6.9 3.1 4.9 1.5 versicolor
4 6.6 2.9 4.6 1.3 versicolor
5 6.1 3 4.6 1.4 versicolor
6 5.7 2.8 4.5 1.3 versicolor
7 6.7 3.1 4.4 1.4 versicolor
8 5.1 2.5 3 1.1 versicolor
9 5.7 3 4.2 1.2 versicolor
10 7 3.2 4.7 1.4 versicolor
[[3]]
# A tibble: 12 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 6.4 3.2 5.3 2.3 virginica
2 7.2 3.2 6 1.8 virginica
3 6.3 3.3 6 2.5 virginica
4 6.2 2.8 4.8 1.8 virginica
5 7.6 3 6.6 2.1 virginica
6 5.7 2.5 5 2 virginica
7 4.9 2.5 4.5 1.7 virginica
8 6.7 3.1 5.6 2.4 virginica
9 7.7 2.8 6.7 2 virginica
10 6.7 3.3 5.7 2.5 virginica
11 6 3 4.8 1.8 virginica
12 5.6 2.8 4.9 2 virginica
Just change map2 to map2_df if you want a data frame returned:
iris %>%
group_split(Species) %>%
map2_df(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))
# A tibble: 27 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.7 3.8 1.7 0.3 setosa
2 4.8 3.1 1.6 0.2 setosa
3 5.1 3.8 1.5 0.3 setosa
4 4.9 3.6 1.4 0.1 setosa
5 4.8 3.4 1.6 0.2 setosa
6 5.7 2.8 4.1 1.3 versicolor
7 6.6 3 4.4 1.4 versicolor
8 6.8 2.8 4.8 1.4 versicolor
9 5.8 2.7 4.1 1 versicolor
10 6.4 3.2 4.5 1.5 versicolor
# ... with 17 more rows
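The calls above sample without the weights the question asks for. A minimal sketch of the same approach with weight_by added might look like this; the names props and weighted_sample are only illustrative, and it assumes the proportions are listed in the order of the Species factor levels (setosa, versicolor, virginica).

```r
library(dplyr)
library(purrr)

# Proportions in the order of the Species factor levels (an assumption).
props <- c(0.1, 0.2, 0.25)

weighted_sample <- iris %>%
  group_split(Species) %>%
  # .x is one species' rows, .y the matching proportion;
  # weight_by makes rows with a larger Sepal.Length more likely to be drawn.
  map2(props, ~ slice_sample(.x, prop = .y, weight_by = Sepal.Length)) %>%
  bind_rows()

# Group sizes are fixed even though the rows are random:
# 5, 10 and 12 rows, because prop * n is rounded down.
count(weighted_sample, Species)
```

map2() followed by bind_rows() is used here instead of map2_df() only to keep the row-binding step explicit; either should return one data frame.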
Answer 2:
A similar solution using purrr.
First we specify our proportions for each Species.
props <- c(setosa=0.1, versicolor=0.2, virginica=0.5)
Then we iterate over each name-value pair in props using imap. For each pair in props, we filter the rows of the data frame to keep only that species, and then sample the corresponding proportion using slice_sample.
imap_dfr(props,
~filter(iris, Species==.y) %>%
slice_sample(prop=.x))
imap_dfr then row-binds the three data frames (one for each species) into a single data frame.
Here's the result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.8 3.1 1.6 0.2 setosa
2 5.0 3.5 1.3 0.3 setosa
3 5.1 3.8 1.6 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 4.9 3.1 1.5 0.2 setosa
6 6.7 3.1 4.7 1.5 versicolor
7 5.7 2.8 4.1 1.3 versicolor
8 6.1 3.0 4.6 1.4 versicolor
9 5.6 3.0 4.5 1.5 versicolor
10 6.6 2.9 4.6 1.3 versicolor
11 5.5 2.6 4.4 1.2 versicolor
12 6.7 3.0 5.0 1.7 versicolor
13 5.7 2.6 3.5 1.0 versicolor
14 5.9 3.2 4.8 1.8 versicolor
15 5.4 3.0 4.5 1.5 versicolor
16 5.8 2.8 5.1 2.4 virginica
17 6.7 3.3 5.7 2.1 virginica
18 7.4 2.8 6.1 1.9 virginica
19 6.4 2.8 5.6 2.1 virginica
20 6.7 3.1 5.6 2.4 virginica
21 6.1 3.0 4.9 1.8 virginica
22 6.0 2.2 5.0 1.5 virginica
23 6.3 2.7 4.9 1.8 virginica
24 6.3 2.8 5.1 1.5 virginica
25 7.2 3.2 6.0 1.8 virginica
26 7.7 2.6 6.9 2.3 virginica
27 5.8 2.7 5.1 1.9 virginica
28 4.9 2.5 4.5 1.7 virginica
29 6.7 3.0 5.2 2.3 virginica
30 7.7 3.8 6.7 2.2 virginica
31 6.9 3.1 5.4 2.1 virginica
32 5.8 2.7 5.1 1.9 virginica
33 6.8 3.0 5.5 2.1 virginica
34 6.3 2.5 5.0 1.9 virginica
35 6.9 3.1 5.1 2.3 virginica
36 6.3 3.3 6.0 2.5 virginica
37 7.6 3.0 6.6 2.1 virginica
38 6.5 3.0 5.5 1.8 virginica
39 7.7 2.8 6.7 2.0 virginica
40 6.5 3.2 5.1 2.0 virginica
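The named lookup can also stay inside a single dplyr pipe. The following is a sketch rather than a definitive answer: it assumes group_modify() is acceptable in your pipeline, uses the proportions from the question (10%, 20%, 25%), and adds the Sepal.Length weights. The names props_by_species and sampled are illustrative.

```r
library(dplyr)

# Named lookup: one proportion per species (values taken from the question).
props_by_species <- c(setosa = 0.1, versicolor = 0.2, virginica = 0.25)

sampled <- iris %>%
  group_by(Species) %>%
  # In group_modify(), .x holds the rows of the current group and
  # .y is a one-row tibble containing the group key (Species).
  group_modify(~ slice_sample(
    .x,
    prop = props_by_species[[as.character(.y$Species)]],
    weight_by = Sepal.Length
  )) %>%
  ungroup()

count(sampled, Species)
```

Because the proportion is looked up by name, the result does not depend on the order in which the species appear in the data.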
Answer 3:
You can keep the information of proportion in the dataframe itself and sample rows from it.
library(dplyr)
iris %>%
distinct(Species) %>%
mutate(prop = c(0.1, 0.2, 0.25)) %>%
inner_join(iris, by = 'Species') %>%
group_by(Species) %>%
sample_n(first(prop)*n()) -> result
result %>% count(Species)
# Species n
# <fct> <int>
#1 setosa 5
#2 versicolor 10
#3 virginica 12
I expected slice_sample(prop = first(prop)) to work, but it doesn't; hence I used sample_n.
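If the proportions live in the data frame and the weights are needed as well, one possible variant is to call sample.int() inside slice(), where both the per-group proportion and the weight column are visible. This is a sketch under stated assumptions: prop_tbl and result_weighted are hypothetical names, and floor() is used so that a fractional size such as 12.5 is rounded down explicitly.

```r
library(dplyr)

# Lookup table with one proportion per species (hypothetical helper table).
prop_tbl <- tibble(
  Species = factor(c("setosa", "versicolor", "virginica"),
                   levels = levels(iris$Species)),
  prop = c(0.1, 0.2, 0.25)
)

result_weighted <- iris %>%
  inner_join(prop_tbl, by = "Species") %>%
  group_by(Species) %>%
  # Draw floor(prop * group size) row positions per group,
  # without replacement, weighted by Sepal.Length.
  slice(sample.int(n(), size = floor(first(prop) * n()), prob = Sepal.Length)) %>%
  ungroup() %>%
  select(-prop)

result_weighted %>% count(Species)
```

The group sizes should match the unweighted result above (5, 10 and 12 rows); only the probability of each row being selected changes.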
Source: https://stackoverflow.com/questions/65272564/how-can-i-draw-a-random-sample-from-a-dataset-proportionate-to-size-based-on-d