How can I draw a random sample from a dataset, proportionate to size, based on different proportions for each value of a factor variable, in R

Submitted by 倾然丶 夕夏残阳落幕 on 2021-01-01 17:51:31

Question


I want to draw a random sample from my dataset, using a different proportion for each value of a factor variable and sampling weights stored in another column. A dplyr solution using pipes is preferred, since it can be inserted easily into a longer pipeline.

Take the iris dataset as an example. The Species column has three values with 50 rows each. Assume the sampling weights are stored in the Sepal.Length column. If I sample the same proportion (or the same number of rows) from each species, the problem is easy to solve:

library(tidyverse)

iris %>% group_by(Species) %>% slice_sample(prop = 0.1, weight_by = Sepal.Length)

# A tibble: 15 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          5.4         3.7          1.5         0.2 setosa    
 2          5.3         3.7          1.5         0.2 setosa    
 3          5.7         4.4          1.5         0.4 setosa    
 4          5           3.5          1.6         0.6 setosa    
 5          4.8         3.1          1.6         0.2 setosa    
 6          6.1         2.9          4.7         1.4 versicolor
 7          6.7         3.1          4.7         1.5 versicolor
 8          5           2            3.5         1   versicolor
 9          7           3.2          4.7         1.4 versicolor
10          5.7         2.9          4.2         1.3 versicolor
11          7.2         3.2          6           1.8 virginica 
12          6.7         2.5          5.8         1.8 virginica 
13          6.4         2.8          5.6         2.1 virginica 
14          6.3         3.3          6           2.5 virginica 
15          7.2         3            5.8         1.6 virginica 

But I am stuck when I need to sample a different proportion for each species, say 10%, 20%, and 25% respectively.

iris %>% group_by(Species) %>% slice_sample(prop = c(0.1, 0.2, 0.25), weight_by = Sepal.Length)

#Error: `prop` must be a single number

OR

iris %>% group_split(Species) %>% map_df(c(0.1, 0.2, 0.25), ~ slice_sample(prop = ., weight_by = Sepal.Length))
# A tibble: 0 x 0

Please help


Answer 1:


If I understand you right:

iris %>% 
  group_split(Species) %>% 
  map2(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))

[[1]]
# A tibble: 5 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          4.9         3            1.4         0.2 setosa 
2          4.8         3            1.4         0.1 setosa 
3          5.2         4.1          1.5         0.1 setosa 
4          5           3.5          1.6         0.6 setosa 
5          5.2         3.5          1.5         0.2 setosa 

[[2]]
# A tibble: 10 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          6.3         2.5          4.9         1.5 versicolor
 2          5.5         2.6          4.4         1.2 versicolor
 3          6.9         3.1          4.9         1.5 versicolor
 4          6.6         2.9          4.6         1.3 versicolor
 5          6.1         3            4.6         1.4 versicolor
 6          5.7         2.8          4.5         1.3 versicolor
 7          6.7         3.1          4.4         1.4 versicolor
 8          5.1         2.5          3           1.1 versicolor
 9          5.7         3            4.2         1.2 versicolor
10          7           3.2          4.7         1.4 versicolor

[[3]]
# A tibble: 12 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
          <dbl>       <dbl>        <dbl>       <dbl> <fct>    
 1          6.4         3.2          5.3         2.3 virginica
 2          7.2         3.2          6           1.8 virginica
 3          6.3         3.3          6           2.5 virginica
 4          6.2         2.8          4.8         1.8 virginica
 5          7.6         3            6.6         2.1 virginica
 6          5.7         2.5          5           2   virginica
 7          4.9         2.5          4.5         1.7 virginica
 8          6.7         3.1          5.6         2.4 virginica
 9          7.7         2.8          6.7         2   virginica
10          6.7         3.3          5.7         2.5 virginica
11          6           3            4.8         1.8 virginica
12          5.6         2.8          4.9         2   virginica

Just change map2 to map2_df if you want a data frame returned:

iris %>% 
  group_split(Species) %>% 
  map2_df(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))

# A tibble: 27 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          5.7         3.8          1.7         0.3 setosa    
 2          4.8         3.1          1.6         0.2 setosa    
 3          5.1         3.8          1.5         0.3 setosa    
 4          4.9         3.6          1.4         0.1 setosa    
 5          4.8         3.4          1.6         0.2 setosa    
 6          5.7         2.8          4.1         1.3 versicolor
 7          6.6         3            4.4         1.4 versicolor
 8          6.8         2.8          4.8         1.4 versicolor
 9          5.8         2.7          4.1         1   versicolor
10          6.4         3.2          4.5         1.5 versicolor
# ... with 17 more rows
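
The question also asked for sampling weights, which the calls above leave out. A minimal sketch of the same approach with the weights added, assuming Sepal.Length holds the weights as in the question; the names props and weighted_sample and the fixed seed are only for illustration:

```r
library(dplyr)
library(purrr)

set.seed(42)  # assumption: fixed seed only so the draw is reproducible

# One proportion per Species, in factor-level order
# (setosa, versicolor, virginica), which is the order group_split() returns
props <- c(0.1, 0.2, 0.25)

weighted_sample <- iris %>%
  group_split(Species) %>%
  # .x is one species' tibble, .y is the matching proportion;
  # weight_by is evaluated inside each tibble
  map2_dfr(props, ~ slice_sample(.x, prop = .y, weight_by = Sepal.Length))

# Rows drawn per species
count(weighted_sample, Species)
```

Note that the proportions are matched to the groups by position, so the vector has to follow the order of the factor levels.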



Answer 2:


A similar solution using purrr.

First we specify our proportions for each Species.

props <- c(setosa=0.1, versicolor=0.2, virginica=0.5)

Then we iterate over each name-value pair in props using imap. For each pair, we filter the rows of the data frame to keep only that species, and then use slice_sample to draw the proportion specified for it.

imap_dfr(props,
         ~filter(iris, Species==.y) %>%
           slice_sample(prop=.x))

Using imap_dfr then puts together the three data frames (one for each species) into a single data frame.
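
To make the roles of .x and .y concrete, here is a small sketch that first prints what imap hands to the function for each element of props, and then checks the per-species row counts of the sample. The helper names pairs, sampled and counts are only for illustration:

```r
library(dplyr)
library(purrr)

props <- c(setosa = 0.1, versicolor = 0.2, virginica = 0.5)

# imap() calls the function with .x = the value and .y = its name
pairs <- imap_chr(props, ~ paste0(.y, ": ", .x))
print(pairs)

# Same sampling as above: filter one species, then draw its proportion
sampled <- imap_dfr(props,
                    ~ filter(iris, Species == .y) %>%
                      slice_sample(prop = .x))

# Rows drawn per species
counts <- table(sampled$Species)
print(counts)
```

Because the proportions are looked up by name, the order of the entries in props does not have to match the order of the factor levels.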

Here's the result:

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           4.8         3.1          1.6         0.2     setosa
2           5.0         3.5          1.3         0.3     setosa
3           5.1         3.8          1.6         0.2     setosa
4           5.0         3.6          1.4         0.2     setosa
5           4.9         3.1          1.5         0.2     setosa
6           6.7         3.1          4.7         1.5 versicolor
7           5.7         2.8          4.1         1.3 versicolor
8           6.1         3.0          4.6         1.4 versicolor
9           5.6         3.0          4.5         1.5 versicolor
10          6.6         2.9          4.6         1.3 versicolor
11          5.5         2.6          4.4         1.2 versicolor
12          6.7         3.0          5.0         1.7 versicolor
13          5.7         2.6          3.5         1.0 versicolor
14          5.9         3.2          4.8         1.8 versicolor
15          5.4         3.0          4.5         1.5 versicolor
16          5.8         2.8          5.1         2.4  virginica
17          6.7         3.3          5.7         2.1  virginica
18          7.4         2.8          6.1         1.9  virginica
19          6.4         2.8          5.6         2.1  virginica
20          6.7         3.1          5.6         2.4  virginica
21          6.1         3.0          4.9         1.8  virginica
22          6.0         2.2          5.0         1.5  virginica
23          6.3         2.7          4.9         1.8  virginica
24          6.3         2.8          5.1         1.5  virginica
25          7.2         3.2          6.0         1.8  virginica
26          7.7         2.6          6.9         2.3  virginica
27          5.8         2.7          5.1         1.9  virginica
28          4.9         2.5          4.5         1.7  virginica
29          6.7         3.0          5.2         2.3  virginica
30          7.7         3.8          6.7         2.2  virginica
31          6.9         3.1          5.4         2.1  virginica
32          5.8         2.7          5.1         1.9  virginica
33          6.8         3.0          5.5         2.1  virginica
34          6.3         2.5          5.0         1.9  virginica
35          6.9         3.1          5.1         2.3  virginica
36          6.3         3.3          6.0         2.5  virginica
37          7.6         3.0          6.6         2.1  virginica
38          6.5         3.0          5.5         1.8  virginica
39          7.7         2.8          6.7         2.0  virginica
40          6.5         3.2          5.1         2.0  virginica



Answer 3:


You can store the proportions in the data frame itself and sample rows based on them.

library(dplyr)

iris %>%
  distinct(Species) %>%
  mutate(prop = c(0.1, 0.2, 0.25)) %>%
  inner_join(iris, by = 'Species') %>%
  group_by(Species) %>%
  sample_n(first(prop)*n()) -> result

result %>% count(Species)

#  Species        n
#  <fct>      <int>
#1 setosa         5
#2 versicolor    10
#3 virginica     12

I expected slice_sample(prop = first(prop)) to work, but it doesn't, so I used sample_n instead.
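
One possible way to keep slice_sample and the weights in a single dplyr pipe is to look up each group's proportion from a named vector inside group_modify. This is only a sketch: the props vector, the name result_weighted and the fixed seed are assumptions for illustration, and Sepal.Length is used as the weight column as in the question.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed only so the draw is reproducible

# Hypothetical lookup table: one proportion per species, matched by name
props <- c(setosa = 0.1, versicolor = 0.2, virginica = 0.25)

result_weighted <- iris %>%
  group_by(Species) %>%
  # group_modify() passes .x = the rows of one group (without Species)
  # and .y = a one-row tibble holding that group's key
  group_modify(~ slice_sample(
    .x,
    prop = props[[as.character(.y$Species)]],
    weight_by = Sepal.Length
  )) %>%
  ungroup()

# Rows drawn per species
count(result_weighted, Species)
```

Here slice_sample receives a single number for prop in every call, because the lookup happens once per group before the sampling.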



Source: https://stackoverflow.com/questions/65272564/how-can-i-draw-a-random-sample-from-a-dataset-proportionate-to-size-based-on-d
