If speed is a concern, joining the data with the cross product of all unique days and products (built with data.table's CJ()) could be an option to fill in the missing levels (see section 3.5.5 of Frank's Quick R Tutorial):
library(data.table)
# CJ() builds a data.table with all unique day/product combinations;
# joining df on this complete grid adds NA-sales rows for the missing
# pairs, which are then replaced by 0 via update by reference
setDT(df)[CJ(day = day, product = product, unique = TRUE), on = .(day, product)][
  is.na(sales), sales := 0.0][]
   day product      sales
1:   a       1 0.57406950
2:   a       2 0.04390324
3:   a       3 0.63809278
4:   b       1 0.00000000
5:   b       2 0.01203568
6:   b       3 0.61310815
7:   c       1 0.19049274
8:   c       2 0.61758172
9:   c       3 0.00000000
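To see what the join target looks like, CJ() on its own returns the complete, keyed grid of combinations; a minimal sketch using the toy values from above:
CJ(day = c("a", "b", "c"), product = 1:3, unique = TRUE)
#    day product
# 1:   a       1
# 2:   a       2
# 3:   a       3
# 4:   b       1
# ...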
Benchmark
Create benchmark data of 1 million rows, then remove 10% of them, leaving 0.9 M rows:
n_day <- 1e3L
n_prod <- 1e3L
n_rows <- n_day * n_prod
# how many rows to remove?
n_miss <- n_rows / 10L
set.seed(1L)
df <- expand.grid(day = 1:n_day, product = 1:n_prod)
df$sales <- runif(n_rows)
# remove randomly sampled rows
df <- df[-sample.int(n_rows, n_miss), ]
str(df)
'data.frame': 900000 obs. of 3 variables:
 $ day    : int 1 2 3 5 6 7 8 9 11 12 ...
 $ product: int 1 1 1 1 1 1 1 1 1 1 ...
 $ sales  : num 0.266 0.372 0.573 0.202 0.898 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int 1000 1000
  .. ..- attr(*, "names")= chr "day" "product"
  ..$ dimnames:List of 2
  .. ..$ day    : chr "day= 1" "day= 2" "day= 3" "day= 4" ...
  .. ..$ product: chr "product= 1" "product= 2" "product= 3" "product= 4" ...
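As a quick sanity check (not part of the original post), the same join can count how many combinations are actually absent from df:
library(data.table)
# combinations of the full grid without a match in df have NA sales
miss <- setDT(df)[CJ(day = day, product = product, unique = TRUE),
                  on = .(day, product)][is.na(sales)]
nrow(miss)  # 100000, i.e., exactly n_miss combinations are missing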
Define a check function that verifies all benchmarked expressions return the same result:
my_check <- function(values) {
  # coerce to data.frame before comparing, as the expressions return
  # objects of different classes (plain data frame vs. data.table)
  all(sapply(values[-1], function(x) identical(as.data.frame(values[[1]]), as.data.frame(x))))
}
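The coercion via as.data.frame() matters because identical() would otherwise fail on the class attribute alone. A minimal sketch with made-up inputs:
# same contents, different classes: the check still passes
my_check(list(data.frame(x = 1:2), data.table::data.table(x = 1:2)))
#> [1] TRUE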
Run benchmarks:
library(data.table)
microbenchmark::microbenchmark(
tidyr = tidyr::complete(df, day, product, fill = list(sales = 0)),
dt = setDT(df)[CJ(day = day, product = product, unique = TRUE), on = .(day, product)][
is.na(sales), sales := 0.0][],
times = 3L,
check = my_check
)
Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval cld
 tidyr 1253.3395 1258.0595 1323.5438 1262.7794 1358.6459 1454.5124     3   b
    dt   94.4451  100.2952  155.4575  106.1452  185.9638  265.7823     3  a
For the given problem size of 1 M rows minus 10% missing, the tidyr solution is about an order of magnitude slower than the data.table approach.
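As a side note, the NA replacement can also be folded into the join itself using data.table's fcoalesce(); a sketch of this untimed variant, where the x. and i. prefixes pick columns from df and from the cross join, respectively:
library(data.table)
setDT(df)[CJ(day = day, product = product, unique = TRUE),
          on = .(day, product),
          # grid columns from i, sales from df with NA replaced by 0
          .(day = i.day, product = i.product, sales = fcoalesce(x.sales, 0))]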