If speed is a concern, joining the data with the cross product of all unique days and products (built with data.table's CJ()) could be an option to fill in the missing levels (see section 3.5.5 of Frank's Quick R Tutorial):
library(data.table)
# CJ() builds a data.table with all unique day/product combinations;
# joining df on this complete grid adds NA-sales rows for the missing
# pairs, which are then replaced by 0 via update by reference
setDT(df)[CJ(day = day, product = product, unique = TRUE), on = .(day, product)][
  is.na(sales), sales := 0.0][]
   day product      sales
1:   a       1 0.57406950
2:   a       2 0.04390324
3:   a       3 0.63809278
4:   b       1 0.00000000
5:   b       2 0.01203568
6:   b       3 0.61310815
7:   c       1 0.19049274
8:   c       2 0.61758172
9:   c       3 0.00000000
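To see what the join target looks like, CJ() on its own returns the complete, keyed grid of combinations; a minimal sketch using the toy values from above:
CJ(day = c("a", "b", "c"), product = 1:3, unique = TRUE)
#    day product
# 1:   a       1
# 2:   a       2
# 3:   a       3
# 4:   b       1
# ...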
Benchmark
Create benchmark data of 1 million rows, then remove 10% of them, leaving 0.9 M rows:
n_day <- 1e3L
n_prod <- 1e3L
n_rows <- n_day * n_prod
# how many rows to remove?
n_miss <- n_rows / 10L
set.seed(1L)
df <- expand.grid(day = 1:n_day, product = 1:n_prod)
df$sales <- runif(n_rows)
# remove randomly sampled rows
df <- df[-sample.int(n_rows, n_miss), ]
str(df)
'data.frame': 900000 obs. of 3 variables:
 $ day    : int 1 2 3 5 6 7 8 9 11 12 ...
 $ product: int 1 1 1 1 1 1 1 1 1 1 ...
 $ sales  : num 0.266 0.372 0.573 0.202 0.898 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int 1000 1000
  .. ..- attr(*, "names")= chr "day" "product"
  ..$ dimnames:List of 2
  .. ..$ day    : chr "day= 1" "day= 2" "day= 3" "day= 4" ...
  .. ..$ product: chr "product= 1" "product= 2" "product= 3" "product= 4" ...
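As a quick sanity check (not part of the original post), the same join can count how many combinations are actually absent from df:
library(data.table)
# combinations of the full grid without a match in df have NA sales
miss <- setDT(df)[CJ(day = day, product = product, unique = TRUE),
                  on = .(day, product)][is.na(sales)]
nrow(miss)  # 100000, i.e., exactly n_miss combinations are missing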
Define a check function that verifies all benchmarked expressions return the same result:
my_check <- function(values) {
  # coerce to data.frame before comparing, as the expressions return
  # objects of different classes (plain data frame vs. data.table)
  all(sapply(values[-1], function(x) identical(as.data.frame(values[[1]]), as.data.frame(x))))
}
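The coercion via as.data.frame() matters because identical() would otherwise fail on the class attribute alone. A minimal sketch with made-up inputs:
# same contents, different classes: the check still passes
my_check(list(data.frame(x = 1:2), data.table::data.table(x = 1:2)))
#> [1] TRUE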
Run benchmarks:
library(data.table)
microbenchmark::microbenchmark(
tidyr = tidyr::complete(df, day, product, fill = list(sales = 0)),
dt = setDT(df)[CJ(day = day, product = product, unique = TRUE), on = .(day, product)][
is.na(sales), sales := 0.0][],
times = 3L,
check = my_check
)
Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval cld
 tidyr 1253.3395 1258.0595 1323.5438 1262.7794 1358.6459 1454.5124     3   b
    dt   94.4451  100.2952  155.4575  106.1452  185.9638  265.7823     3  a
For the given problem size of 1 M rows minus 10% missing, the tidyr solution is about an order of magnitude slower than the data.table approach.
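As a side note, the NA replacement can also be folded into the join itself using data.table's fcoalesce(); a sketch of this untimed variant, where the x. and i. prefixes pick columns from df and from the cross join, respectively:
library(data.table)
setDT(df)[CJ(day = day, product = product, unique = TRUE),
          on = .(day, product),
          # grid columns from i, sales from df with NA replaced by 0
          .(day = i.day, product = i.product, sales = fcoalesce(x.sales, 0))]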