The approach you have taken seems very inefficient because you are querying a single value from the dataset many times over. It would be much more efficient to query all of them at once and then loop over the whole batch, instead of performing 1e4 queries one by one.
See dt2 below for a vectorized approach. Still, it is hard for me to imagine the use case for that.
Another thing: 450K rows of data is quite small for a meaningful benchmark; you may get totally different results at 4M rows or more. With the hash approach you would probably also hit memory limits sooner.
Additionally, Sys.time() may not be the best way to measure timing; read about the gc argument in ?system.time.
Here is the benchmark I've made using the system.nanotime() function from the microbenchmarkCore package. It is possible to speed up the data.table approach even further by collapsing test_lookup_list into a data.table and performing a merge against test_lookup_dt, but for a fair comparison the hash solution would need the same preprocessing.
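As a rough, self-contained sketch of that merge idea (with toy data standing in for test_lookup_dt; the names here are illustrative only):

```r
library(data.table)
# Toy stand-in for test_lookup_dt: one keyed row per product
dt <- data.table(product_id = c("AAA", "BBB", "CCC"),
                 val_1 = c(1.5, -0.2, 0.7), key = "product_id")
# Collapse the wanted keys into a keyed data.table and join once,
# instead of issuing one single-row subset per key
keys_dt <- data.table(product_id = c("CCC", "AAA"), key = "product_id")
merged  <- dt[keys_dt]  # one vectorized binary-search join
# Note: setting the key sorts keys_dt, so rows come back in key order
```

The same pattern applies to the real objects: collapse the 1e4 lookup keys into a keyed table and join it to test_lookup_dt in a single call.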
library(microbenchmarkCore) # install.packages("microbenchmarkCore", repos="http://olafmersmann.github.io/drat")
library(data.table)
library(hash)
# Set seed to 42 to ensure repeatability
set.seed(42)
# Setting up test ------
# Generate product ids
product_ids = as.vector(
    outer(LETTERS[seq(1, 26, 1)],
          outer(outer(LETTERS[seq(1, 26, 1)], LETTERS[seq(1, 26, 1)], paste, sep = ""),
                LETTERS[seq(1, 26, 1)], paste, sep = ""
          ), paste, sep = ""
    )
)
# Create test lookup data
test_lookup_list = lapply(product_ids, function(id) list(
product_id = id,
val_1 = rnorm(1),
val_2 = rnorm(1),
val_3 = rnorm(1),
val_4 = rnorm(1),
val_5 = rnorm(1),
val_6 = rnorm(1),
val_7 = rnorm(1),
val_8 = rnorm(1)
))
# Set names of items in list
names(test_lookup_list) = sapply(test_lookup_list, `[[`, "product_id")
# Create lookup hash
lookup_hash = hash(names(test_lookup_list), test_lookup_list)
# Create data.table from list and set key of data.table to product_id field
test_lookup_dt <- rbindlist(test_lookup_list)
setkey(test_lookup_dt, product_id)
# Generate sample of keys to be used for speed testing
lookup_tests = lapply(1:10, function(x) sample(test_lookup_dt$product_id, 1e4))
# Time 1e4 single-key lookups per run against each structure
native = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) test_lookup_list[[lookup]]))
dt1 = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) test_lookup_dt[lookup]))
hash = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) lookup_hash[[lookup]]))
# dt2: vectorized -- subset all 1e4 keys in a single call
dt2 = lapply(lookup_tests, function(lookups) system.nanotime(test_lookup_dt[lookups][, .SD, 1:length(product_id)]))
summary(sapply(native, `[[`, 3L))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 27.65 28.15 28.47 28.97 28.78 33.45
summary(sapply(dt1, `[[`, 3L))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 15.30 15.73 15.96 15.96 16.29 16.52
summary(sapply(hash, `[[`, 3L))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.1209 0.1216 0.1221 0.1240 0.1225 0.1426
summary(sapply(dt2, `[[`, 3L))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.02421 0.02438 0.02445 0.02476 0.02456 0.02779
For a non-vectorized access pattern, you might want to try the built-in environment objects:
require(microbenchmark)
test_lookup_env <- list2env(test_lookup_list)
x <- lookup_tests[[1]][1]
microbenchmark(
lookup_hash[[x]],
test_lookup_list[[x]],
test_lookup_dt[x],
test_lookup_env[[x]]
)
Here you can see it's even zippier than hash:
Unit: microseconds
expr min lq mean median uq max neval
lookup_hash[[x]] 10.767 12.9070 22.67245 23.2915 26.1710 68.654 100
test_lookup_list[[x]] 847.700 853.2545 887.55680 863.0060 893.8925 1369.395 100
test_lookup_dt[x] 2652.023 2711.9405 2771.06400 2758.8310 2803.9945 3373.273 100
test_lookup_env[[x]] 1.588 1.9450 4.61595 2.5255 6.6430 27.977 100
EDIT:
Stepping through data.table:::`[.data.table` shows why you are seeing dt slow down. When you index with a character and a key is set, it does quite a bit of bookkeeping and then drops down into bmerge, which is a binary search. Binary search is O(log n) and gets slower as n increases.
Environments, on the other hand, use hashing (by default) and have constant access time with respect to n.
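A quick, self-contained sketch of that constant-time behavior (the key names and sizes here are arbitrary, chosen just for illustration): lookup cost in a hashed environment barely moves even when the environment is a thousand times larger.

```r
library(microbenchmark)
# Environment lookups are hashed, so access time is ~constant in n
small <- list2env(setNames(as.list(1:1e3), paste0("k", 1:1e3)))
big   <- list2env(setNames(as.list(1:1e6), paste0("k", 1:1e6)))
# Both lookups hash the key once; the environment's size is irrelevant
microbenchmark(small[["k500"]], big[["k500000"]])
```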
To work around this, you can manually build a map and index through it:
x <- lookup_tests[[2]][2]
e <- list2env(setNames(as.list(1:nrow(test_lookup_dt)), test_lookup_dt$product_id))
#example access:
test_lookup_dt[e[[x]], ]
However, seeing so much bookkeeping code in the data.table method, I'd try out plain old data.frames as well:
test_lookup_df <- as.data.frame(test_lookup_dt)
rownames(test_lookup_df) <- test_lookup_df$product_id
If we are really paranoid, we could skip the `[` methods altogether and lapply over the columns directly.
Here are some more timings (from a different machine than above):
> microbenchmark(
+ test_lookup_dt[x,],
+ test_lookup_dt[x],
+ test_lookup_dt[e[[x]],],
+ test_lookup_df[x,],
+ test_lookup_df[e[[x]],],
+ lapply(test_lookup_df, `[`, e[[x]]),
+ lapply(test_lookup_dt, `[`, e[[x]]),
+ lookup_hash[[x]]
+ )
Unit: microseconds
expr min lq mean median uq max neval
test_lookup_dt[x, ] 1658.585 1688.9495 1992.57340 1758.4085 2466.7120 2895.592 100
test_lookup_dt[x] 1652.181 1695.1660 2019.12934 1764.8710 2487.9910 2934.832 100
test_lookup_dt[e[[x]], ] 1040.869 1123.0320 1356.49050 1280.6670 1390.1075 2247.503 100
test_lookup_df[x, ] 17355.734 17538.6355 18325.74549 17676.3340 17987.6635 41450.080 100
test_lookup_df[e[[x]], ] 128.749 151.0940 190.74834 174.1320 218.6080 366.122 100
lapply(test_lookup_df, `[`, e[[x]]) 18.913 25.0925 44.53464 35.2175 53.6835 146.944 100
lapply(test_lookup_dt, `[`, e[[x]]) 37.483 50.4990 94.87546 81.2200 124.1325 241.637 100
lookup_hash[[x]] 6.534 15.3085 39.88912 49.8245 55.5680 145.552 100
Overall, to answer your questions: you are not using data.table "wrong", but you are also not using it in the way it was intended (vectorized access). However, you can manually build a map to index through and recover most of the performance.