subsetting in data.table


I am trying to subset a data.table (from the package data.table) in R (not a data.frame). I have a 4-digit year as a key. I would like to subset by taking a series of ye

4 answers

孤街浪徒

2020-12-08 22:12
What works for data.frames works for data.tables.
```
subset(DT, year %in% 1999:2001)
```
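A minimal sketch of this on made-up data; the table `DT` and its `value` column are assumptions for illustration, not the asker's data.

```r
library(data.table)

# Hypothetical data: one row per year from 1995 to 2005
DT <- data.table(year = 1995:2005, value = seq_len(11))

# subset() has a data.table method, so the result is still a data.table
res <- subset(DT, year %in% 1999:2001)
res$year
is.data.table(res)
```

No key is needed for this form; `%in%` simply scans the `year` column.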
春和景丽

2020-12-08 22:22
The question is not clear and does not provide sufficient data to work with, but it is useful, so anyone is welcome to edit it with the data I provide below. The title of the post could also be made more complete: Matthew Dowle often answers the subsetting-over-two-vectors question, but less frequently the question of subsetting with an %in% statement on one vector. I searched for an answer for a while, until I found one for character vectors here.

Let's consider this data :
```
library(data.table)
n <- 100
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
```
The data.table-style query corresponding to X[X$a %in% c(10,20),] is somewhat surprising:
```
setkey(X,a)
X[.(c(10,20))]
X[.(10,20)] # works for characters but not for integers
            # instead, treats 10 as the filter
            # and 20 as a new variable

# for comparison :
X[X$a %in% c(10,20),]
```
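The difference between the two `.()` forms can be checked on a small deterministic table. This is only a sketch: the stand-in data below (each value of `a` repeated 20 times) is an assumption chosen so the row counts are predictable.

```r
library(data.table)

# Deterministic stand-in for X: each value of a appears 20 times
X <- data.table(a = rep(c(10, 20, 25, 30, 40), times = 20), b = 1:100)
setkey(X, a)

by_join <- X[.(c(10, 20))]       # one lookup column holding two values
by_in   <- X[a %in% c(10, 20)]   # vector scan, works with or without a key

# Both approaches select the same rows
identical(sort(by_join$b), sort(by_in$b))

# Two arguments to .() build a two-column lookup table; with a one-column key
# only the first column is matched, so this keeps just the rows where a == 10
pitfall <- X[.(10, 20)]
unique(pitfall$a)
```

In other words, `.(c(10,20))` means "two values for the key column", while `.(10,20)` means "one value for each of two columns".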
Now, which is best? If your key is already set, the data.table join, obviously. Otherwise it might not be, as the following timings show (on my computer with 1.75 GB of RAM):
```
n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
system.time(X[X$a %in% c(10,20),])
# utilisateur     système      écoulé (yes, I'm French) 
#        1.92        0.06        1.99
system.time(setkey(X,a))
# utilisateur     système      écoulé 
#       34.91        0.05       35.23 
system.time(X[J(c(10,20))])
# utilisateur     système      écoulé 
#        0.15        0.08        0.23
```
But maybe Matthew has better solutions...

[Matthew] You've discovered that sorting type numeric (a.k.a. double) is much slower than integer. For many years we didn't allow double in keys for fear of users falling into this trap and reporting terrible timings like this. We allowed double in keys with some trepidation because fast sorting isn't implemented for double yet. Fast sorting on integer and character is pretty good because those are done using a counting sort. ~~Hopefully we'll get to fast sorting numeric one day!~~ (Now implemented - see below).

Timings on data.table pre-1.9.0
```
n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)      
system.time(setkey(X,a))
#   user  system elapsed 
# 13.898   0.138  14.216 

X <- data.table(a=sample(as.integer(c(10,20,25,30,40)),n,replace=TRUE),b=1:n)
system.time(setkey(X,a))
#   user  system elapsed 
#  0.381   0.019   0.408 
```
Remember that 2 is type numeric in R by default; 2L is integer. Although data.table accepts numeric, it still much prefers integer.
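As a rough illustration of the point about types, the sketch below checks the class of the two literals and coerces the sampled values to integer before keying. The small `n` and the use of `as.integer()` around `sample()` are choices made for this example only.

```r
library(data.table)

class(2)    # "numeric"
class(2L)   # "integer"

# Hypothetical table: coerce the sampled values to integer before keying
n <- 1000
X <- data.table(a = as.integer(sample(c(10, 20, 25, 30, 40), n, replace = TRUE)),
                b = 1:n)
setkey(X, a)

is.integer(X$a)
key(X)
```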

Fast radix sort for numerics has been implemented since v1.9.0.

From v1.9.0 on
```
n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)      
system.time(setkey(X,a))
#    user  system elapsed 
#   0.832   0.026   0.871 
```
花落未央

2020-12-08 22:23

Like the above, but more data.table-esque:
```
DT[year %in% c(1999, 2000, 2001)]
```

滥情空心

2020-12-08 22:25
This will work:
```
sample_DT = data.table(year = rep(1990:2010, length.out = 1000), 
                       random_number = rnorm(1000), key = "year")
year_subset = sample_DT[J(c(1990, 1995, 1997))]
```
Similarly, you can key an already existing data.table with setkey(existing_DT, year) and then use the J() syntax as shown above.

I think the problem may be that you didn't key the data first.
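For the already-existing-table case, a sketch might look like the following. `existing_DT` is a hypothetical table built here for illustration, and the lookup values are written as integers (`1990L`) to match the integer `year` column, which is a choice of this example rather than a requirement stated above.

```r
library(data.table)

# Hypothetical pre-existing, unkeyed table
existing_DT <- data.table(year = rep(1990:2010, length.out = 1000),
                          random_number = rnorm(1000))
haskey(existing_DT)            # FALSE before keying

setkey(existing_DT, year)      # sorts by reference and marks year as the key
haskey(existing_DT)            # TRUE after keying

# Keyed lookup of three years
year_subset <- existing_DT[J(c(1990L, 1995L, 1997L))]
unique(year_subset$year)
```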
