Summarize data.table by group


I am working with a huge data table in R containing monthly measurements of temperature for multiple locations, taken by different sources.

The dataset looks like this:


2 Answers

情深已故

2021-02-15 12:54
I don't think you generated your test data correctly. The function expand.grid() takes the Cartesian product of all of its arguments. I'm not sure why you passed Temperature=temp inside the expand.grid() call; doing so repeats every temperature value for every key combination, which gives a data.table with 9 million rows (that is, (10*60*5)^2). I think you intended one temperature value per key combination, which should give 10*60*5 = 3000 rows:
```
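library(data.table)
# loc, dates, mods and temp are the vectors defined in the question.
# Only the keys go through expand.grid(); temp is attached afterwards as a column.
df <- data.table(expand.grid(Location=loc, Date=dates, Model=mods), Temperature=temp)
df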
##       Location       Date Model Temperature
##    1:        1 2000-01-01     A    2.469751
##    2:        2 2000-01-01     A   16.103135
##    3:        3 2000-01-01     A    7.147051
##    4:        4 2000-01-01     A   10.301937
##    5:        5 2000-01-01     A   16.760238
##   ---
## 2996:        6 2004-12-01     E   26.293968
## 2997:        7 2004-12-01     E    8.446528
## 2998:        8 2004-12-01     E   29.003001
## 2999:        9 2004-12-01     E   12.076765
## 3000:       10 2004-12-01     E   28.410980
```
If this is correct, you can generate the means across models with this:
```
# j computes the mean; by lists the grouping columns (one output row per Location/Date)
df[, .(Mean = mean(Temperature)), by = .(Location, Date)]
##      Location       Date      Mean
##   1:        1 2000-01-01  9.498497
##   2:        2 2000-01-01 11.744622
##   3:        3 2000-01-01 15.691228
##   4:        4 2000-01-01 11.457154
##   5:        5 2000-01-01  8.897931
##  ---
## 596:        6 2004-12-01 17.587000
## 597:        7 2004-12-01 19.555963
## 598:        8 2004-12-01 15.710465
## 599:        9 2004-12-01 15.322790
## 600:       10 2004-12-01 20.240392
```
Note that the := operator does not actually aggregate. It only adds, modifies, or deletes columns in the original data.table. It is possible to add a new column (or overwrite an old column) with duplications of an aggregated calculation (e.g. see http://www.r-bloggers.com/two-of-my-favorite-data-table-features/), but that's not what you want.

In general, when you aggregate a table of data, you are necessarily producing a new table that is reduced to one row per aggregation key. The := operator does not do this.
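To make that concrete, here is a minimal sketch of what := does when combined with a grouping. The table `dt` and the column name `GroupMean` are hypothetical toy names chosen for illustration, not part of the question's data.

```r
library(data.table)

# Hypothetical toy data, not the question's dataset
dt <- data.table(
  Location    = c(1, 1, 2, 2),
  Model       = c("A", "B", "A", "B"),
  Temperature = c(10, 20, 30, 50)
)
n_before <- nrow(dt)

# := adds the column by reference: the table keeps all of its rows,
# and each group's mean is repeated on every row of that group
dt[, GroupMean := mean(Temperature), by = Location]
print(dt)
```

The table still has four rows after the call, which is why := alone cannot give one row per key.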

Instead, we need to run a normal index operation on the data.table, grouping by the required aggregation key (which will automatically be included in the output data.table), and add to that the j argument which will be evaluated once for each group. The result will be a reduced version of the original table, with the results of all j argument evaluations merged with their respective aggregation keys. Since our j argument results in a scalar value for each group, our result will be one row per Location/Date aggregation key.
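
The grouping behaviour described above can be checked on a small table. This is only a sketch: `toy`, its values, and the extra `N` column are assumptions made up for illustration, with temperatures chosen so the group means are easy to verify by hand.

```r
library(data.table)

# Hypothetical toy data: 2 locations x 2 dates x 3 models = 12 rows
toy <- data.table(
  Location    = rep(1:2, each = 6),
  Date        = rep(rep(as.Date(c("2000-01-01", "2000-02-01")), each = 3), times = 2),
  Model       = rep(c("A", "B", "C"), times = 4),
  Temperature = 1:12
)

# j is evaluated once per (Location, Date) group; every evaluation
# returns scalars, so the result has one row per group
res <- toy[, .(Mean = mean(Temperature), N = .N), by = .(Location, Date)]
print(res)
```

The 12 input rows collapse to 4 output rows, and the grouping columns appear in the result without being listed in j.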

梦毁少年i

2021-02-15 13:07

If we are using data.table, CJ() (cross join) can be used to build the key combinations:

    CJ(Location = loc, date = dates, Model = mods)[
      , Temperature := temp][
      , .(Mean = mean(Temperature)), by = .(Location, date)]
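
One point worth checking when switching between the two approaches: CJ() and expand.grid() return the same combinations in a different row order. The sketch below uses small hypothetical keys (not the question's vectors) to show this. If temp is random test data the pairing does not matter, but a vector prepared for one layout would be attached to different keys in the other.

```r
library(data.table)

# Hypothetical small keys for illustration
loc  <- 1:2
mods <- c("A", "B")

# CJ() returns a sorted, keyed data.table: the last column varies fastest
cj <- CJ(Location = loc, Model = mods)

# expand.grid() varies the first column fastest
eg <- data.table(expand.grid(Location = loc, Model = mods,
                             stringsAsFactors = FALSE))

print(cj)
print(eg)
```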
