I am working with a huge data table in R containing monthly measurements of temperature for multiple locations, taken by different sources.
The dataset looks like this:<
I don't think you generated your test data correctly. The function expand.grid() takes a cartesian product of all arguments. I'm not sure why you included the Temperature=temp
argument in the expand.grid()
call; that duplicates each temperature value for every single key combination, resulting in a data.table with 9 million rows (this is (10*60*5)^2
). I think you intended one temperature value per key, which should result in 10*60*5
rows:
df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp);
df;
## Location Date Model Temperature
## 1: 1 2000-01-01 A 2.469751
## 2: 2 2000-01-01 A 16.103135
## 3: 3 2000-01-01 A 7.147051
## 4: 4 2000-01-01 A 10.301937
## 5: 5 2000-01-01 A 16.760238
## ---
## 2996: 6 2004-12-01 E 26.293968
## 2997: 7 2004-12-01 E 8.446528
## 2998: 8 2004-12-01 E 29.003001
## 2999: 9 2004-12-01 E 12.076765
## 3000: 10 2004-12-01 E 28.410980
If this is correct, you can generate the means across models with this:
df[,.(Mean=mean(Temperature)),.(Location,Date)];
## Location Date Mean
## 1: 1 2000-01-01 9.498497
## 2: 2 2000-01-01 11.744622
## 3: 3 2000-01-01 15.691228
## 4: 4 2000-01-01 11.457154
## 5: 5 2000-01-01 8.897931
## ---
## 596: 6 2004-12-01 17.587000
## 597: 7 2004-12-01 19.555963
## 598: 8 2004-12-01 15.710465
## 599: 9 2004-12-01 15.322790
## 600: 10 2004-12-01 20.240392
Note that the :=
operator does not actually aggregate. It only adds, modifies, or deletes columns in the original data.table. It is possible to add a new column (or overwrite an old column) with duplications of an aggregated calculation (e.g. see http://www.r-bloggers.com/two-of-my-favorite-data-table-features/), but that's not what you want.
In general, when you aggregate a table of data, you are necessarily producing a new table that is reduced to one row per aggregation key. The :=
operator does not do this.
Instead, we need to run a normal index operation on the data.table, grouping by the required aggregation key (which will automatically be included in the output data.table), and add to that the j
argument which will be evaluated once for each group. The result will be a reduced version of the original table, with the results of all j
argument evaluations merged with their respective aggregation keys. Since our j
argument results in a scalar value for each group, our result will be one row per Location
/Date
aggregation key.
If we are using data.table
, the CJ
can be used
CJ(Location=loc, date= dates,Model= mods)[,
Temperature:= temp][, .(Mean = mean(Temperature)), by = .(Location, date)]