I have some large netCDF files that contain 6-hourly data for the Earth at 0.5 degree resolution.
There are 360 latitude points, 720 longitude points, and 1420 time steps.
EDIT: the original question had a mistake, but there might also be different overheads for starting the read, so it's fair to do multiple reps. rbenchmark
makes that easy.
The example file is a bit massive, so I've used a smaller one; can you make the same comparison with your file?
More accessible example file: ftp://ftp.cdc.noaa.gov/Datasets/noaa.oisst.v2/sst.wkmean.1990-present.nc
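If you want to run the comparison locally, something like this should fetch the sample file first (a minimal sketch; it assumes the FTP link above is still reachable and simply saves the file into the working directory):
url <- "ftp://ftp.cdc.noaa.gov/Datasets/noaa.oisst.v2/sst.wkmean.1990-present.nc"
# download once in binary mode and reuse the local copy afterwards
if (!file.exists("sst.wkmean.1990-present.nc")) {
  download.file(url, destfile = "sst.wkmean.1990-present.nc", mode = "wb")
}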
I get more like twice the time taken for a time series:
library(ncdf4)
nc <- nc_open("sst.wkmean.1990-present.nc")
library(rbenchmark)
benchmark(timeseries = ncvar_get(nc, "sst", start = c(1, 1, 50), count = c(10, 10, 100)),
spacechunk = ncvar_get(nc, "sst", start = c(1, 1, 50), count = c(100, 100, 1)),
replications = 1000)
## test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk 1000 0.47 1.000 0.43 0.03 NA NA
## 1 timeseries 1000 1.04 2.213 0.58 0.47 NA NA
I think the answer to this problem won't be so much re-ordering the data as it will be chunking the data. For a full discussion of the implications of chunking netCDF files, see the following blog posts from Russ Rew, lead netCDF developer at Unidata: "Chunking Data: Why it Matters" and "Chunking Data: Choosing Shapes".
The upshot is that while employing different chunking strategies can achieve large increases in access speed, choosing the right strategy is non-trivial.
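As an aside, if you are creating the netCDF files yourself from R, ncdf4 lets you set the chunking when a variable is defined. A minimal sketch, with made-up dimension sizes and an arbitrary chunk shape (not a recommendation for your data):
library(ncdf4)
# illustrative 0.5 degree grid and a short time axis
lon  <- ncdim_def("lon",  "degrees_east",  seq(0.25, 359.75, by = 0.5))
lat  <- ncdim_def("lat",  "degrees_north", seq(-89.75, 89.75, by = 0.5))
time <- ncdim_def("time", "days since 1990-01-01", 1:100, unlim = TRUE)
# chunksizes follows the order of the dim list: lon, lat, time
sst <- ncvar_def("sst", "degC", list(lon, lat, time), missval = -999,
                 chunksizes = c(100, 100, 10))
# chunking requires the netCDF-4 format, hence force_v4 = TRUE
nc_out <- nc_create("chunked_example.nc", sst, force_v4 = TRUE)
nc_close(nc_out)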
On the smaller sample dataset, sst.wkmean.1990-present.nc, I saw the following results when using your benchmark command:
1) Unchunked:
## test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk 1000 0.841 1.000 0.812 0.029 0 0
## 1 timeseries 1000 1.325 1.576 0.944 0.381 0 0
2) Naively Chunked:
## test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk 1000 0.788 1.000 0.788 0.000 0 0
## 1 timeseries 1000 0.814 1.033 0.814 0.001 0 0
The naive chunking was simply a shot in the dark; I used the nccopy utility as follows:
$ nccopy -c"lat/100,lon/100,time/100,nbnds/" sst.wkmean.1990-present.nc chunked.nc
The Unidata documentation for the nccopy utility can be found here.
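To check what chunking the copy actually ended up with, ncdump can print the hidden storage attributes; -h limits the output to the header and -s shows the special per-variable details such as _ChunkSizes:
$ ncdump -h -s chunked.nc | grep _ChunkSizes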
I wish I could recommend a particular strategy for chunking your data set, but it is highly dependent on the data. Hopefully the articles linked above will give you some insight into how you might chunk your data to achieve the results you're looking for!
The following blog post by Marcos Hermida shows how different chunking strategies influenced the speed when reading a time series for a particular netCDF file. It should be treated only as a jumping-off point.
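For the access pattern in your question (a long time series at a single location), the general idea in that post is to make chunks long in the time dimension and small in space. Purely as an illustration, and assuming your dimensions are named lon, lat and time (adjust the names and numbers to your file):
$ nccopy -c "time/1420,lat/10,lon/10" in.nc rechunked.nc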
Regarding nccopy apparently hanging when rechunking: the issue appears to be related to the default chunk cache size of 4 MB. By increasing that to 4 GB (or more), you can reduce the copy time for a large file from over 24 hours to under 11 minutes!
One point I'm not sure about: in the first link, the discussion is about the chunk cache, but the argument passed to nccopy, -m, specifies the number of bytes in the copy buffer; according to the nccopy documentation, the chunk cache size is set with the separate -h option.
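Putting those together, a rechunking call with a larger chunk cache and copy buffer might look like the line below; the 4G values are only illustrative, and older nccopy builds may require plain byte counts rather than the K/M/G suffixes:
$ nccopy -c "lat/100,lon/100,time/100,nbnds/" -h 4G -m 4G sst.wkmean.1990-present.nc chunked.nc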
Not sure if you have considered using cdo to extract the point?
cdo remapnn,lon=x/lat=y in.nc point.nc
Sometimes CDO runs out of memory; if this happens, you might need to loop over the yearly files and then concatenate the separate point files with
cdo mergetime point_${yyyy}.nc point_series.nc
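A minimal sketch of that loop (the year range, input filenames, and the lon/lat values are all placeholders):
# nearest-neighbour extraction from each yearly file, then merge into one series
for yyyy in $(seq 1990 2010); do
    cdo remapnn,lon=x/lat=y in_${yyyy}.nc point_${yyyy}.nc
done
cdo mergetime point_*.nc point_series.nc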