Faster reading of time series from netCDF?

失恋的感觉 2020-12-29 09:10

I have some large netCDF files that contain 6 hourly data for the earth at 0.5 degree resolution.

There are 360 latitude points, 720 longitude points, and 1420 time

3 Answers
  • 2020-12-29 09:37

    EDIT: The original question had a mistake, but the overhead of starting a read may also differ between the two cases, so it's fair to run multiple replications. rbenchmark makes that easy.

    The example file is rather large, so I've used a smaller one. Can you make the same comparison with your file?

    More accessible example file: ftp://ftp.cdc.noaa.gov/Datasets/noaa.oisst.v2/sst.wkmean.1990-present.nc

    For me, reading the time series takes roughly twice as long as reading the spatial chunk:

    library(ncdf4)
    library(rbenchmark)

    nc <- nc_open("sst.wkmean.1990-present.nc")

    # Both reads request 10,000 values: one is long in time, the other wide in space.
    benchmark(
      timeseries = ncvar_get(nc, "sst", start = c(1, 1, 50), count = c(10, 10, 100)),
      spacechunk = ncvar_get(nc, "sst", start = c(1, 1, 50), count = c(100, 100, 1)),
      replications = 1000
    )
    ##        test replications elapsed relative user.self sys.self user.child sys.child
    ##2 spacechunk         1000    0.47    1.000      0.43     0.03         NA        NA
    ##1 timeseries         1000    1.04    2.213      0.58     0.47         NA        NA
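
    A rough way to see why the two reads can differ even though both return 10,000 values: the sketch below (plain shell arithmetic, not part of the benchmark) counts the contiguous runs each read needs. It assumes the variable is stored unchunked with lon varying fastest on disk, and that the lon count is smaller than the full lon dimension. read_pattern is a made-up helper name.

```shell
#!/bin/sh
# Describe a hyperslab read from a contiguous (unchunked) variable whose first
# dimension (lon, by assumption) varies fastest on disk.
# Arguments: the three counts passed to ncvar_get (lon, lat, time).
# Prints "<values> <runs> <run length>".
read_pattern() {
    values=$(( $1 * $2 * $3 ))
    runs=$(( $2 * $3 ))      # one contiguous run per (lat, time) pair
    echo "$values $runs $1"
}

timeseries=$(read_pattern 10 10 100)
spacechunk=$(read_pattern 100 100 1)
echo "timeseries: $timeseries"
echo "spacechunk: $spacechunk"
```

    Under those assumptions the time series read needs 1000 runs of 10 values, while the spatial read needs 100 runs of 100 values.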
    
  • 2020-12-29 09:43

    I think the answer to this problem won't be so much re-ordering the data as it will be chunking the data. For a full discussion on the implications of chunking netCDF files, see the following blog posts from Russ Rew, lead netCDF developer at Unidata:

    • Chunking Data: Why it Matters
    • Chunking Data: Choosing Shapes

    The upshot is that while employing different chunking strategies can achieve large increases in access speed, choosing the right strategy is non-trivial.
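
    To get a feel for why the chunk shape matters, here is a back-of-the-envelope sketch that counts how many chunks a read touches. It uses the dimension sizes from the question (720 lon, 360 lat, and 1420 taken to be the number of time steps, which is my reading of the question). chunks_touched is a made-up helper, it assumes the read starts on a chunk boundary, and the two chunk shapes are only illustrations, not recommendations.

```shell
#!/bin/sh
# Count the chunks a hyperslab read touches, assuming the read starts on a
# chunk boundary. Arguments: read counts (lon lat time), then chunk lengths
# (lon lat time).
chunks_touched() {
    # ceil(count / chunk) per dimension, multiplied together
    echo $(( (($1 + $4 - 1) / $4) * (($2 + $5 - 1) / $5) * (($3 + $6 - 1) / $6) ))
}

# Full time series at one grid cell:
series_maps=$(chunks_touched 1 1 1420  720 360 1)      # one chunk per global map
series_cubes=$(chunks_touched 1 1 1420  100 100 100)   # 100 x 100 x 100 chunks

# One full global map at a single time step:
map_maps=$(chunks_touched 720 360 1  720 360 1)
map_cubes=$(chunks_touched 720 360 1  100 100 100)

echo "time series: $series_maps chunks (map-shaped) vs $series_cubes (cubes)"
echo "single map:  $map_maps chunks (map-shaped) vs $map_cubes (cubes)"
```

    With these numbers the time series drops from 1420 chunks to 15, while a single map goes from 1 chunk to 32. Since a chunk generally has to be read as a whole, that trade-off is the sort of thing the articles above weigh up.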

    On the smaller sample dataset, sst.wkmean.1990-present.nc, I saw the following results when using your benchmark command:

    1) Unchunked:

    ##         test replications elapsed relative user.self sys.self user.child sys.child
    ## 2 spacechunk         1000   0.841    1.000     0.812    0.029          0         0
    ## 1 timeseries         1000   1.325    1.576     0.944    0.381          0         0
    

    2) Naively Chunked:

    ##         test replications elapsed relative user.self sys.self user.child sys.child
    ## 2 spacechunk         1000   0.788    1.000     0.788    0.000          0         0
    ## 1 timeseries         1000   0.814    1.033     0.814    0.001          0         0
    

    The naive chunking was simply a shot in the dark; I used the nccopy utility thusly:

    $ nccopy -c"lat/100,lon/100,time/100,nbnds/" sst.wkmean.1990-present.nc chunked.nc

    The Unidata documentation for the nccopy utility can be found here.

    I wish I could recommend a particular strategy for chunking your data set, but it is highly dependent on the data. Hopefully the articles linked above will give you some insight into how you might chunk your data to achieve the results you're looking for!

    Update

    The following blog post by Marcos Hermida shows how different chunking strategies influenced the speed of reading a time series from one particular netCDF file. Treat it only as a jumping-off point.

    • Netcdf-4 Chunking Performance Results on AR-4 3D Data File

    Regarding rechunking via nccopy apparently hanging: the issue appears to be related to the default chunk cache size of 4MB. By increasing that to 4GB (or more), you can reduce the copy time for a large file from over 24 hours to under 11 minutes!

    • Nccopy extremly slow/hangs
    • Unidata JIRA Trouble Ticket System: NCF-85, Improve use of chunk cache in nccopy utility, making it practical for rechunking large files.

    One point I'm not sure about: the discussion in the first link concerns the chunk cache, but the argument passed to nccopy there, -m, specifies the number of bytes in the copy buffer rather than the size of the chunk cache. As far as I can tell from the nccopy documentation, the chunk cache size is set with the separate -h option.
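
    For what it's worth, a rechunking call with an enlarged chunk cache might look like the sketch below. It assumes that nccopy's -h option sets the chunk cache size and accepts a suffix such as G (that is how I read the nccopy man page; -m sets the copy buffer instead). The rechunk function and the NCCOPY variable are made up for illustration; setting NCCOPY=echo prints the command instead of running it.

```shell
#!/bin/sh
# Rechunk a netCDF file with an enlarged chunk cache.
# Arguments: input file, output file, chunk spec, optional cache size
# (defaults to 4G, the value mentioned above).
# NCCOPY is a hypothetical override: NCCOPY=echo gives a dry run.
rechunk() {
    in=$1; out=$2; chunkspec=$3; cache=${4:-4G}
    # -h: chunk cache size (assumption, see above); -c: chunk spec
    "${NCCOPY:-nccopy}" -h "$cache" -c "$chunkspec" "$in" "$out"
}

# Example call (commented out because it needs nccopy and the data file):
# rechunk sst.wkmean.1990-present.nc chunked.nc "lat/100,lon/100,time/100,nbnds/"
```

    Whether 4G is sensible depends on the memory you have available, so treat the default as a starting value only.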

  • 2020-12-29 09:50

    Have you considered using cdo to extract the point?

    cdo remapnn,lon=x/lat=y in.nc point.nc 
    

    Sometimes CDO runs out of memory. If this happens, you might need to loop over the yearly files and then concatenate the separate point files with

    cdo mergetime point_????.nc point_series.nc
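
    The loop over yearly files could be sketched roughly as follows. The file names in_YYYY.nc and point_YYYY.nc are hypothetical, as are the function name and the CDO variable; setting CDO=echo prints the commands instead of running them. The per-year files are collected in an explicit list so that stale point files in the directory are not merged by accident.

```shell
#!/bin/sh
# Extract the grid cell nearest (lon, lat) from each yearly file, then join
# the per-year results into one time series.
# Arguments: lon, lat, first year, last year.
# File naming (in_YYYY.nc, point_YYYY.nc) is an assumption for illustration.
# CDO is a hypothetical override: CDO=echo gives a dry run.
extract_point_series() {
    lon=$1; lat=$2; first=$3; last=$4
    files=""
    yyyy=$first
    while [ "$yyyy" -le "$last" ]; do
        "${CDO:-cdo}" remapnn,lon="$lon"/lat="$lat" "in_${yyyy}.nc" "point_${yyyy}.nc" || return 1
        files="$files point_${yyyy}.nc"
        yyyy=$((yyyy + 1))
    done
    # $files is deliberately unquoted so each file name becomes its own argument
    "${CDO:-cdo}" mergetime $files point_series.nc
}

# Example call (commented out because it needs cdo and the yearly files):
# extract_point_series 10.5 52.25 1990 1999
```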
    