Creating a new NetCDF from existing NetCDF file while preserving the compression of the original file

Submitted by 我只是一个虾纸丫 on 2019-12-11 09:53:07

Question


I am trying to create a new NetCDF file from an existing NetCDF file, keeping only 12 of its 177 variables. You can find the sample NetCDF file on this FTP site here.

I used the following code from a previous SO answer. You can find it here.

import netCDF4 as nc

file1 = '/media/sf_jason2/cycle_001/JA2_GPN_2PdP001_140_20080717_113355_20080717_123008.nc'
file2 = '/home/sandbox/test.nc'

toinclude = ['lat_20hz', 'lon_20hz', 'time_20hz', 'alt_20hz', 'ice_range_20hz_ku', 'ice_qual_flag_20hz_ku', 'model_dry_tropo_corr', 'model_wet_tropo_corr', 'iono_corr_gim_ku', 'solid_earth_tide', 'pole_tide', 'alt_state_flag_ku_band_status']

with nc.Dataset(file1) as src, nc.Dataset(file2, "w") as dst:
    # copy attributes
    for name in src.ncattrs():
        dst.setncattr(name, src.getncattr(name))
    # copy dimensions
    for name, dimension in src.dimensions.items():
        dst.createDimension(
            name, (len(dimension) if not dimension.isunlimited() else None))
    # copy all file data for variables that are included in the toinclude list
    for name, variable in src.variables.items():
        if name in toinclude:
            x = dst.createVariable(name, variable.datatype, variable.dimensions)
            dst.variables[name][:] = src.variables[name][:]

The issue I am having is that the original file is only 5.3 MB, yet after copying only the selected variables the new file is around 17 MB. The whole point of stripping variables was to decrease the file size, but I am ending up with a larger file.

I have tried using xarray as well. But I am having issues when I am trying to merge multiple variables. The following is the code that I am trying to implement in xarray.

import xarray as xr

fName = '/media/sf_jason2/cycle_001/JA2_GPN_2PdP001_140_20080717_113355_20080717_123008.nc'
file2 = '/home/sandbox/test.nc'
toinclude = ['lat_20hz', 'lon_20hz', 'time_20hz', 'alt_20hz', 'ice_range_20hz_ku', 'ice_qual_flag_20hz_ku', 'model_dry_tropo_corr', 'model_wet_tropo_corr', 'iono_corr_gim_ku', 'solid_earth_tide', 'pole_tide', 'alt_state_flag_ku_band_status']

ds = xr.open_dataset(fName)
newds = xr.Dataset()
newds['lat_20hz'] = ds['lat_20hz']
newds.to_netcdf(file2)

Xarray works fine when I copy a single variable, but it runs into issues when I try to copy multiple variables into an empty dataset. I couldn't find any good examples of copying multiple variables with xarray. I am fine achieving this workflow either way.
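For what it's worth, the selection step itself is just dictionary filtering. Here is a minimal sketch of that idea (the `variables` dict and `subset_variables` helper below are hypothetical stand-ins for illustration, not the real file's contents):

```python
def subset_variables(variables, toinclude):
    """Keep only the named variables, in the order given by toinclude."""
    return {name: variables[name] for name in toinclude if name in variables}

# Hypothetical stand-in for a dataset's variable mapping:
variables = {'lat_20hz': [45.1], 'lon_20hz': [2.3], 'unwanted_var': [0.0]}
print(subset_variables(variables, ['lat_20hz', 'lon_20hz']))

# With a real xarray.Dataset ds, the same idea is one line
# (xr.Dataset accepts a dict of DataArrays):
#   newds = xr.Dataset({name: ds[name] for name in toinclude})
#   newds.to_netcdf(file2)
```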

Ultimately, how can I decrease the file size of the new NetCDF created with netCDF4? If that's not ideal, is there a way to add multiple variables to an empty dataset in xarray without merging issues?


Answer 1:


Would the following workflow suffice:

ds = xr.open_dataset(fName)
ds[toinclude].to_netcdf(file2)

Since you mentioned trying to decrease the file size, you should take a look at Xarray's documentation on "writing encoded data". You may want to do something like:

encoding = {v: {'zlib': True, 'complevel': 4} for v in toinclude}
ds[toinclude].to_netcdf(file2, encoding=encoding, engine='netcdf4')
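The encoding argument is an ordinary dict keyed by variable name, so it can be built by a small helper. A sketch (the helper name is mine; `complevel` trades compression speed for size in the range 1-9):

```python
def make_zlib_encoding(varnames, complevel=4):
    """Per-variable netCDF4 encoding dict enabling zlib deflate compression."""
    if not 1 <= complevel <= 9:
        raise ValueError('complevel must be between 1 and 9')
    return {v: {'zlib': True, 'complevel': complevel} for v in varnames}

encoding = make_zlib_encoding(['lat_20hz', 'lon_20hz'], complevel=4)
# then: ds[toinclude].to_netcdf(file2, encoding=encoding, engine='netcdf4')
```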



Answer 2:


Your original file's format is NETCDF3_CLASSIC, but your copy is NETCDF4_CLASSIC. That format change inflates the resulting file; I'm not sure exactly why, but I've run into this before. Change:

with nc.Dataset(file1) as src, nc.Dataset(file2, "w") as dst:

to:

with nc.Dataset(file1) as src, nc.Dataset(file2, "w", format="NETCDF3_CLASSIC") as dst:

This exposed a problem with the check for unlimited dimensions: NETCDF3_CLASSIC allows at most one unlimited dimension, so the copy must mark only genuinely unlimited dimensions as unlimited. That was also easily fixed.

My modified script is below. The resulting NetCDF file is 1.4 MB.

import netCDF4 as nc

file1 = 'JA2_GPN_2PdP001_140_20080717_113355_20080717_123008.nc'
file2 = 'test.nc'

toinclude = ['lat_20hz', 'lon_20hz', 'time_20hz', 'alt_20hz', 'ice_range_20hz_ku', 'ice_qual_flag_20hz_ku', 'model_dry_tropo_corr', 'model_wet_tropo_corr', 'iono_corr_gim_ku', 'solid_earth_tide', 'pole_tide', 'alt_state_flag_ku_band_status']

with nc.Dataset(file1) as src, nc.Dataset(file2, "w", format="NETCDF3_CLASSIC") as dst:
    # copy attributes
    for name in src.ncattrs():
        dst.setncattr(name, src.getncattr(name))
    # copy dimensions
    for name, dimension in src.dimensions.items():
        if dimension.isunlimited():
            dst.createDimension(name, None)
        else:
            dst.createDimension(name, len(dimension))
    # copy all file data for variables that are included in the toinclude list
    for name, variable in src.variables.items():
        if name in toinclude:
            x = dst.createVariable(name, variable.datatype, variable.dimensions)
            dst.variables[name][:] = src.variables[name][:]
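To double-check which on-disk container a result file actually uses, you don't even need the library: classic netCDF-3 files begin with the magic bytes `CDF`, while netCDF-4 files are HDF5 containers beginning with `\x89HDF`. A small sketch (my own helper, not part of netCDF4):

```python
def netcdf_disk_format(path):
    """Classify a file as NETCDF3, NETCDF4 (HDF5), or UNKNOWN by magic bytes."""
    with open(path, 'rb') as f:
        magic = f.read(4)
    if magic.startswith(b'CDF'):
        return 'NETCDF3'   # classic CDF-1/CDF-2/CDF-5 container
    if magic == b'\x89HDF':
        return 'NETCDF4'   # HDF5-based container
    return 'UNKNOWN'
```

Run on file1 and file2, it should report NETCDF3 for both once the format="NETCDF3_CLASSIC" fix above is in place.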



Answer 3:


If you are using the netCDF4 Python package, then you also have the command-line tools from the netcdf-c library available. For example, nccopy can copy one netCDF file to another while filtering variables:

$ VARS="lat_20hz,lon_20hz,time_20hz,alt_20hz,ice_range_20hz_ku,ice_qual_flag_20hz_ku,model_dry_tropo_corr,model_wet_tropo_corr,iono_corr_gim_ku,solid_earth_tide,pole_tide,alt_state_flag_ku_band_status"
$ nccopy -V $VARS JA2_GPN_2PdP001_140_20080717_113355_20080717_123008.nc dum.nc

The resulting file dum.nc will contain only the required variables, and its size will shrink in proportion. The output format will be the same as the input's, in this case classic netCDF-3. You can also choose the netCDF-4 classic model format:

$ nccopy -k4 -V $VARS JA2_GPN_2PdP001_140_20080717_113355_20080717_123008.nc dum_k4.nc

which incurs a little extra size overhead (<4 KiB). But if you are really worried about size, you can deflate the data with the netCDF-4 classic data model format:

$ nccopy -k4 -s -d9 -V $VARS JA2_GPN_2PdP001_140_20080717_113355_20080717_123008.nc dum_k4_s_d9.nc

whose size will be just 33% of the previous one.

[Edit: here is the command to copy the original file to a new one with the deflate option and no variable filter: $ nccopy -k4 -s -d9 JA2_GPN_2PdP001_140_20080717_113355_20080717_123008.nc orig_k4_s_d9.nc]

Here you have the file sizes of the different files for comparison:

   Size File
5514208 JA2_GPN_2PdP001_140_20080717_113355_20080717_123008.nc
2579535 orig_k4_s_d9.nc
1494174 dum_k4.nc
1457076 dum.nc
 487695 dum_k4_s_d9.nc

The NCO tools also offer lossy size-reduction algorithms based on bit-shaving and scale-offset packing.

Take a look at the nccopy -h command help.
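If you end up scripting these copies, the same nccopy invocations can be assembled from Python. A sketch (the helper name is mine, and it assumes nccopy from netcdf-c is on your PATH):

```python
import subprocess

def nccopy_cmd(src, dst, variables=None, netcdf4=False, deflate=None):
    """Build an nccopy argument list mirroring the commands above."""
    cmd = ['nccopy']
    if netcdf4:
        cmd.append('-k4')            # netCDF-4 classic model output
    if deflate is not None:
        cmd += ['-s', '-d%d' % deflate]  # shuffle + deflate level 1-9
    if variables:
        cmd += ['-V', ','.join(variables)]  # copy only these variables
    return cmd + [src, dst]

# e.g. subprocess.run(nccopy_cmd('in.nc', 'out.nc',
#                                variables=['lat_20hz'], netcdf4=True, deflate=9),
#                     check=True)
```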



Source: https://stackoverflow.com/questions/48755957/creating-a-new-netcdf-from-existing-netcdf-file-while-preserving-the-compression
