h5py not sticking to chunking specification?

Submitted by 末鹿安然 on 2019-11-26 18:36:32

Question


Problem: I have about 5000 existing netCDF4 files, each typically holding 96x3712x3712 data points (float32). The first dimension is time (one file per day); the second and third are spatial dimensions. Currently, making a slice over the first dimension (even a partial slice) takes a lot of time, for the following reasons:

  • the netCDF files are chunked with a chunk size of 1x3712x3712, so slicing over the time dimension basically reads the entire file.
  • looping (even in multiple processes) over all of the smaller files also takes a lot of time.

My goal:

  • create monthly files (about 2900x3712x3712 data points each)
  • optimize them for slicing in the time dimension (a chunk size of 2900x1x1, or slightly bigger in the spatial dimensions); a rough sketch of such a layout follows the requirements below

Other requirements:

  • the files should be appendable one timestamp (1x3712x3712) at a time, and this update process should take less than 15 minutes
  • queries should be fast enough: a full slice over time (i.e. 2900x1x1) in less than one second ==> not that much data, in fact...
  • preferably, the files should be readable by multiple processes while being updated
  • processing the historical data (the other 5000 daily files) should preferably take less than a couple of weeks.
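
To make the chunking goal concrete, here is a minimal sketch (an illustration only, not a finished solution) of how such a monthly, time-optimized layout could be declared in h5py; the file name and the exact chunk shape are placeholders:

import h5py

# hypothetical monthly file: 31 days x 96 timestamps = 2976 time steps
with h5py.File('/data/monthly.h5', 'w') as f:
    f.create_dataset('variable',
                     shape=(2976, 3712, 3712),
                     maxshape=(None, 3712, 3712),   # appendable along the time axis
                     dtype='f4',
                     chunks=(2976, 32, 32),         # whole time range in one chunk per spatial tile
                     compression='gzip', compression_opts=9)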

I have already tried multiple approaches:

  • concatenating the netCDF files and rechunking them ==> takes too much memory and too much time...
  • writing them from pandas to an HDF file (using PyTables) ==> creates a wide table with a huge index. This will eventually take too much time to read as well, and requires the dataset to be tiled over the spatial dimensions because of metadata constraints.
  • my last approach was writing them to an HDF5 file using h5py:

Here's the code to create a single monthly file:

import os
import logging

import h5py
import numpy as np
import pandas as pd
from netCDF4 import Dataset, num2date

logger = logging.getLogger(__name__)

def create_h5(fps):
    timestamps = pd.date_range("20050101", periods=31*96, freq='15T')  # reference time period
    output_fp = r'/data/test.h5'
    nodays = len(fps)  # assumed: one daily netCDF file per entry in fps
    try:
        f = h5py.File(output_fp, 'a', libver='latest')
        shape = 96*nodays, 3712, 3712
        d = f.create_dataset('variable', shape=(1, 3712, 3712), maxshape=(None, 3712, 3712),
                             dtype='f', compression='gzip', compression_opts=9, chunks=(1, 29, 29))
        f.swmr_mode = True
        for fp in fps:
            try:
                nc = Dataset(fp)
                times = num2date(nc.variables['time'][:], nc.variables['time'].units)
                indices = np.searchsorted(timestamps, times)
                for j, time in enumerate(times):
                    logger.debug("File: {}, timestamp: {:%Y%m%d %H:%M}, pos: {}, new_pos: {}".format(
                        os.path.basename(fp), time, j, indices[j]))
                    # grow the dataset up to the new timestamp, then write the frame
                    d.resize((indices[j]+1, shape[1], shape[2]))
                    d[indices[j]] = nc.variables['variable'][j:j+1]
                    f.flush()
            finally:
                nc.close()
    finally:
        f.close()
    return output_fp

I'm using the latest version of HDF5 to have the SWMR option. The fps argument is a list of file paths of the daily netCDF4 files. The script creates the file (on an SSD, though I can see that creating the file is mainly CPU bound) in about 2 hours, which is acceptable.

I have compression set up to keep the file size within limits. I did earlier tests without it: creation is a bit faster without compression, but slicing does not take much longer with it. H5py automatically chunks the dataset in 1x116x116 chunks.
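
To double-check what actually ended up in the file, the chunk layout and compression filters can be inspected directly on the dataset (a quick sketch, reusing the path and dataset name from the script above):

import h5py

with h5py.File('/data/test.h5', 'r') as f:
    d = f['variable']
    print(d.chunks)                            # the chunk shape HDF5 is actually using
    print(d.compression, d.compression_opts)   # e.g. 'gzip' 9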

Now the problem: slicing along the time dimension on a NAS with a RAID 6 setup takes about 20 seconds, even though the data lies in a single chunk...

I figure that, even though it is in a single chunk in the file, it must be fragmented somehow because I wrote all of the values in a loop (I don't know how that process works internally, though). That is why I tried h5repack (using the HDF5 command-line tools) to copy the data into a new file with the same chunks, hoping it would reorder the values so that a query can read them more sequentially, but no luck. Even though this process took 6 hours to run, it did nothing for the query speed.
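
For reference, the repack can be driven like this (a sketch run from Python; the output path and the exact chunk/filter specification are placeholders, not guaranteed to match what I used):

import subprocess

subprocess.run([
    'h5repack',
    '-l', 'variable:CHUNK=2976x32x32',   # rewrite the dataset with this chunk layout
    '-f', 'variable:GZIP=9',             # keep gzip level 9 compression
    '/data/test.h5', '/data/test_repacked.h5',
], check=True)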

If I do my calculations right, one chunk (2976x32x32) is only a few MB (11 MB uncompressed, and I think only a bit more than 1 MB compressed). How can reading it take so long? What am I doing wrong? I would be glad if someone could shed some light on what is actually going on behind the scenes...
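
For what it's worth, the size estimate checks out (float32, so 4 bytes per value):

import numpy as np

chunk_bytes = 2976 * 32 * 32 * np.dtype('float32').itemsize
print(chunk_bytes / 1024**2)   # ~11.6 MB uncompressed for one (2976,32,32) chunk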


Answer 1:


The influence of chunk size

In the worst case, reading or writing one chunk amounts to a random read/write operation. The main advantage of an SSD is its speed at reading and writing small blocks of data. An HDD is much slower at this (a factor of 100 can be observed), and a NAS can be slower still.

So the solution to the problem is a larger chunk size. Some benchmarks on my system (Core i5-4690):

Example 1 (chunk size (1,29,29) = 3.4 kB):

import numpy as np
import tables  # needed for the blosc filter
import h5py as h5
import time
import h5py_cache as h5c

def original_chunk_size():
    File_Name_HDF5 = 'some_Path'
    Array = np.random.rand(1, 3712, 3712)  # one frame, written to every timestep

    f = h5.File(File_Name_HDF5, 'a', libver='latest')
    nodays = 1

    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None, 3712, 3712), dtype='f',
                         chunks=(1, 29, 29), compression=32001,
                         compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    f.swmr_mode = True  # enable SWMR only after the dataset has been created

    # Writing
    t1 = time.time()
    for i in range(0, 96*nodays):
        d[i:i+1, :, :] = Array

    f.close()
    print(time.time()-t1)

    # Reading
    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'a', libver='latest')
    f.swmr_mode = True
    d = f['variable']

    for i in range(0, 3712, 29):
        for j in range(0, 3712, 29):
            A = np.copy(d[:, i:i+29, j:j+29])

    print(time.time()-t1)
    f.close()

Results (write/read):

SSD: 38s/54s

HDD: 40s/57s

NAS: 252s/823s

In the second example I will use h5py_cache, because I want to keep providing (1,3712,3712) slabs while the chunks span the whole time axis. The default chunk cache size is only 1 MB, so it has to be increased to avoid multiple read/write operations per chunk. https://pypi.python.org/pypi/h5py-cache/1.0
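
(Side note: recent h5py versions, 2.9 and later, let you size the chunk cache directly on h5py.File via the rdcc_* keywords, so h5py_cache is no longer strictly necessary. A sketch of that alternative, not what the benchmark below uses:)

import h5py

f = h5py.File('some_Path', 'a', libver='latest',
              rdcc_nbytes=6*1024**3,   # 6 GB chunk cache
              rdcc_nslots=1000003)     # number of cache slots; ideally a large prime
f.close()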

Example 2 (chunk size (96,58,58) = 1.3 MB):

import numpy as np
import tables  # needed for the blosc filter
import h5py as h5
import time
import h5py_cache as h5c

def modified_chunk_size():
    File_Name_HDF5 = 'some_Path'
    Array = np.random.rand(1, 3712, 3712)

    f = h5c.File(File_Name_HDF5, 'a', libver='latest',
                 chunk_cache_mem_size=6*1024**3)  # 6 GB chunk cache
    nodays = 1

    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None, 3712, 3712), dtype='f',
                         chunks=(96, 58, 58), compression=32001,
                         compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    f.swmr_mode = True  # enable SWMR only after the dataset has been created

    # Writing
    t1 = time.time()
    for i in range(0, 96*nodays):
        d[i:i+1, :, :] = Array

    f.close()
    print(time.time()-t1)

    # Reading
    t1 = time.time()
    f = h5c.File(File_Name_HDF5, 'a', libver='latest', chunk_cache_mem_size=6*1024**3)  # 6 GB chunk cache
    f.swmr_mode = True
    d = f['variable']

    for i in range(0, 3712, 58):
        for j in range(0, 3712, 58):
            A = np.copy(d[:, i:i+58, j:j+58])

    print(time.time()-t1)
    f.close()

Results (write/read):

SSD: 10s/16s

HDD: 10s/16s

NAS: 13s/20s

The read/write speed can be further improved by minimizing the number of API calls, i.e. reading and writing several chunks per call.
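
A sketch of reading several chunk columns per call instead of a single one (the block size here is an arbitrary choice):

import numpy as np
import h5py as h5

File_Name_HDF5 = 'some_Path'
f = h5.File(File_Name_HDF5, 'r')
d = f['variable']

block = 58 * 8   # read an 8x8 group of (96,58,58) chunk columns per call
for i in range(0, 3712, block):
    for j in range(0, 3712, block):
        A = np.copy(d[:, i:i+block, j:j+block])

f.close()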

I also want to mention the compression method here. Blosc can achieve up to 1 GB/s of throughput (it is CPU bound); gzip is slower, but provides better compression ratios.

d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=3)

20s/30s file size: 101 MB

d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=6)

50s/58s file size: 87 MB

d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=9)

50s/60s file size: 64 MB

And now a benchmark of a whole month (30 days). The writing is a bit optimized here: the data is written in (96,3712,3712) blocks, one day per call.

def modified_chunk_size():
    File_Name_HDF5 = 'some_Path'

    # build one day (96 identical frames) of test data
    Array_R = np.random.rand(1, 3712, 3712)
    Array = np.zeros((96, 3712, 3712), dtype=np.float32)
    for j in range(0, 96):
        Array[j, :, :] = Array_R

    f = h5.File(File_Name_HDF5, 'a', libver='latest')
    nodays = 30

    shape = 96, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None, 3712, 3712), dtype='f',
                         chunks=(96, 58, 58), compression=32001,
                         compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    f.swmr_mode = True  # enable SWMR only after the dataset has been created

    # Writing: one whole day per call, growing the dataset as needed
    t1 = time.time()
    for i in range(0, 96*nodays, 96):
        if i + 96 > d.shape[0]:
            d.resize((i + 96, shape[1], shape[2]))
        d[i:i+96, :, :] = Array

    f.close()
    print(time.time()-t1)

    # Reading
    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'a', libver='latest')
    f.swmr_mode = True
    d = f['variable']
    for i in range(0, 3712, 58):
        for j in range(0, 3712, 58):
            A = np.copy(d[:, i:i+58, j:j+58])

    print(time.time()-t1)
    f.close()

133s/301s with blosc

432s/684s with gzip compression_opts=3

I had the same problems when accessing data on a NAS. I hope this helps...



Source: https://stackoverflow.com/questions/44881895/h5py-not-sticking-to-chunking-specification
