Question
I create an expandable EArray of Nx4 columns. Some columns require the float64 datatype, while the others can be managed with int32. Is it possible to vary the data types among the columns? Right now I just use one dtype (float64, below) for all of them, but it takes huge disk space (the files are >10 GB).
For example, how can I ensure that the elements of columns 1-2 are int32 and those of columns 3-4 are float64?
import tables
import numpy as np

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))
Here is a simplified version of how I am appending using the EArray:
Matrix = np.ones(shape=(10**6, 4))

if counter <= 10**6:  # keep appending to Matrix until it holds 10**6 rows
    Matrix[s:s+length, 0:4] = chunk2[left:right]  # chunk2 is an input np.ndarray
    s += length

# save to disk when the buffer reaches 10**6 rows
if counter > 10**6:
    a.append(Matrix[:s])
    del Matrix
    Matrix = np.ones(shape=(10**6, 4))
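For reference, a self-contained sketch of that buffering pattern; the number of chunks, the chunk size, and the random chunk2 below are illustrative assumptions, not the real input:

import numpy as np
import tables

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

buffer_rows = 10**6
Matrix = np.ones(shape=(buffer_rows, 4))
s = 0
for _ in range(25):                       # illustrative: 25 incoming chunks
    chunk2 = np.random.rand(10**5, 4)     # stand-in for the real input array
    length = chunk2.shape[0]
    if s + length > buffer_rows:          # buffer full: flush to disk and reset
        a.append(Matrix[:s])
        s = 0
    Matrix[s:s+length, 0:4] = chunk2
    s += length
a.append(Matrix[:s])                      # flush the remaining rows
f1.close()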
What are the cons of the following method (two separate EArrays, one per datatype)?
import tables as tb
import numpy as np
filename = 'foo.h5'
f = tb.open_file(filename, mode='w')
int_app = f.create_earray(f.root, "col1", atom=tb.Int32Atom(), shape=(0,2), chunkshape=(3,2))
float_app = f.create_earray(f.root, "col2", atom=tb.Float64Atom(), shape=(0,2), chunkshape=(3,2))
# array containing ints; in reality it will be 10**6 x 2
arr1 = np.array([[1, 1],
                 [2, 2],
                 [3, 3]], dtype=np.int32)

# array containing floats; in reality it will be 10**6 x 2
arr2 = np.array([[1.1, 1.2],
                 [1.1, 1.2],
                 [1.1, 1.2]], dtype=np.float64)

for i in range(3):
    int_app.append(arr1)
    float_app.append(arr2)
f.close()
print('\n*********************************************************')
print("\t\t Reading Now=> ")
print('*********************************************************')
c = tb.open_file('foo.h5', mode='r')
chunks1 = c.root.col1
chunks2 = c.root.col2
chunk1 = chunks1.read()
chunk2 = chunks2.read()
print(chunk1)
print(chunk2)
Answer 1:
No and yes. All PyTables array types (Array, CArray, EArray, VLArray) are for homogeneous datatypes (similar to a NumPy ndarray). If you want to mix datatypes, you need to use a Table. Tables are extendable; they have an .append() method to add rows of data.
The creation process is similar to this answer (only the dtype is different): PyTables create_array fails to save numpy array. You only define the datatypes for a row; you don't define the shape or number of rows. That is implied as you add data to the table. If you already have your data in a NumPy recarray, you can reference it with the description= entry, and the Table will use its dtype and be populated with the data. More info here: PyTables Tables Class
Your code would look something like this:
import tables as tb
import numpy as np

# int32 for the first two columns and float64 for the last two, per the question
table_dt = np.dtype(
    {'names': ['int1', 'int2', 'float1', 'float2'],
     'formats': [np.int32, np.int32, np.float64, np.float64]})

# Create some random data:
i1 = np.random.randint(0, 1000, (10**6,))
i2 = np.random.randint(0, 1000, (10**6,))
f1 = np.random.rand(10**6)
f2 = np.random.rand(10**6)

with tb.open_file('table.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)

    # Both methods below are shown for illustration; in practice use one or the other.

    # Method 1: create an empty recarray 'Matrix', then add data:
    Matrix = np.recarray((10**6,), dtype=table_dt)
    Matrix['int1'] = i1
    Matrix['int2'] = i2
    Matrix['float1'] = f1
    Matrix['float2'] = f2
    # Append Matrix to the table
    a.append(Matrix)

    # Method 2: create recarray 'Matrix' with data in 1 step:
    Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
    # Append Matrix to the table
    a.append(Matrix)
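As mentioned above, if the data already lives in a NumPy recarray, that array can itself be passed as description=, so the table structure and its contents both come from the array. A minimal sketch using the same table_dt and arrays as above (the filename 'table2.h5' is just illustrative):

# pass a populated recarray as description=: the table takes its dtype
# and is filled with the array's data in one call
Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
with tb.open_file('table2.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=Matrix)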
You mentioned creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some additional thoughts based on comments in another thread.
The .create_table() method has an optional parameter: expectedrows=. This parameter is used 'to optimize the HDF5 B-Tree and amount of memory used'. The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it's only 10000 in my installation). I highly suggest you set this to a larger value if you are creating 10**6 (or more) rows.
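A minimal sketch of passing expectedrows= at creation time; the 2*10**6 figure is just an assumed estimate of the final row count:

with tb.open_file('table.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt,
                         expectedrows=2 * 10**6)  # assumed estimate of total rows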
Also, you should consider file compression. There's a trade-off: compression reduces the file size, but will reduce I/O performance (it increases access time). There are a few options:
- Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1); a short sketch follows this list.
- Use the HDF Group utility h5repack - run it against an HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice versa).
- Use the PyTables utility ptrepack - it works similarly to h5repack and is delivered with PyTables.
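A minimal sketch of the first option, enabling compression when the file is created (zlib at level 1 is just one reasonable starting choice):

# every node created in this file inherits the compression filter
filters = tb.Filters(complevel=1, complib='zlib')
with tb.open_file('table.h5', 'w', filters=filters) as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)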
I tend to keep the files I work with often uncompressed for best I/O performance. Then, when done, I convert them to a compressed format for long-term archiving.
Source: https://stackoverflow.com/questions/63495969/python-pytables-is-it-possible-to-have-different-data-types-for-different-colum