I have more than 500 PyTables stores that contain about 300 MB of data each. I would like to merge these files into one big store using pandas append, as in the code below.
from pandas import HDFStore

def merge_hdfs(file_list, merged_store):
    for file in file_list:
        store = HDFStore(file, mode='r')
        merged_store.append('data', store.data)
        store.close()
The append operation is very slow (it takes up to 10 minutes to append a single store to merged_store), and strangely the file size of merged_store seems to increase by about 1 GB for each appended store.
I have indicated the total number of expected rows, which according to the documentation should improve performance, and having read Improve pandas (PyTables?) HDF5 table write performance I was expecting long write times, but almost 10 minutes for every 300 MB seems too slow, and I cannot understand why the file size increases so much. Am I missing something?
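For reference, the expected-rows hint is passed through the expectedrows keyword of append, roughly like this (the row total below is only an illustrative placeholder):

merged_store.append('data', store.data, expectedrows=150000000)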
For additional information, here is the description of one of the 500 PyTables stores:
/data/table (Table(272734,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(6,), dflt=0.0, pos=1),
"id": StringCol(itemsize=11, shape=(), dflt='', pos=2),
"datetaken": Int64Col(shape=(), dflt=0, pos=3),
"owner": StringCol(itemsize=15, shape=(), dflt='', pos=4),
"machine_tags": StringCol(itemsize=100, shape=(), dflt='', pos=5),
"title": StringCol(itemsize=200, shape=(), dflt='', pos=6),
"country": StringCol(itemsize=3, shape=(), dflt='', pos=7),
"place_id": StringCol(itemsize=18, shape=(), dflt='', pos=8),
"url_s": StringCol(itemsize=80, shape=(), dflt='', pos=9),
"url_o": StringCol(itemsize=80, shape=(), dflt='', pos=10),
"ownername": StringCol(itemsize=50, shape=(), dflt='', pos=11),
"tags": StringCol(itemsize=505, shape=(), dflt='', pos=12)}
byteorder := 'little'
chunkshape := (232,)
This is basically the answer here, which I recently wrote. The bottom line is this: you need to turn off indexing with store.append('df', df, index=False) when creating the store, then build the index at the end.
Furthermore, turn off compression when merging the tables as well. Indexing is a fairly expensive operation and, if I recall correctly, only uses a single processor.
Finally, make sure that you create the merged store with mode='w', as all of the subsequent operations are appends and you want to start with a clean new file.
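Putting those pieces together, a minimal sketch of the merge might look like the following (the output file name merged.h5 and the optlevel/kind index options are my own assumptions, not something from the question):

from pandas import HDFStore

merged_store = HDFStore('merged.h5', mode='w')   # mode='w' starts from a clean file

for file in file_list:
    store = HDFStore(file, mode='r')
    # index=False skips index creation on every append; compression is also left off here
    merged_store.append('data', store.data, index=False)
    store.close()

# build the index once, after all the appends
merged_store.create_table_index('data', optlevel=9, kind='full')
merged_store.close()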
I also would NOT specify the chunksize upfront. Rather, after you have created the final index, perform the compression using ptrepack and specify chunkshape=auto, which will compute it for you. I don't think this will affect write performance, but it will optimize query performance.
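For example, a ptrepack invocation along these lines (the file names are placeholders, and blosc with complevel=9 is just one reasonable choice) recompresses and re-chunks the finished file:

ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc merged.h5 merged_compressed.h5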
You might try tweaking the chunksize parameter to append (this is the writing chunksize) to a larger number as well.
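For instance, something like this (the 1000000 value is only an illustrative guess, not a recommendation):

merged_store.append('data', store.data, index=False, chunksize=1000000)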
Obviously, make sure that each of the appended tables has exactly the same structure (append will raise an error if this is not the case).
I created this issue for an enhancement to do this 'internally': https://github.com/pydata/pandas/issues/6837
Source: https://stackoverflow.com/questions/22934996/pandas-pytables-append-performance-and-increase-in-file-size