Question
When I use pandas to convert CSV files to hdf5 files, the resulting file is extremely large. For example, a test CSV file (23 columns, 1.3 million rows) of 170 MB results in an hdf5 file of 2 GB. However, if pandas is bypassed and the hdf5 file is written directly (using PyTables), it is only 20 MB. In the following code (which is used to do the conversion in pandas), the values of the object columns in the dataframe are explicitly converted to string objects (to prevent pickling):
import pandas as pd

# Open the csv file as a pandas data frame
data = pd.read_csv(csvfilepath, sep=delimiter, low_memory=False)
# Write the resulting data frame to the hdf5 file
data.to_hdf(hdf5_file_path, table_name, format='table', complevel=9,
            complib='lzo')
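The conversion itself is not shown above; a minimal sketch of what forcing the object columns to plain strings might look like (this is an assumption, the exact code is not given in the question):

# Hypothetical sketch of the 'conversion' mentioned above: cast every
# object-dtype column to plain Python strings before writing.
for col in data.columns[data.dtypes == object]:
    data[col] = data[col].astype(str)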
Inspecting the resulting hdf5 file with ViTables (screenshot omitted) shows the structure described below.
What seems odd to me is that the values are represented as a (Python?) list per data type (values_block0: int, values_block1: float and values_block2: string) instead of one specific column for every column in the CSV file. I'm wondering whether this causes the large file size and what the impact will be on query times.
Given that about 1 TB has to be converted, I would like to know what can be done to reduce the size of the resulting hdf5 file.
P.S. I'm aware of this question, but it states that the large hdf5 file size is caused by the HDF5 format itself, which can't be the cause here given that the hdf5 file resulting from bypassing pandas is much smaller.
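For reference, a direct PyTables write of the kind referred to above might look roughly like the following minimal sketch; the three-column layout and the file/variable names are placeholders, not the actual 23-column schema:

import tables

# Hypothetical 3-column layout standing in for the real 23-column schema.
class PlanRow(tables.IsDescription):
    plan_id  = tables.Int64Col()
    amount   = tables.Float64Col()
    activity = tables.StringCol(16)   # fixed-width string sized to the data

csv_rows = [(1, 2.5, b'foo'), (2, 3.5, b'bar')]   # stand-in for parsed CSV rows

filters = tables.Filters(complevel=9, complib='lzo')
with tables.open_file('direct.h5', mode='w', filters=filters) as h5:
    table = h5.create_table('/', 'data', PlanRow)
    row = table.row
    for plan_id, amount, activity in csv_rows:
        row['plan_id'] = plan_id
        row['amount'] = amount
        row['activity'] = activity
        row.append()
    table.flush()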
P.P.S. Using data.iloc instead of data.loc as suggested by joris doesn't make any difference. I've also removed the 'conversion'; it doesn't make a difference either. The info on the read dataframe, as requested by Jeff:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1303331 entries, 0 to 1303330
Columns: 23 entries, _PlanId to ACTIVITY_Gratis
dtypes: float64(1), int64(5), object(17)
Answer 1:
Here's an informal comparison of times/sizes for various IO methods.
Using pandas 0.13.1 on 64-bit Linux.
Setup
In [1]: import numpy as np
In [2]: from pandas import DataFrame
In [3]: N = 1000000
In [4]: df = DataFrame(dict([ ("int{0}".format(i),np.random.randint(0,10,size=N)) for i in range(5) ]))
In [5]: df['float'] = np.random.randn(N)
In [6]: from random import randrange
In [8]: for i in range(10):
...: df["object_1_{0}".format(i)] = ['%08x'%randrange(16**8) for _ in range(N)]
...:
In [9]: for i in range(7):
...: df["object_2_{0}".format(i)] = ['%15x'%randrange(16**15) for _ in range(N)]
...:
In [11]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 23 columns):
int0 1000000 non-null int64
int1 1000000 non-null int64
int2 1000000 non-null int64
int3 1000000 non-null int64
int4 1000000 non-null int64
float 1000000 non-null float64
object_1_0 1000000 non-null object
object_1_1 1000000 non-null object
object_1_2 1000000 non-null object
object_1_3 1000000 non-null object
object_1_4 1000000 non-null object
object_1_5 1000000 non-null object
object_1_6 1000000 non-null object
object_1_7 1000000 non-null object
object_1_8 1000000 non-null object
object_1_9 1000000 non-null object
object_2_0 1000000 non-null object
object_2_1 1000000 non-null object
object_2_2 1000000 non-null object
object_2_3 1000000 non-null object
object_2_4 1000000 non-null object
object_2_5 1000000 non-null object
object_2_6 1000000 non-null object
dtypes: float64(1), int64(5), object(17)
Saving with various methods
In [12]: df.to_hdf('test_fixed.h5','data',format='fixed')
In [13]: df.to_hdf('test_table_no_dc.h5','data',format='table')
In [14]: df.to_hdf('test_table_dc.h5','data',format='table',data_columns=True)
In [15]: df.to_hdf('test_fixed_compressed.h5','data',format='fixed',complib='blosc',complevel=9)
In [16]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback users 361093304 Apr 28 09:20 test_fixed.h5
-rw-rw-r-- 1 jreback users 311475690 Apr 28 09:21 test_table_no_dc.h5
-rw-rw-r-- 1 jreback users 351316525 Apr 28 09:22 test_table_dc.h5
-rw-rw-r-- 1 jreback users 317467870 Apr 28 2014 test_fixed_compressed.h5
The size on disk is going to be a function of the string size that is selected for each column; if you use NO data_columns, then it's the longest size of ANY string (all the object columns are stored together in one block). So writing with data_columns can potentially reduce the size here (balanced against the fact that you then have more columns, so it takes more space per column). You probably want to specify min_itemsize to control this; see here.
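As a hedged sketch of what that might look like (the file name is made up; the string column mirrors the object_1_* columns built above):

import pandas as pd
from random import randrange

# Small frame mirroring the setup above: one int column and one 8-char hex string column.
N = 1000
df = pd.DataFrame({
    'int0': range(N),
    'object_1_0': ['%08x' % randrange(16**8) for _ in range(N)],
})

# data_columns=True stores each column as its own on-disk column (and makes it
# queryable); min_itemsize pins the string width so later appends cannot overflow it.
df.to_hdf('test_dc_sized.h5', 'data', format='table',
          data_columns=True,
          min_itemsize={'object_1_0': 8},
          complib='blosc', complevel=9)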
Here is an example of the on-disk structure:
In [8]: DataFrame(dict(A = ['foo','bar','bah'], B = [1,2,3], C = [1.0,2.0,3.0], D=[4.0,5.0,6.0])).to_hdf('test.h5','data',mode='w',format='table')
In [9]: !ptdump -avd test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/data (Group) ''
/data._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A', 'B', 'C', 'D'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1', 'values_block_2']]
/data/table (Table(3,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
"values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
"values_block_2": StringCol(itemsize=3, shape=(1,), dflt='', pos=3)}
byteorder := 'little'
chunkshape := (1872,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/data/table._v_attrs (AttributeSet), 19 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0,
FIELD_2_NAME := 'values_block_1',
FIELD_3_FILL := '',
FIELD_3_NAME := 'values_block_2',
NROWS := 3,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer',
values_block_0_dtype := 'float64',
values_block_0_kind := ['C', 'D'],
values_block_1_dtype := 'int64',
values_block_1_kind := ['B'],
values_block_2_dtype := 'string24',
values_block_2_kind := ['A']]
Data dump:
[0] (0, [1.0, 4.0], [1], ['foo'])
[1] (1, [2.0, 5.0], [2], ['bar'])
[2] (2, [3.0, 6.0], [3], ['bah'])
Dtypes are grouped into blocks (if you have data_columns, then they are stored as separate columns). These are just printed this way; they are stored array-like.
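One practical consequence for query times: only columns written as data_columns can be used in a where clause. A brief sketch against the test_table_dc.h5 file written above (the query value is arbitrary):

import pandas as pd

# 'int0' was written as a data column (data_columns=True), so it can be
# filtered on-disk; block-stored columns cannot appear in a where clause.
subset = pd.read_hdf('test_table_dc.h5', 'data', where='int0 == 5')
print(len(subset))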
Source: https://stackoverflow.com/questions/23334211/converting-csv-file-to-hdf5-using-pandas