I want to convert very large CSV data to HDF5 in Python

Question

I have a very large CSV file. It looks like this:

[Date, Firm name, value 1, value 2, ..., value 60]

I want to convert this to an HDF5 file. For example, say I have two dates (2019-07-01, 2019-07-02), each date has 3 firms (firm 1, firm 2, firm 3), and each firm has [value 1, value 2, ..., value 60].

I want to use the date and the firm name as groups. Specifically, I want this hierarchy: 'Date/Firm name'.

For example, 2019-07-01 has firm 1, firm 2, and firm 3. Under each firm are its values [value 1, value 2, ..., value 60].

Any ideas?

Thanks in advance.

Answer 1:

There are a lot of ways to approach this problem. Before I show some code, a suggestion: consider your data schema carefully. It matters because it affects how easily you can access and use the data. For example, your proposed schema makes it easy to access the data for one firm on one date. But what if you want all the data for one firm across a range of dates? Or all the data for all firms on one date? Both require you to read multiple datasets and combine the arrays after you access the data.

Although it is counterintuitive, you may want to store the CSV data in a single Dataset. I show an example of each schema in the 2 methods below. Both methods use np.genfromtxt to read the CSV data. The optional parameter names=True reads the field names from the first row of your CSV file. Omit names= if you don't have a header row, and you will get default field names (f0, f1, f2, etc.). My sample data is included at the end.
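To make the effect of names=True concrete, here is a small sketch that reads the same text twice, once with the header row as field names and once without. The two-row sample and the variable names are made up for illustration, and the header is skipped with skip_header=1 in the second call so that it is not read as data.

```python
import io
import numpy as np

# Hypothetical two-row sample in the same layout as the example data,
# shortened to two value columns.
csv_text = """Date,Firm,value1,value2
2019-07-01,Firm1,0.76,0.58
2019-07-01,Firm2,0.65,0.62
"""

# names=True: field names are taken from the header row.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                            dtype=None, names=True, encoding=None)
print(with_header.dtype.names)   # ('Date', 'Firm', 'value1', 'value2')

# No names=: skip the header line ourselves; NumPy assigns default names.
no_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                          dtype=None, skip_header=1, encoding=None)
print(no_header.dtype.names)     # ('f0', 'f1', 'f2', 'f3')

# Fields are then addressed by name, e.g. the firm column:
print(with_header['Firm'])
```

With dtype=None each column gets its own type, so the result is a 1-D structured array with one record per CSV row, not a 2-D float array.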

Method 1: using h5py
Group Names: Date
Dataset Names: Firms

import numpy as np
import h5py

csv_recarr = np.genfromtxt('SO_57120995.csv',delimiter=',',dtype=None, names=True, encoding=None)
print (csv_recarr.dtype)

with h5py.File('SO_57120995.h5', 'w') as h5f:

    for row in csv_recarr:
        # the first two fields name the group (date) and the dataset (firm)
        date = row[0]
        firm = row[1]
        grp = h5f.require_group(date)

        # convert the row to a tuple and keep only the value fields
        row_data = row.item()[2:]
        grp.create_dataset(firm, data=row_data)
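To illustrate the schema trade-off mentioned earlier, here is a rough sketch of reading data back from a Date/Firm layout. It writes a tiny made-up data set to a temporary file (the values, file name, and variable names are all hypothetical), then reads one firm on one date and one firm across all dates.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical miniature data set: 2 dates x 2 firms x 3 values.
sample = {
    '2019-07-01': {'Firm1': [0.1, 0.2, 0.3], 'Firm2': [0.4, 0.5, 0.6]},
    '2019-07-02': {'Firm1': [0.7, 0.8, 0.9], 'Firm2': [1.0, 1.1, 1.2]},
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'date_firm.h5')

    # Write with the same /Date/Firm layout as Method 1.
    with h5py.File(path, 'w') as h5f:
        for date, firms in sample.items():
            grp = h5f.require_group(date)
            for firm, values in firms.items():
                grp.create_dataset(firm, data=values)

    with h5py.File(path, 'r') as h5f:
        # One firm on one date: a single lookup.
        one_day = h5f['2019-07-01/Firm1'][:]

        # One firm across all dates: visit every date group and stack the rows.
        dates = sorted(h5f.keys())
        firm1_all = np.vstack([h5f[d]['Firm1'][:] for d in dates])

print(one_day)
print(firm1_all.shape)   # (2, 3): one row per date
```

The second query touches one dataset per date, which is the extra array handling this layout costs you.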

Method 2: using PyTables
All data stored in Dataset: /CSV_Data

import numpy as np
import tables as tb

csv_recarr = np.genfromtxt('SO_57120995.csv',delimiter=',',dtype=None, names=True, encoding=None)
print (csv_recarr.dtype)

with tb.open_file('SO_57120995_2.h5', 'w') as h5f:
    # this should work, but only the first character of each string is loaded:
    # dset = h5f.create_table('/', 'CSV_Data', obj=csv_recarr)
    # create an empty table instead
    dset = h5f.create_table('/', 'CSV_Data', description=csv_recarr.dtype)

    # workaround: append the CSV data one row at a time
    for row in csv_recarr:
        dset.append([row.item()])
    dset.flush()

    # example: extract the rows that match a field value
    firm_arr = dset.read_where('Firm==b"Firm1"')
    print(firm_arr)
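As a sketch of why the single-table layout can be convenient, the example below answers both queries from the schema discussion (one firm across all dates, all firms on one date) with one read_where call each. It uses a tiny made-up data set written to a temporary file, and it assumes the string columns are stored as fixed-width bytes (S10 and S8), which is my choice for the illustration and not something the CSV reader produces by default.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical miniature data set. Assumption: strings are fixed-width bytes.
row_dtype = np.dtype([('Date', 'S10'), ('Firm', 'S8'),
                      ('value1', 'f8'), ('value2', 'f8')])
rows = np.array([
    (b'2019-07-01', b'Firm1', 0.1, 0.2),
    (b'2019-07-01', b'Firm2', 0.3, 0.4),
    (b'2019-07-02', b'Firm1', 0.5, 0.6),
    (b'2019-07-02', b'Firm2', 0.7, 0.8),
], dtype=row_dtype)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'single_table.h5')

    with tb.open_file(path, 'w') as h5f:
        # Create the table from the dtype, then append all rows in one call.
        table = h5f.create_table('/', 'CSV_Data', description=row_dtype)
        table.append(rows)
        table.flush()

        # One firm across all dates.
        firm1_rows = table.read_where('Firm == b"Firm1"')

        # All firms on one date.
        day1_rows = table.read_where('Date == b"2019-07-01"')

print(firm1_rows['value1'])
print(day1_rows['Firm'])
```

Each query returns a NumPy structured array, so the value columns can be selected by field name afterwards.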

Example data:

Date,Firm,value1,value2,value3,value4,value5,value6,value7,value8,value9,value10
2019-07-01,Firm1,7.634758e-01,5.781637e-01,8.531480e-01,8.823769e-01,5.780567e-01,3.587480e-01,4.065076e-01,8.520372e-02,3.392133e-01,1.104916e-01
2019-07-01,Firm2,6.457887e-01,6.150677e-01,3.501075e-01,8.886556e-01,5.379832e-01,4.561159e-01,4.773242e-01,7.302280e-01,6.018719e-01,3.835672e-01
2019-07-01,Firm3,3.641129e-01,8.356681e-01,7.783146e-01,1.735361e-01,8.610319e-01,1.360989e-01,5.025533e-01,5.292365e-01,4.964461e-01,7.309130e-01
2019-07-02,Firm1,4.128258e-01,1.339008e-01,3.530394e-02,5.293509e-01,3.608783e-01,6.647519e-01,2.898612e-01,5.632466e-01,5.981161e-01,9.149318e-01
2019-07-02,Firm2,1.037654e-01,3.717925e-01,4.876283e-01,5.852448e-01,4.689806e-01,2.508458e-01,7.243468e-02,3.510882e-01,8.290331e-01,7.808357e-01
2019-07-02,Firm3,8.443163e-01,5.408783e-01,8.278920e-01,8.454836e-01,7.331165e-02,4.167235e-01,6.187155e-01,6.114338e-01,2.299935e-01,5.206390e-01
2019-07-03,Firm1,2.281612e-01,2.660087e-02,3.809895e-01,8.032823e-01,2.492683e-03,9.600432e-02,5.059484e-01,1.795972e-01,2.174838e-01,3.578077e-01
2019-07-03,Firm2,2.403236e-01,1.497736e-01,7.357259e-01,2.501746e-01,2.826287e-01,3.335158e-01,7.742914e-01,1.773830e-01,8.407694e-01,7.466135e-01
2019-07-03,Firm3,8.806318e-01,1.164414e-01,6.791358e-01,4.752967e-01,3.695451e-01,9.728813e-01,3.553896e-01,2.559315e-01,6.942147e-01,2.701471e-01
2019-07-04,Firm1,2.153168e-01,5.169252e-01,5.136280e-01,7.517068e-01,1.977217e-01,7.221689e-01,5.877799e-01,9.099813e-02,9.073012e-03,5.946624e-01
2019-07-04,Firm2,8.275230e-01,9.725115e-01,5.218725e-03,7.728741e-01,4.371698e-01,3.593862e-02,3.448388e-01,7.443235e-01,2.606604e-01,9.888835e-02
2019-07-04,Firm3,8.599242e-01,8.336458e-01,1.451350e-01,9.777518e-02,3.335788e-01,1.117006e-01,9.105203e-01,3.478112e-01,8.948065e-01,3.105299e-01

Source: https://stackoverflow.com/questions/57120995/i-want-to-convert-very-large-csv-data-to-hdf5-in-python

Tags

python

hdf5

h5py

pytables