Question
I've seen a few ways to read a formatted binary file in Python into Pandas. Specifically, I'm using the code below, which reads the file with NumPy's fromfile using a structure described by a dtype.
import numpy as np
import pandas as pd

input_file_name = 'test.hst'
input_file = open(input_file_name, 'rb')

# Fixed-size 96-byte header, described field by field with a structured dtype
header_bytes = input_file.read(96)
dt_header = np.dtype([('version', 'i4'),
                      ('copyright', 'S64'),
                      ('symbol', 'S12'),
                      ('period', 'i4'),
                      ('digits', 'i4'),
                      ('timesign', 'i4'),
                      ('last_sync', 'i4')])
# np.fromstring is deprecated for binary data; np.frombuffer is its replacement
header = np.frombuffer(header_bytes, dt_header)

# The remaining bytes are fixed-size records, read until end of file
dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])
records = np.fromfile(input_file, dt_records)
input_file.close()

df_records = pd.DataFrame(records)
# Now, do some changes in the individual values of df_records
# and then write it back to a binary file
Now, my issue is how to write this back to a new file. I can't find any function in NumPy (or Pandas) that lets me specify exactly how many bytes to use for each field when writing.
Answer 1:
Pandas now offers a wide variety of formats that are more stable than tofile(). tofile() is best suited to quick file storage where you do not expect the file to be used on a different machine, where the data may have a different endianness (big-/little-endian); if you do need a portable raw-byte file, the byte order can be pinned in the dtype itself, as shown in the sketch after the table below.
Format Type  Data Description      Reader          Writer
text         CSV                   read_csv        to_csv
text         JSON                  read_json       to_json
text         HTML                  read_html       to_html
text         Local clipboard       read_clipboard  to_clipboard
binary       MS Excel              read_excel      to_excel
binary       HDF5 Format           read_hdf        to_hdf
binary       Feather Format        read_feather    to_feather
binary       Parquet Format        read_parquet    to_parquet
binary       Msgpack               read_msgpack    to_msgpack
binary       Stata                 read_stata      to_stata
binary       SAS                   read_sas
binary       Python Pickle Format  read_pickle     to_pickle
SQL          SQL                   read_sql        to_sql
SQL          Google Big Query      read_gbq        to_gbq
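If you do stay with tofile() and need the file to be readable on machines with a different native byte order, one option (a minimal sketch, not part of the original answer; dt_records_le is a name I made up) is to pin the byte order in the dtype itself, since NumPy type codes accept a '<' (little-endian) or '>' (big-endian) prefix:

import numpy as np

# Same record layout as in the question, but with every field explicitly
# little-endian, so the packed bytes are identical on any machine.
dt_records_le = np.dtype([('ctm', '<i4'),
                          ('open', '<f8'),
                          ('low', '<f8'),
                          ('high', '<f8'),
                          ('close', '<f8'),
                          ('volume', '<f8')])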
For small to medium sized files, I prefer CSV, as properly-formatted CSV can store arbitrary string data, is human readable, and is as dirt-simple as any format can be while achieving the previous two goals.
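A minimal CSV round trip, assuming the df_records frame from the question and a file name of my own choosing:

# Write the frame as text; index=False keeps only the data columns
df_records.to_csv('records.csv', index=False)

# Read it back; numeric columns are parsed back into numeric dtypes
df2 = pd.read_csv('records.csv')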
At one time, I used HDF5, but if I were on Amazon, I would consider using parquet.
Example of using to_hdf:
df.to_hdf('tmp.hdf','df', mode='w')
df2 = pd.read_hdf('tmp.hdf','df')
I no longer favor the HDF5 format. It poses serious risks for long-term archival because it is fairly complex: it has a 150-page specification and only a single 300,000-line C implementation.
In contrast, as long as you are working exclusively in Python, the pickle format claims long term stability:
The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.
However, pickles allow arbitrary code execution so care should be exercised with pickles of unknown origin.
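A minimal pickle round trip, again assuming df_records from the question (the file name is just an example):

# Preserves dtypes exactly, but only unpickle files you trust
df_records.to_pickle('records.pkl')
df2 = pd.read_pickle('records.pkl')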
Answer 2:
It isn't clear to me if the DataFrame is a view or a copy, but assuming it is a copy, you can use the to_records method of the DataFrame. This gives you back a record array that you can then put to disk using tofile.
e.g.
df_records = pd.DataFrame(records)
# do some stuff
new_recarray = df_records.to_records()
new_recarray.tofile("myfile.npy")
The data will reside in memory as packed bytes with the format described by the recarray dtype.
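If the new file must reproduce the original byte layout exactly, one way (a minimal sketch under my own assumptions, not part of the original answer: it reuses header_bytes, dt_records and df_records from the question, and the output file name is made up) is to copy the DataFrame back into a structured array with the same dtype that was used for reading, then write the header bytes followed by the packed records:

# Empty structured array with the exact record layout used for reading
out = np.zeros(len(df_records), dtype=dt_records)

# Copy column by column, matching on field name so nothing is reordered
for name in dt_records.names:
    out[name] = df_records[name].values

# Write the 96-byte header back, then the packed records
with open('out.hst', 'wb') as output_file:
    output_file.write(header_bytes)
    out.tofile(output_file)

Because the dtype fixes both the width and the order of every field, tofile writes exactly those bytes for each record.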
Source: https://stackoverflow.com/questions/26348095/writing-a-formated-binary-file-from-a-pandas-dataframe