Question
I've seen a few ways to read a formatted binary file in Python into Pandas. Specifically, I'm using the code below, which reads the file with NumPy's fromfile using a structure described by a dtype.
import numpy as np
import pandas as pd

input_file_name = 'test.hst'
input_file = open(input_file_name, 'rb')

# Fixed-size 96-byte header, described field by field with a structured dtype
header_bytes = input_file.read(96)
dt_header = np.dtype([('version', 'i4'),
                      ('copyright', 'S64'),
                      ('symbol', 'S12'),
                      ('period', 'i4'),
                      ('digits', 'i4'),
                      ('timesign', 'i4'),
                      ('last_sync', 'i4')])
# np.fromstring is deprecated for binary data; np.frombuffer is its replacement
header = np.frombuffer(header_bytes, dt_header)

# The remaining bytes are fixed-size records, read until end of file
dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])
records = np.fromfile(input_file, dt_records)
input_file.close()

df_records = pd.DataFrame(records)
# Now, do some changes in the individual values of df_records
# and then write it back to a binary file
Now, my issue is how to write this back to a new file. I can't find any function in NumPy (or Pandas) that lets me specify exactly how many bytes to use for each field when writing.
Answer 1:
Pandas now offers a wide variety of formats that are more stable than tofile(). tofile() is best suited to quick file storage where you do not expect the file to be used on a different machine, where the data may have a different endianness (big-/little-endian); if you do need a portable raw-byte file, the byte order can be pinned in the dtype itself, as shown in the sketch after the table below.
Format Type  Data Description      Reader          Writer
text         CSV                   read_csv        to_csv
text         JSON                  read_json       to_json
text         HTML                  read_html       to_html
text         Local clipboard       read_clipboard  to_clipboard
binary       MS Excel              read_excel      to_excel
binary       HDF5 Format           read_hdf        to_hdf
binary       Feather Format        read_feather    to_feather
binary       Parquet Format        read_parquet    to_parquet
binary       Msgpack               read_msgpack    to_msgpack
binary       Stata                 read_stata      to_stata
binary       SAS                   read_sas
binary       Python Pickle Format  read_pickle     to_pickle
SQL          SQL                   read_sql        to_sql
SQL          Google Big Query      read_gbq        to_gbq
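If you do stay with tofile() and need the file to be readable on machines with a different native byte order, one option (a minimal sketch, not part of the original answer; dt_records_le is a name I made up) is to pin the byte order in the dtype itself, since NumPy type codes accept a '<' (little-endian) or '>' (big-endian) prefix:

import numpy as np

# Same record layout as in the question, but with every field explicitly
# little-endian, so the packed bytes are identical on any machine.
dt_records_le = np.dtype([('ctm', '<i4'),
                          ('open', '<f8'),
                          ('low', '<f8'),
                          ('high', '<f8'),
                          ('close', '<f8'),
                          ('volume', '<f8')])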
For small to medium sized files, I prefer CSV, as properly-formatted CSV can store arbitrary string data, is human readable, and is as dirt-simple as any format can be while achieving the previous two goals.
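A minimal CSV round trip, assuming the df_records frame from the question and a file name of my own choosing:

# Write the frame as text; index=False keeps only the data columns
df_records.to_csv('records.csv', index=False)

# Read it back; numeric columns are parsed back into numeric dtypes
df2 = pd.read_csv('records.csv')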
At one time, I used HDF5, but if I were on Amazon, I would consider using parquet.
Example of using to_hdf:
df.to_hdf('tmp.hdf','df', mode='w')
df2 = pd.read_hdf('tmp.hdf','df')
I no longer favor the HDF5 format. It poses serious risks for long-term archival because it is fairly complex: it has a 150-page specification and only a single 300,000-line C implementation.
In contrast, as long as you are working exclusively in Python, the pickle format claims long term stability:
The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.
However, pickles allow arbitrary code execution so care should be exercised with pickles of unknown origin.
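A minimal pickle round trip, again assuming df_records from the question (the file name is just an example):

# Preserves dtypes exactly, but only unpickle files you trust
df_records.to_pickle('records.pkl')
df2 = pd.read_pickle('records.pkl')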
Answer 2:
It isn't clear to me if the DataFrame is a view or a copy, but assuming it is a copy, you can use the to_records method of the DataFrame. This gives you back a record array that you can then put to disk using tofile.
e.g.
df_records = pd.DataFrame(records)
# do some stuff
new_recarray = df_records.to_records()
new_recarray.tofile("myfile.npy")
The data will reside in memory as packed bytes with the format described by the recarray dtype.
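If the new file must reproduce the original byte layout exactly, one way (a minimal sketch under my own assumptions, not part of the original answer: it reuses header_bytes, dt_records and df_records from the question, and the output file name is made up) is to copy the DataFrame back into a structured array with the same dtype that was used for reading, then write the header bytes followed by the packed records:

# Empty structured array with the exact record layout used for reading
out = np.zeros(len(df_records), dtype=dt_records)

# Copy column by column, matching on field name so nothing is reordered
for name in dt_records.names:
    out[name] = df_records[name].values

# Write the 96-byte header back, then the packed records
with open('out.hst', 'wb') as output_file:
    output_file.write(header_bytes)
    out.tofile(output_file)

Because the dtype fixes both the width and the order of every field, tofile writes exactly those bytes for each record.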
Source: https://stackoverflow.com/questions/26348095/writing-a-formated-binary-file-from-a-pandas-dataframe