Question
I have a large binary file that I want to read into an array. The format of the binary file is:
- each row starts and ends with 4 bytes of extra data that I'm not using;
- in between are 8-byte values.
I'm doing it like this:
# nlines - number of rows in the binary file
# ncols - number of values to read from a row
fidbin = open('toto.mda', 'rb')  # open the file
temp = fidbin.read(4)  # skip the first 4 bytes
nvalues = nlines * ncols  # total number of values
array = np.zeros(nvalues, dtype=np.float64)  # np.float is deprecated; use np.float64
# read ncols values per row and skip the useless data at the end
for c in range(int(nlines)):  # read the nlines of the *.mda file
    matrix = np.fromfile(fidbin, np.float64, count=int(ncols))  # read all the values from one row
    Indice_start = c * ncols
    array[Indice_start:Indice_start + ncols] = matrix
    fidbin.seek(fidbin.tell() + 8)  # skip 8 bytes: 4 at the end of this row + 4 at the start of the next
fidbin.close()
It works well, but it is very slow for large binary files. Is there a way to speed up reading the binary file?
Answer 1:
You can use a structured data type and read the file with a single call to numpy.fromfile. For example, my file qaz.mda has five columns of floating-point values between the four-byte markers at the start and end of each row. Here's how you can create a structured data type and read the data.
First, create a data type that matches the format of each row:
In [547]: ncols = 5
In [548]: dt = np.dtype([('pre', np.int32), ('data', np.float64, ncols), ('post', np.int32)])
Read the file into a structured array:
In [549]: a = np.fromfile("qaz.mda", dtype=dt)
In [550]: a
Out[550]:
array([(1, [0.0, 1.0, 2.0, 3.0, 4.0], 0),
       (2, [5.0, 6.0, 7.0, 8.0, 9.0], 0),
       (3, [10.0, 11.0, 12.0, 13.0, 14.0], 0),
       (4, [15.0, 16.0, 17.0, 18.0, 19.0], 0),
       (5, [20.0, 21.0, 22.0, 23.0, 24.0], 0)],
      dtype=[('pre', '<i4'), ('data', '<f8', (5,)), ('post', '<i4')])
Pull out just the data that we want:
In [551]: data = a['data']
In [552]: data
Out[552]:
array([[  0.,   1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.]])
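As a self-contained sketch of the same approach (the file name and its contents here are invented for illustration, not taken from the question), one can generate a small file in this marker/data/marker layout and read it back with the structured dtype in a single np.fromfile call:

```python
import os
import tempfile
import numpy as np

ncols, nlines = 5, 4

# Structured dtype matching one row: 4-byte marker, ncols float64 values, 4-byte marker
dt = np.dtype([('pre', np.int32), ('data', np.float64, ncols), ('post', np.int32)])

# Build a sample file in that format (hypothetical data, for illustration only)
rows = np.zeros(nlines, dtype=dt)
rows['pre'] = np.arange(1, nlines + 1)
rows['data'] = np.arange(nlines * ncols, dtype=np.float64).reshape(nlines, ncols)

path = os.path.join(tempfile.gettempdir(), 'demo.mda')  # hypothetical file name
rows.tofile(path)

# Read the whole file back in one call and pull out the payload
a = np.fromfile(path, dtype=dt)
data = a['data']
print(data.shape)  # (4, 5)
```

Because the markers are part of the dtype, no per-row seeking is needed; the single read replaces the entire Python-level loop.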
You could also experiment with numpy.memmap to see if it improves performance:
In [563]: a = np.memmap("qaz.mda", dtype=dt)
In [564]: a
Out[564]:
memmap([(1, [0.0, 1.0, 2.0, 3.0, 4.0], 0),
        (2, [5.0, 6.0, 7.0, 8.0, 9.0], 0),
        (3, [10.0, 11.0, 12.0, 13.0, 14.0], 0),
        (4, [15.0, 16.0, 17.0, 18.0, 19.0], 0),
        (5, [20.0, 21.0, 22.0, 23.0, 24.0], 0)],
       dtype=[('pre', '<i4'), ('data', '<f8', (5,)), ('post', '<i4')])
In [565]: data = a['data']
In [566]: data
Out[566]:
memmap([[  0.,   1.,   2.,   3.,   4.],
        [  5.,   6.,   7.,   8.,   9.],
        [ 10.,  11.,  12.,  13.,  14.],
        [ 15.,  16.,  17.,  18.,  19.],
        [ 20.,  21.,  22.,  23.,  24.]])
Note that data above is still a memory-mapped array. To ensure that the data is copied to an array in memory, numpy.copy can be used:
In [567]: data = np.copy(a['data'])
In [568]: data
Out[568]:
array([[  0.,   1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.]])
Whether or not that is necessary depends on how you will use the array in the rest of your code.
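The view-versus-copy distinction can be checked directly: a field extracted from a memmap is still a memmap subclass backed by the file, while np.copy returns a plain in-memory ndarray. A minimal sketch (the file name and data are hypothetical, created just for this check):

```python
import os
import tempfile
import numpy as np

ncols = 5
dt = np.dtype([('pre', np.int32), ('data', np.float64, ncols), ('post', np.int32)])

# Hypothetical two-row sample file in the marker/data/marker layout
rows = np.zeros(2, dtype=dt)
rows['data'] = np.arange(10, dtype=np.float64).reshape(2, ncols)
path = os.path.join(tempfile.gettempdir(), 'memmap_demo.mda')
rows.tofile(path)

m = np.memmap(path, dtype=dt, mode='r')
view = m['data']             # still memory-mapped: a view on the mapped buffer
copied = np.copy(m['data'])  # an independent in-memory array

print(isinstance(view, np.memmap))    # True
print(isinstance(copied, np.memmap))  # False
```

If the array will outlive the open file, or will be modified freely, the copy is the safer choice; for read-only streaming over a huge file, keeping the memmap view avoids loading everything into RAM at once.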
Source: https://stackoverflow.com/questions/37119687/improve-speed-when-reading-a-binary-file