I've got a program that reads in 3 strings per line for 50000. It then does other things. The part that reads the file and converts to integers is taking 80% of the total r
If the file is in the OS cache, then parsing it takes milliseconds on my machine:
    name                                  time    ratio  comment
    read_read                         145 usec     1.00  big.txt
    read_readtxt                     2.07 msec    14.29  big.txt
    read_readlines                   4.94 msec    34.11  big.txt
    read_james_otigo                 29.3 msec   201.88  big.txt
    read_james_otigo_with_int_float  82.9 msec   571.70  big.txt
    read_map_local                   93.1 msec   642.23  big.txt
    read_map                         95.6 msec   659.57  big.txt
    read_numpy_loadtxt                321 msec  2213.66  big.txt
Where the read_*() functions are:
    def read_read(filename):
        # raw bytes, no decoding or newline translation
        with open(filename, 'rb') as file:
            data = file.read()

    def read_readtxt(filename):
        # 'rU' is the Python 2 universal-newlines mode; on Python 3 use plain 'r'
        with open(filename, 'rU') as file:
            text = file.read()

    def read_readlines(filename):
        with open(filename, 'rU') as file:
            lines = file.readlines()
    def read_james_otigo(filename):
        file = open (filename).readlines()
        for line in file[1:]:
            label1, label2, edge = line.strip().split()

    def read_james_otigo_with_int_float(filename):
        file = open (filename).readlines()
        for line in file[1:]:
            label1, label2, edge = line.strip().split()
            label1 = int(label1); label2 = int(label2); edge = float(edge)
    def read_map(filename):
        with open(filename) as file:
            L = [(int(l1), int(l2), float(edge))
                 for line in file
                 for l1, l2, edge in [line.split()] if line.strip()]

    # same as read_map, but int/float are bound as default arguments
    # so they are looked up as locals instead of globals
    def read_map_local(filename, _i=int, _f=float):
        with open(filename) as file:
            L = [(_i(l1), _i(l2), _f(edge))
                 for line in file
                 for l1, l2, edge in [line.split()] if line.strip()]
    import numpy as np

    def read_numpy_loadtxt(filename):
        a = np.loadtxt(filename, dtype=[('label1', 'i'),
                                        ('label2', 'i'),
                                        ('edge', 'f')])
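The benchmark script itself (read-array.py) is not reproduced here. As a rough sketch of how a table like the one above could be produced, the snippet below times each reader with the standard `timeit` module and reports the ratio to the fastest one. The helpers `best_time` and `compare` and the sample reader `read_parse` are names made up for this illustration; they are not taken from the original script.

```python
import os
import tempfile
import timeit


def read_read(filename):
    # Baseline: raw bytes, no parsing.
    with open(filename, 'rb') as file:
        return file.read()


def read_parse(filename):
    # Hypothetical reader: split each non-blank line and convert the fields.
    with open(filename) as file:
        return [(int(l1), int(l2), float(edge))
                for l1, l2, edge in (line.split() for line in file
                                     if line.strip())]


def best_time(func, filename, repeat=3, number=5):
    # Best of `repeat` runs, expressed as seconds per single call.
    timings = timeit.repeat(lambda: func(filename),
                            repeat=repeat, number=number)
    return min(timings) / number


def compare(funcs, filename):
    # Returns (name, seconds, ratio to fastest), fastest first.
    results = sorted((best_time(f, filename), f.__name__) for f in funcs)
    fastest = results[0][0]
    return [(name, seconds, seconds / fastest) for seconds, name in results]


if __name__ == '__main__':
    # Small throwaway input so the sketch runs without big.txt.
    path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
    with open(path, 'w') as file:
        file.write('150 952 0.355\n' * 1000)
    for name, seconds, ratio in compare([read_read, read_parse], path):
        print('%-12s %10.1f usec %8.2f' % (name, seconds * 1e6, ratio))
```

Taking the minimum over several repeats reduces noise from other processes, and the warm-up effect of the first run means the file is likely served from the OS cache in later runs.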
And big.txt is generated using:
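    #!/usr/bin/env python
    import numpy as np

    n = 50000
    # random_integers is deprecated in newer NumPy;
    # np.random.randint(0, (1<<10) + 1, size=2*n) draws from the same range
    a = np.random.random_integers(low=0, high=1<<10, size=2*n).reshape(-1, 2)
    np.savetxt('big.txt', np.c_[a, np.random.rand(n)], fmt='%i %i %s')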
It produces 50000 lines:
    150 952 0.355243621018
    582 98 0.227592557278
    478 409 0.546382780254
    46 879 0.177980983303
    ...
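If NumPy is not available, a file in the same format (two integer labels and a float per line) can be written with the standard library alone. This is only a sketch: `write_edge_file` and its parameters are hypothetical names, and the floats are written with Python's default formatting, so the number of digits may differ from the sample above.

```python
import os
import random
import tempfile


def write_edge_file(filename, n=50000, max_label=1 << 10, seed=None):
    # Each line: "<label1> <label2> <edge>", labels in [0, max_label],
    # edge in [0.0, 1.0).
    rng = random.Random(seed)
    with open(filename, 'w') as file:
        for _ in range(n):
            label1 = rng.randint(0, max_label)  # inclusive on both ends
            label2 = rng.randint(0, max_label)
            edge = rng.random()
            file.write('%d %d %s\n' % (label1, label2, edge))


if __name__ == '__main__':
    # Write a tiny sample to a temporary location and show it.
    path = os.path.join(tempfile.gettempdir(), 'big-sample.txt')
    write_edge_file(path, n=5, seed=0)
    with open(path) as file:
        print(file.read())
```

Passing a fixed `seed` makes the generated file reproducible between benchmark runs.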
To reproduce results, download the code and run:
    # write big.txt
    python generate-file.py
    # run benchmark
    python read-array.py