In Python 3.6, it takes longer to read a file if there are line breaks. If I have two files, one with line breaks and one without line breaks (but otherwise they have the same
On Windows, opening in text-mode converts '\n' characters to '\r\n' when you write, and the reverse when you read.
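To make that translation concrete, here is a minimal sketch (the scratch file name `crlf_demo.txt` is made up for illustration). As far as I know, with the default `newline=None` Python 3 translates '\r\n' to '\n' on read on every platform, not only on Windows, so the text-mode result comes back shorter than the raw bytes:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demo.
path = os.path.join(tempfile.mkdtemp(), 'crlf_demo.txt')

# Write raw bytes so nothing is translated on the way out.
with open(path, 'wb') as f:
    f.write(b'Hello World!\r\n' * 3)

# Text mode (default newline=None): each '\r\n' is translated to '\n'.
with open(path, 'r') as f:
    text = f.read()

# Binary mode: the bytes come back untouched.
with open(path, 'rb') as f:
    raw = f.read()

print(len(text), len(raw))  # 39 42
```

The three missing characters are the '\r's that text mode dropped while reading.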
So, I did some experimentation. I am on macOS right now, so my "native" line-ending is '\n', and I cooked up a test similar to yours, except using non-native, Windows line-endings:
sizeMB = 128
sizeKB = 1024 * sizeMB
with open(r'bigfile_one_line.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!!\t'*73) # There are roughly 73 phrases in one KB
with open(r'bigfile_newlines.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\r\n'*73)
And the results:
In [4]: %%timeit
   ...: with open('bigfile_one_line.txt', 'r') as f:
   ...:     text = f.read()
   ...:
1 loop, best of 3: 141 ms per loop
In [5]: %%timeit
   ...: with open('bigfile_newlines.txt', 'r') as f:
   ...:     text = f.read()
   ...:
1 loop, best of 3: 543 ms per loop
In [6]: %%timeit
   ...: with open('bigfile_one_line.txt', 'rb') as f:
   ...:     text = f.read()
   ...:
10 loops, best of 3: 76.1 ms per loop
In [7]: %%timeit
   ...: with open('bigfile_newlines.txt', 'rb') as f:
   ...:     text = f.read()
   ...:
10 loops, best of 3: 77.4 ms per loop
Very similar to yours, and note that the performance difference disappears when I open in binary mode. OK, what if I use *nix line-endings instead?
with open(r'bigfile_one_line_nix.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\t'*73) # There are roughly 73 phrases in one KB
with open(r'bigfile_newlines_nix.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\n'*73)
And the results using these new files:
In [11]: %%timeit
    ...: with open('bigfile_one_line_nix.txt', 'r') as f:
    ...:     text = f.read()
    ...:
10 loops, best of 3: 144 ms per loop
In [12]: %%timeit
    ...: with open('bigfile_newlines_nix.txt', 'r') as f:
    ...:     text = f.read()
    ...:
10 loops, best of 3: 138 ms per loop
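Aha! The performance difference disappears! So yes, I think using non-native line-endings impacts performance, which makes sense given the behavior of text-mode: every '\r\n' has to be found and translated to '\n' on read, whereas a file that already uses '\n' needs no rewriting.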
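If you need text mode but want to skip the translation, passing `newline=''` to `open` should do it: line endings then come through untranslated. Here is a small sketch of that. The helper name `read_untranslated` and the scratch file are made up for illustration, and I have not timed this variant, so treat any speed-up as something to measure yourself:

```python
import os
import tempfile

def read_untranslated(path, encoding='utf-8'):
    """Read a text file with newline translation disabled (hypothetical helper)."""
    # newline='' leaves '\r\n' as-is instead of converting it to '\n'.
    with open(path, 'r', newline='', encoding=encoding) as f:
        return f.read()

# Scratch file with Windows line-endings, just for the demo.
demo_path = os.path.join(tempfile.mkdtemp(), 'crlf_demo.txt')
with open(demo_path, 'wb') as f:
    f.write(b'Hello World!\r\n' * 3)

untranslated = read_untranslated(demo_path)
print(repr(untranslated[:14]))  # 'Hello World!\r\n'
```

The trade-off is that your own code then sees the '\r' characters, so anything that splits on '\n' alone will leave a trailing '\r' on each line.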