Why is it faster to read a file without line breaks?

借酒劲吻你 2021-02-19 15:47

In Python 3.6, it takes longer to read a file if there are line breaks. If I have two files, one with line breaks and one without line breaks (but otherwise they have the same

3 answers
  •  隐瞒了意图╮
    2021-02-19 16:25

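    In text mode, Python translates line endings: on Windows, '\n' is converted to '\r\n' when you write, and '\r\n' is converted back to '\n' when you read. With the default newline=None, Python 3 applies the read-side translation on every platform, not only on Windows.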
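To make the read-side translation concrete, here is a minimal sketch. It assumes Python 3 with the default `newline=None`; the temporary file and the variable names are only for illustration. The bytes are written in binary mode so the line endings on disk are known exactly.

```python
import os
import tempfile

# Write raw bytes so the exact line endings on disk are known.
raw = b"Hello World!\r\n" * 3

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(raw)

    # Default text mode (newline=None): '\r\n' is translated to '\n' on read.
    with open(path, "r", encoding="ascii") as f:
        text = f.read()
finally:
    os.remove(path)

print(len(raw), len(text))  # 42 39
print("\r" in text)         # False
```

The string is three characters shorter than the file because each '\r\n' pair was collapsed to a single '\n'. That scan-and-rewrite step is the extra work text mode has to do.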

    So I did some experimenting. I am on macOS right now, so my "native" line ending is '\n'. I cooked up a test similar to yours, except that it uses non-native Windows line endings:

    sizeMB = 128
    sizeKB = 1024 * sizeMB
    
    with open(r'bigfile_one_line.txt', 'w') as f:
        for i in range(sizeKB):
            f.write('Hello World!!\t'*73)  # 14-char phrase (same length as 'Hello World!\r\n'); roughly 73 phrases in one KB
    
    with open(r'bigfile_newlines.txt', 'w') as f:
        for i in range(sizeKB):
            f.write('Hello World!\r\n'*73)
    

    And the results:

    In [4]: %%timeit
       ...: with open('bigfile_one_line.txt', 'r') as f:
       ...:     text = f.read()
       ...:
    1 loop, best of 3: 141 ms per loop
    
    In [5]: %%timeit
       ...: with open('bigfile_newlines.txt', 'r') as f:
       ...:     text = f.read()
       ...:
    1 loop, best of 3: 543 ms per loop
    
    In [6]: %%timeit
       ...: with open('bigfile_one_line.txt', 'rb') as f:
       ...:     text = f.read()
       ...:
    10 loops, best of 3: 76.1 ms per loop
    
    In [7]: %%timeit
       ...: with open('bigfile_newlines.txt', 'rb') as f:
       ...:     text = f.read()
       ...:
    10 loops, best of 3: 77.4 ms per loop
    

    These are very similar to your results. Note that the performance difference disappears when I open the files in binary mode. OK, what if I use *nix line endings instead?
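Binary mode avoids the slowdown because it returns the bytes untouched. If a `str` is still needed, passing `newline=""` to `open()` also turns off translation on read. The sketch below checks both; the temporary file and the names are made up for the example.

```python
import os
import tempfile

raw = b"Hello World!\r\n" * 3

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(raw)

    # Binary mode: no decoding and no newline handling at all.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with newline="": decoded to str, but '\r\n' is left as-is.
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated = f.read()
finally:
    os.remove(path)

print(as_bytes == raw)                      # True
print(untranslated == raw.decode("ascii"))  # True
```

Whether `newline=""` recovers all of the binary-mode speed is not something the timings above show, so it is worth measuring on your own data.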

    with open(r'bigfile_one_line_nix.txt', 'w') as f:
        for i in range(sizeKB):
            f.write('Hello World!\t'*73)  # There are roughly 73 phrases in one KB
    
    with open(r'bigfile_newlines_nix.txt', 'w') as f:
        for i in range(sizeKB):
            f.write('Hello World!\n'*73)
    

    And the results using these new files:

    In [11]: %%timeit
        ...: with open('bigfile_one_line_nix.txt', 'r') as f:
        ...:     text = f.read()
        ...:
    10 loops, best of 3: 144 ms per loop
    
    In [12]: %%timeit
        ...: with open('bigfile_newlines_nix.txt', 'r') as f:
        ...:     text = f.read()
        ...:
    10 loops, best of 3: 138 ms per loop
    

    Aha! The performance difference disappears! So yes, I think using non-native line-endings impacts performance, which makes sense given the behavior of text-mode.
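For anyone who wants to rerun the comparison outside IPython, here is a rough sketch that uses the standard `timeit` module instead of `%%timeit`. The helper names (`make_file`, `best_read_time`, `run`) and the 1 MB default size are assumptions made for this example, not part of the original test. The files are written in binary mode so the bytes on disk are identical on every platform.

```python
import os
import tempfile
import timeit

PHRASES_PER_KB = 73  # same rough figure as in the scripts above


def make_file(path, phrase, size_kb):
    """Write `phrase` repeatedly in binary mode, so nothing is translated on write."""
    with open(path, "wb") as f:
        for _ in range(size_kb):
            f.write(phrase * PHRASES_PER_KB)


def best_read_time(path, repeats=3, **open_kwargs):
    """Return the best wall-clock time in seconds for reading the whole file."""
    def read_all():
        with open(path, **open_kwargs) as f:
            f.read()
    return min(timeit.repeat(read_all, number=1, repeat=repeats))


def run(size_kb=1024):
    """Time text-mode and binary-mode reads for tab, CRLF and LF separated files."""
    phrases = {
        "tabs": b"Hello World!!\t",   # 14 bytes, no line breaks
        "crlf": b"Hello World!\r\n",  # 14 bytes, Windows line endings
        "lf": b"Hello World!!\n",     # 14 bytes, *nix line endings
    }
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, phrase in phrases.items():
            path = os.path.join(tmp, name + ".txt")
            make_file(path, phrase, size_kb)
            results[name, "text"] = best_read_time(path, mode="r", encoding="ascii")
            results[name, "binary"] = best_read_time(path, mode="rb")
    return results


if __name__ == "__main__":
    for (name, mode), seconds in run().items():
        print(f"{name:5s} {mode:7s} {seconds * 1000:8.2f} ms")
```

Absolute numbers will depend on the machine, the disk cache and the Python version, so only the relative gap between the "crlf" text-mode row and the others is meaningful.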
