How to read a file in reverse order?

礼貌的吻别 2020-11-22 04:51

How can I read a file in reverse order using Python? I want to read it from the last line to the first.

21 answers
  • 2020-11-22 05:52

    You can also use the Python module file_read_backwards.

    After installing it via pip install file_read_backwards (v1.2.1), you can read the entire file backwards (line by line) in a memory-efficient manner:

    #!/usr/bin/env python2.7
    
    from file_read_backwards import FileReadBackwards
    
    with FileReadBackwards("/path/to/file", encoding="utf-8") as frb:
        for l in frb:
            print(l)
    

    It supports the "utf-8", "latin-1", and "ascii" encodings.

    Python 3 is also supported. Further documentation can be found at http://file-read-backwards.readthedocs.io/en/latest/readme.html

  • 2020-11-22 05:52

    The accepted answer won't work for large files that don't fit in memory (which is not a rare case).
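
    For context, the in-memory approach being criticized presumably looks something like the sketch below. This is an assumption, since the accepted answer is not quoted here, and the helper name read_reversed_in_memory is hypothetical. It builds a list of every line before reversing, so memory use grows with the file size.

```python
import os
import tempfile


def read_reversed_in_memory(path, encoding="utf-8"):
    """Return all lines of the file, last line first (hypothetical helper)."""
    with open(path, encoding=encoding) as file:
        # list(file) materializes every line at once:
        # simple, but memory use is proportional to the file size
        return list(reversed(list(file)))


# Small demonstration on a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first\nsecond\nthird\n")
demo_lines = read_reversed_in_memory(tmp.name)
os.remove(tmp.name)
print(demo_lines)
```

    This is fine for small files; the rest of this answer is about avoiding that full list.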

    As others have noted, @srohde's answer looks good, but it has the following issues:

    • opening the file looks redundant when we could pass a file object and let the user decide which encoding it should be read in,
    • even if we refactor it to accept a file object, it won't work for all encodings: take a utf-8 encoded file with non-ASCII contents such as

      й
      

      pass buf_size equal to 1, and we get

      UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: invalid start byte
      

      of course the text may be larger, but buf_size can still happen to split a multi-byte character, leading to an obscure error like the one above,

    • we can't specify a custom line separator,
    • we can't choose to keep the line separator.
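
    The encoding issue from the list above can be reproduced without any file at all. The sketch below only uses built-in str/bytes methods; the variable names are illustrative.

```python
encoded = "й".encode("utf-8")  # two bytes: b'\xd0\xb9'

# A 1-byte buffer read from the end sees only the continuation byte
last_byte = encoded[-1:]
try:
    last_byte.decode("utf-8")
except UnicodeDecodeError as error:
    decode_error = error
    print(error)
else:
    decode_error = None

# Decoding only after all bytes of the character are available works
print(encoded.decode("utf-8"))
```

    This is why it is safer to split into lines at the byte level and decode only complete lines.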

    Considering all these concerns, I've written two separate functions:

    • one that works with byte streams,
    • a second one that works with text streams, delegates its underlying byte stream to the first one, and decodes the resulting lines.
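
    The two-layer design can be sketched in a much simplified form. This is not the implementation given in this answer: the names reversed_byte_lines and reversed_text_lines are hypothetical, the separator is fixed to b"\n" (so encodings in which a newline is not the single byte 0x0A, such as UTF-16, are out of scope), and separators are always dropped.

```python
import io
import os


def reversed_byte_lines(byte_stream, batch_size=4096):
    """Yield lines of a seekable binary stream, last first, without b'\\n'."""
    position = byte_stream.seek(0, os.SEEK_END)
    tail = None  # start of a line whose beginning lies in an earlier batch
    while position > 0:
        size = min(batch_size, position)
        position -= size
        byte_stream.seek(position)
        batch = byte_stream.read(size)
        if tail is None:
            # Very first batch (end of file): ignore one trailing newline
            if batch.endswith(b"\n"):
                batch = batch[:-1]
            tail = b""
        first, *rest = (batch + tail).split(b"\n")
        yield from reversed(rest)
        tail = first  # may be incomplete, finished by the next batch
    if tail is not None:
        yield tail


def reversed_text_lines(file, batch_size=4096):
    """Decode complete lines only, so multi-byte characters are never cut."""
    for line in reversed_byte_lines(file.buffer, batch_size=batch_size):
        yield line.decode(file.encoding)


text_file = io.TextIOWrapper(io.BytesIO("один\nдва\nтри\n".encode("utf-8")),
                             encoding="utf-8")
print(list(reversed_text_lines(text_file, batch_size=1)))
```

    Even with a batch size of one byte, decoding happens only on whole lines, so the error shown earlier cannot occur.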

    First, let's define the following utility functions:

    ceil_division for division with ceiling (in contrast to the standard // division, which floors; more info can be found in this thread):

    def ceil_division(left_number, right_number):
        """
        Divides given numbers with ceiling.
        """
        return -(-left_number // right_number)
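
    A quick check of the negate-floor-negate trick, repeating the function so the snippet runs on its own. The sample numbers are arbitrary.

```python
def ceil_division(left_number, right_number):
    """Divides given numbers with ceiling."""
    return -(-left_number // right_number)


# 10 bytes read in batches of 4 need 3 batches (4 + 4 + 2)
print(ceil_division(10, 4))
# Exact multiples are not rounded up any further
print(ceil_division(8, 4))
# Integer-only arithmetic stays exact for very large sizes
print(ceil_division(10 ** 30 + 1, 10 ** 15) == 10 ** 15 + 1)
```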
    

    split for splitting a string by a given separator, with the option of keeping the separator at the end of each part:

    def split(string, separator, keep_separator):
        """
        Splits given string by given separator.
        """
        parts = string.split(separator)
        if keep_separator:
            *parts, last_part = parts
            parts = [part + separator for part in parts]
            if last_part:
                return parts + [last_part]
        return parts
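
    To see what keep_separator changes, here is the same function applied to a few small inputs (repeated so the snippet is self-contained). It works for both str and bytes, since both support split and concatenation.

```python
def split(string, separator, keep_separator):
    """Splits given string by given separator."""
    parts = string.split(separator)
    if keep_separator:
        *parts, last_part = parts
        parts = [part + separator for part in parts]
        if last_part:
            return parts + [last_part]
    return parts


print(split("a,b,c", ",", keep_separator=True))
print(split("a,b,", ",", keep_separator=True))
print(split("a,b,", ",", keep_separator=False))
print(split(b"x\ny", b"\n", keep_separator=True))
```

    Note that with keep_separator=False a trailing separator leaves an empty last part, exactly as str.split does.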
    

    read_batch_from_end for reading a batch that ends at a given position of a binary stream:

    def read_batch_from_end(byte_stream, size, end_position):
        """
        Reads batch from the end of given byte stream.
        """
        if end_position > size:
            offset = end_position - size
        else:
            offset = 0
            size = end_position
        byte_stream.seek(offset)
        return byte_stream.read(size)
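
    A small illustration with an in-memory stream (io.BytesIO stands in for a real file here). It shows how the batch is clipped once fewer than size bytes remain before end_position.

```python
import io


def read_batch_from_end(byte_stream, size, end_position):
    """Reads batch from the end of given byte stream."""
    if end_position > size:
        offset = end_position - size
    else:
        offset = 0
        size = end_position
    byte_stream.seek(offset)
    return byte_stream.read(size)


stream = io.BytesIO(b"0123456789")
print(read_batch_from_end(stream, size=4, end_position=10))  # last 4 bytes
print(read_batch_from_end(stream, size=4, end_position=6))   # the 4 before
print(read_batch_from_end(stream, size=4, end_position=2))   # only 2 left
```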
    

    After that, we can define a function for reading a byte stream in reverse order:

    import functools
    import itertools
    import os
    from operator import methodcaller, sub
    
    
    def reverse_binary_stream(byte_stream, batch_size=None,
                              lines_separator=None,
                              keep_lines_separator=True):
        if lines_separator is None:
            lines_separator = (b'\r', b'\n', b'\r\n')
            lines_splitter = methodcaller(str.splitlines.__name__,
                                          keep_lines_separator)
        else:
            lines_splitter = functools.partial(split,
                                               separator=lines_separator,
                                               keep_separator=keep_lines_separator)
        stream_size = byte_stream.seek(0, os.SEEK_END)
        if batch_size is None:
            batch_size = stream_size or 1
        batches_count = ceil_division(stream_size, batch_size)
        remaining_bytes_indicator = itertools.islice(
                itertools.accumulate(itertools.chain([stream_size],
                                                     itertools.repeat(batch_size)),
                                     sub),
                batches_count)
        try:
            remaining_bytes_count = next(remaining_bytes_indicator)
        except StopIteration:
            return
    
        def read_batch(position):
            result = read_batch_from_end(byte_stream,
                                         size=batch_size,
                                         end_position=position)
            while result.startswith(lines_separator):
                try:
                    position = next(remaining_bytes_indicator)
                except StopIteration:
                    break
                result = (read_batch_from_end(byte_stream,
                                              size=batch_size,
                                              end_position=position)
                          + result)
            return result
    
        batch = read_batch(remaining_bytes_count)
        segment, *lines = lines_splitter(batch)
        yield from reversed(lines)
        for remaining_bytes_count in remaining_bytes_indicator:
            batch = read_batch(remaining_bytes_count)
            lines = lines_splitter(batch)
            if batch.endswith(lines_separator):
                yield segment
            else:
                lines[-1] += segment
            segment, *lines = lines
            yield from reversed(lines)
        yield segment
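
    The remaining_bytes_indicator expression is the least obvious part of the function above. The sketch below isolates it in a hypothetical helper, batch_end_positions, to show which end positions it produces: it starts from the stream size and keeps subtracting the batch size.

```python
import itertools
from operator import sub


def batch_end_positions(stream_size, batch_size):
    """End offsets of the batches, from the last batch to the first."""
    batches_count = -(-stream_size // batch_size)  # ceiling division
    return list(itertools.islice(
        itertools.accumulate(
            itertools.chain([stream_size], itertools.repeat(batch_size)),
            sub),
        batches_count))


print(batch_end_positions(10, 4))
print(batch_end_positions(8, 4))
print(batch_end_positions(0, 1))
```

    For an empty stream the list is empty, which corresponds to the early return on StopIteration.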
    

    Finally, a function for reversing a text file can be defined like this:

    import codecs
    
    
    def reverse_file(file, batch_size=None, 
                     lines_separator=None,
                     keep_lines_separator=True):
        encoding = file.encoding
        if lines_separator is not None:
            lines_separator = lines_separator.encode(encoding)
        yield from map(functools.partial(codecs.decode,
                                         encoding=encoding),
                       reverse_binary_stream(
                               file.buffer,
                               batch_size=batch_size,
                               lines_separator=lines_separator,
                               keep_lines_separator=keep_lines_separator))
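
    reverse_file relies on two attributes of text file objects, encoding and buffer, plus codecs.decode. The sketch below shows them on an in-memory io.TextIOWrapper, which is used here as a stand-in for a file returned by open in text mode.

```python
import codecs
import functools
import io

text_file = io.TextIOWrapper(io.BytesIO("й\n".encode("utf-8")),
                             encoding="utf-8")

# The encoding the text layer was opened with
print(text_file.encoding)

# The underlying binary stream yields undecoded bytes
raw_line = text_file.buffer.readline()
print(raw_line)

# The same kind of decoder that reverse_file maps over the byte lines
decode_line = functools.partial(codecs.decode, encoding=text_file.encoding)
print(decode_line(raw_line))
```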
    

    Tests

    Preparations

    I've generated 4 files using the fsutil command:

    1. empty.txt with no contents (0 MB)
    2. tiny.txt with a size of 1 MB
    3. small.txt with a size of 10 MB
    4. large.txt with a size of 50 MB

    I've also refactored @srohde's solution to work with a file object instead of a file path.

    Test script

    from timeit import Timer
    
    repeats_count = 7
    number = 1
    create_setup = ('from collections import deque\n'
                    'from __main__ import reverse_file, reverse_readline\n'
                    'file = open("{}")').format
    srohde_solution = ('with file:\n'
                       '    deque(reverse_readline(file,\n'
                       '                           buf_size=8192),'
                       '          maxlen=0)')
    azat_ibrakov_solution = ('with file:\n'
                             '    deque(reverse_file(file,\n'
                             '                       lines_separator="\\n",\n'
                             '                       keep_lines_separator=False,\n'
                             '                       batch_size=8192), maxlen=0)')
    print('reversing empty file by "srohde"',
          min(Timer(srohde_solution,
                    create_setup('empty.txt')).repeat(repeats_count, number)))
    print('reversing empty file by "Azat Ibrakov"',
          min(Timer(azat_ibrakov_solution,
                    create_setup('empty.txt')).repeat(repeats_count, number)))
    print('reversing tiny file (1MB) by "srohde"',
          min(Timer(srohde_solution,
                    create_setup('tiny.txt')).repeat(repeats_count, number)))
    print('reversing tiny file (1MB) by "Azat Ibrakov"',
          min(Timer(azat_ibrakov_solution,
                    create_setup('tiny.txt')).repeat(repeats_count, number)))
    print('reversing small file (10MB) by "srohde"',
          min(Timer(srohde_solution,
                    create_setup('small.txt')).repeat(repeats_count, number)))
    print('reversing small file (10MB) by "Azat Ibrakov"',
          min(Timer(azat_ibrakov_solution,
                    create_setup('small.txt')).repeat(repeats_count, number)))
    print('reversing large file (50MB) by "srohde"',
          min(Timer(srohde_solution,
                    create_setup('large.txt')).repeat(repeats_count, number)))
    print('reversing large file (50MB) by "Azat Ibrakov"',
          min(Timer(azat_ibrakov_solution,
                    create_setup('large.txt')).repeat(repeats_count, number)))
    

    Note: I've used the collections.deque class to exhaust the generator.
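
    For readers unfamiliar with this idiom: a deque with maxlen=0 consumes every item of an iterable while storing none of them. A minimal sketch, using a hypothetical generator that records what it produced:

```python
from collections import deque


def counting_generator(limit, log):
    """Hypothetical generator that records each value it produces."""
    for number in range(limit):
        log.append(number)
        yield number


produced = []
leftover = deque(counting_generator(5, produced), maxlen=0)
print(produced)       # every value was generated
print(len(leftover))  # but nothing was kept
```

    This keeps the benchmark from measuring the cost of building a result list.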

    Outputs

    For PyPy 3.5 on Windows 10:

    reversing empty file by "srohde" 8.31e-05
    reversing empty file by "Azat Ibrakov" 0.00016090000000000028
    reversing tiny file (1MB) by "srohde" 0.160081
    reversing tiny file (1MB) by "Azat Ibrakov" 0.09594989999999998
    reversing small file (10MB) by "srohde" 8.8891863
    reversing small file (10MB) by "Azat Ibrakov" 5.323388100000001
    reversing large file (50MB) by "srohde" 186.5338368
    reversing large file (50MB) by "Azat Ibrakov" 99.07450229999998
    

    For CPython 3.5 on Windows 10:

    reversing empty file by "srohde" 3.600000000000001e-05
    reversing empty file by "Azat Ibrakov" 4.519999999999958e-05
    reversing tiny file (1MB) by "srohde" 0.01965560000000001
    reversing tiny file (1MB) by "Azat Ibrakov" 0.019207699999999994
    reversing small file (10MB) by "srohde" 3.1341862999999996
    reversing small file (10MB) by "Azat Ibrakov" 3.0872588000000007
    reversing large file (50MB) by "srohde" 82.01206720000002
    reversing large file (50MB) by "Azat Ibrakov" 82.16775059999998
    

    As we can see, on CPython it performs about the same as the original solution (and faster on PyPy), while being more general and free of the disadvantages listed above.


    Advertisement

    I've added this to version 0.3.0 of the lz package (requires Python 3.5+), which has many well-tested functional/iterating utilities.

    It can be used like this:

     import io
     from lz.iterating import reverse
     ...
     with open('path/to/file') as file:
         for line in reverse(file, batch_size=io.DEFAULT_BUFFER_SIZE):
             print(line)
    

    It supports all standard encodings (except perhaps utf-7, since it is hard for me to define a strategy for generating strings encodable with it).

  • 2020-11-22 05:54

    Here is my implementation. You can limit RAM usage by changing the "buffer" variable. There is a known bug: the program prints an empty line at the beginning.

    RAM usage may also increase if there are no newlines for more than buffer bytes: the "line_leak" variable will keep growing until a newline ("\n") is seen.

    This also works for 16 GB files, which is bigger than my total memory.

    import os
    import sys

    buffer = 1024 * 1024  # chunk size in bytes (1 MB)
    # Python 3 port: binary mode, because text-mode files do not
    # support seeking relative to the end of the file
    f = open(sys.argv[1], 'rb')
    f.seek(0, os.SEEK_END)
    filesize = f.tell()

    division, remainder = divmod(filesize, buffer)
    line_leak = b''

    for chunk_counter in range(1, division + 2):
        if division - chunk_counter < 0:
            # last iteration: read what is left at the start of the file
            f.seek(0, os.SEEK_SET)
            chunk = f.read(remainder)
        else:
            f.seek(-(buffer * chunk_counter), os.SEEK_END)
            chunk = f.read(buffer)

        chunk_lines_reversed = list(reversed(chunk.split(b'\n')))
        if line_leak:  # add line_leak from previous chunk to beginning
            chunk_lines_reversed[0] += line_leak

        # after reversing, save the leaked line for the next chunk iteration
        line_leak = chunk_lines_reversed.pop()

        if chunk_lines_reversed:
            sys.stdout.buffer.write(b'\n'.join(chunk_lines_reversed) + b'\n')
        # print the last leaked line
        if division - chunk_counter < 0:
            sys.stdout.buffer.write(line_leak + b'\n')
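
    The empty line at the beginning comes from how split treats a trailing newline. The sketch below reproduces it on a short string; the fix shown at the end is only one possible approach and is an assumption, not part of the code above.

```python
content = "first\nsecond\n"

# A trailing newline produces an empty last piece
pieces = content.split("\n")
print(pieces)

# Reversed, the empty piece comes first and is printed as a blank line
print(list(reversed(pieces)))

# One possible fix (an assumption, not part of the answer's code):
# drop a single trailing newline before splitting the last chunk
trimmed = content[:-1] if content.endswith("\n") else content
print(list(reversed(trimmed.split("\n"))))
```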
    