Lazy Method for Reading Big File in Python?

谎友^ 2020-11-21 06:40

I have a very big file (4 GB) and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece, store the processed piece into another file and read the next piece.

12 Answers
  • 2020-11-21 07:05

    There are already many good answers, but if your entire file is on a single line and you still want to process "rows" (as opposed to fixed-size blocks), those answers will not help you.

    99% of the time, it is possible to process files line by line. Then, as suggested in this answer, you can use the file object itself as a lazy generator:

    with open('big.csv') as f:
        for line in f:
            process(line)
    

    However, I once ran into a very large (almost) single-line file, where the row separator was in fact not '\n' but '|'.

    • Reading line by line was not an option, but I still needed to process it row by row.
    • Converting '|' to '\n' before processing was also out of the question, because some of the fields of this CSV contained '\n' (free-text user input).
    • Using the csv library was also ruled out because, at least in early versions of the library, it is hardcoded to read its input line by line.

    For this kind of situation, I created the following snippet:

    def rows(f, chunksize=1024, sep='|'):
        """
        Read a file lazily, yielding rows split on `sep` (default '|').

        Usage:

        >>> with open('big.csv') as f:
        >>>     for row in rows(f):
        >>>         process(row)
        """
        curr_row = ''
        while True:
            chunk = f.read(chunksize)
            if chunk == '':  # End of file
                yield curr_row
                break
            while True:
                i = chunk.find(sep)
                if i == -1:  # no more separators in this chunk
                    break
                yield curr_row + chunk[:i]
                curr_row = ''
                chunk = chunk[i+1:]
            curr_row += chunk
    

    I was able to use it successfully to solve my problem. It has been extensively tested, with various chunk sizes.
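
    For example, to tie it back to the original question (process each piece and store it in another file), a hypothetical per-row process() could be plugged in like this; the file names below are placeholders, not part of the original snippet:

    # Hypothetical usage sketch: file names and process() are placeholders.
    def process(row):
        return row.strip().upper()   # stand-in for real per-row processing

    with open('big.csv') as f_in, open('big_processed.csv', 'w') as f_out:
        for row in rows(f_in):
            f_out.write(process(row) + '\n')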


    Test suite, for those who want to convince themselves.

    import os

    test_file = 'test_file'
    
    def cleanup(func):
        def wrapper(*args, **kwargs):
            func(*args, **kwargs)
            os.unlink(test_file)
        return wrapper
    
    @cleanup
    def test_empty(chunksize=1024):
        with open(test_file, 'w') as f:
            f.write('')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 1
    
    @cleanup
    def test_1_char_2_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            f.write('|')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 2
    
    @cleanup
    def test_1_char(chunksize=1024):
        with open(test_file, 'w') as f:
            f.write('a')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 1
    
    @cleanup
    def test_1025_chars_1_row(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1025):
                f.write('a')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 1
    
    @cleanup
    def test_1024_chars_2_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1023):
                f.write('a')
            f.write('|')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 2
    
    @cleanup
    def test_1025_chars_1026_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1025):
                f.write('|')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 1026
    
    @cleanup
    def test_2048_chars_2_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1022):
                f.write('a')
            f.write('|')
            f.write('a')
            # -- end of 1st chunk --
            for i in range(1024):
                f.write('a')
            # -- end of 2nd chunk
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 2
    
    @cleanup
    def test_2049_chars_2_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1022):
                f.write('a')
            f.write('|')
            f.write('a')
            # -- end of 1st chunk --
            for i in range(1024):
                f.write('a')
            # -- end of 2nd chunk
            f.write('a')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 2
    
    if __name__ == '__main__':
        for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
            test_empty(chunksize)
            test_1_char_2_rows(chunksize)
            test_1_char(chunksize)
            test_1025_chars_1_row(chunksize)
            test_1024_chars_2_rows(chunksize)
            test_1025_chars_1026_rows(chunksize)
            test_2048_chars_2_rows(chunksize)
            test_2049_chars_2_rows(chunksize)
    
  • 2020-11-21 07:07

    Refer to Python's official documentation on iter(): https://docs.python.org/3/library/functions.html#iter

    Maybe this method is more Pythonic:

    from functools import partial

    # A file object returned by open() has a read() method that accepts a
    # size argument, so iter() with a sentinel turns it into a lazy iterator.
    with open('mydata.db', 'r') as f_in:

        part_read = partial(f_in.read, 1024*1024)
        iterator = iter(part_read, '')  # in text mode, read() returns '' at EOF

        for index, block in enumerate(iterator, start=1):
            block = process_block(block)    # process your block data

            with open(f'{index}.txt', 'w') as f_out:
                f_out.write(block)
    
  • 2020-11-21 07:08

    file.readlines() takes an optional sizehint argument: instead of reading to the end of the file, it stops once roughly that many bytes/characters' worth of complete lines have been read, so you can process the file in batches of lines.

    with open('bigfilename') as bigfile:
        tmp_lines = bigfile.readlines(BUF_SIZE)
        while tmp_lines:
            process(tmp_lines)
            tmp_lines = bigfile.readlines(BUF_SIZE)
    
  • 2020-11-21 07:16

    In Python 3.8+ you can use .read() in a while loop:

    with open("somefile.txt") as f:
        while chunk := f.read(8192):
            do_something(chunk)
    

    Of course, you can use any chunk size you want; you don't have to use 8192 (2**13) bytes. Unless your file's size happens to be a multiple of your chunk size, the last chunk will be smaller than your chunk size.
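
    For the original use case of storing each processed piece into another file, a minimal sketch of the same pattern in binary mode might look like this (process_chunk and the file names are assumptions, not from the original answer):

    # Sketch only: 'bigfile.bin', 'out.bin' and process_chunk are placeholders.
    def process_chunk(chunk: bytes) -> bytes:
        return chunk.upper()   # stand-in for real per-chunk processing

    with open('bigfile.bin', 'rb') as src, open('out.bin', 'wb') as dst:
        while chunk := src.read(8192):
            dst.write(process_chunk(chunk))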

  • 2020-11-21 07:21

    I think we can write it like this:

    def read_file(path, block_size=1024):
        with open(path, 'rb') as f:
            while True:
                piece = f.read(block_size)
                if not piece:
                    return
                yield piece

    for piece in read_file(path):
        process_piece(piece)
    
  • 2020-11-21 07:21

    You can use the following code.

    file_obj = open('big_file', 'rb')


    open() returns a file object.

    Then use os.stat() to get its size in bytes:

    import os

    file_size = os.stat('big_file').st_size

    # ceiling division, so the final (possibly smaller) chunk is read as well
    for i in range((file_size + 1023) // 1024):
        print(file_obj.read(1024))

    file_obj.close()
    