How can I read a large text file in Python, line by line, without loading it into memory?

臣服心动 2020-11-22 03:32

I need to read a large file, line by line. Let's say the file is larger than 5 GB and I need to read each line, but obviously I do not want to use readlines() bec

15 Answers
  • 2020-11-22 04:14

    Here is code for loading text files of any size without causing memory issues. It supports gigabyte-sized files.

    https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d

    Download the file data_loading_utils.py and import it into your code.

    Usage:

    import data_loading_utils

    file_name = 'file_name.ext'
    CHUNK_SIZE = 1000000


    def process_lines(data, eof, file_name):
        # check if end of file reached
        if not eof:
            # process data; data is one single line of the file
            pass
        else:
            # end of file reached
            pass


    data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
    

    process_lines is the callback function. It is called once for every line, with the data parameter holding a single line of the file at a time.

    You can tune CHUNK_SIZE to suit your machine's hardware configuration.
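
The gist itself is not reproduced here, so the following is only a sketch of how a chunked reader with a per-line callback could work. The function name read_lines_in_chunks is hypothetical, and the assumption that chunk_size counts characters is an illustration choice, not something taken from the gist. Only the callback signature (data, eof, file_name) follows the usage above.

```python
def read_lines_in_chunks(file_name, chunk_size=1_000_000, callback=None):
    """Hypothetical stand-in for the gist's helper (not its actual code).

    Reads the file in chunks of about chunk_size characters and passes
    each complete line to callback(data, eof, file_name).
    """
    leftover = ""  # partial line carried over from the previous chunk
    with open(file_name, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split("\n")
            leftover = lines.pop()  # last piece may be an incomplete line
            for line in lines:
                callback(line, False, file_name)
    if leftover:
        # final line of a file that does not end with a newline
        callback(leftover, False, file_name)
    callback(None, True, file_name)  # signal end of file
```

Only one chunk plus one partial line is held in memory at a time, which is why the file size does not matter much with this kind of design.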

  • 2020-11-22 04:14

    Thank you! I recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I guess means it was in binary format. Using decode('utf-8') changed it to ASCII text.

    Then I had to remove a "=\n" in the middle of each line.

    Then I split the lines at the new line.

    # fh, ele and i are defined earlier in the surrounding code (not shown)
    b_data = fh.read(ele[1])  # endat: one chunk of ASCII data in binary format
    a_data = binascii.b2a_qp(b_data).decode('utf-8')  # data chunk in 'split' ASCII format
    data_chunk = a_data.replace('=\n', '').strip()  # splitting characters removed
    data_list = data_chunk.split('\n')  # list containing the lines in the chunk
    # print(data_list, '\n')
    # time.sleep(1)
    for line_of_data in data_list:  # iterate through data_list to get each item
        i += 1
        print(line_of_data)
    

    This code starts just above "print data" in Arohi's code.
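
The "b'" prefix is how Python 3 displays a bytes object, which is what read() returns for a file opened in binary mode. As a possible alternative to the steps above, here is a minimal sketch that decodes the bytes directly. It assumes the file is UTF-8 encoded and uses "\n" line endings. The function name iter_lines_from_binary and its parameters are hypothetical and are not part of Arohi's code.

```python
import codecs


def iter_lines_from_binary(path, chunk_size=64 * 1024, encoding="utf-8"):
    """Yield decoded lines from a file that is read in binary chunks."""
    # An incremental decoder copes with multi-byte characters that are
    # split across two chunks.
    decoder = codecs.getincrementaldecoder(encoding)()
    leftover = ""  # partial line carried over from the previous chunk
    with open(path, "rb") as fh:
        while True:
            b_data = fh.read(chunk_size)
            # final=True on the last (empty) read flushes buffered bytes
            text = decoder.decode(b_data, final=not b_data)
            lines = (leftover + text).split("\n")
            leftover = lines.pop()
            yield from lines
            if not b_data:
                break
    if leftover:
        yield leftover  # last line of a file without a trailing newline
```

If nothing requires binary mode, opening the file in text mode with open(path, encoding="utf-8") and iterating over it gives decoded lines without any of this.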

  • 2020-11-22 04:14

    This might be useful when you want to work in parallel and read only chunks of data, while making sure every chunk ends on a line boundary.

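    def readInChunks(fileObj, chunkSize=1024):
        while True:
            data = fileObj.read(chunkSize)
            if not data:
                break
            # extend the chunk up to the next newline so no line is split
            while data[-1:] != '\n':
                extra = fileObj.read(1)
                if not extra:  # end of file without a trailing newline
                    break
                data += extra
            yield data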
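
To show what the chunks look like, here is a small self-contained sketch. It uses a variant of the generator (named read_in_chunks here, a hypothetical name) that finishes a partially read line with readline() instead of reading one character at a time, and it runs on an in-memory io.StringIO object rather than a real file. It assumes a file object opened in text mode.

```python
import io


def read_in_chunks(file_obj, chunk_size=1024):
    """Yield chunks of about chunk_size characters that end on a line boundary."""
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        if not data.endswith("\n"):
            # finish the partially read line; returns "" at end of file
            data += file_obj.readline()
        yield data


# Demonstration on an in-memory text file
sample = io.StringIO("alpha\nbeta\ngamma\ndelta\n")
chunks = list(read_in_chunks(sample, chunk_size=8))
for chunk in chunks:
    print(repr(chunk))
```

Each chunk holds only whole lines, so chunks can be handed to separate workers without any line being cut in half.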
    
  • 2020-11-22 04:16

    An old-school approach:

    fh = open(file_name, 'rt')
    line = fh.readline()
    while line:
        # do stuff with line
        line = fh.readline()
    fh.close()
    
  • 2020-11-22 04:16

    I demonstrated a parallel, byte-level, random-access approach in this other question:

    Getting number of lines in a text file without readlines

    Some of the answers already provided are nice and concise. I like some of them. But it really depends what you want to do with the data that's in the file. In my case I just wanted to count lines, as fast as possible on big text files. My code can be modified to do other things too of course, like any code.
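
The linked answer is not reproduced on this page, so the following is not that parallel code. It is only a serial sketch of the general idea of counting lines at the byte level without building line objects. The function name count_lines and the 1 MiB default chunk size are assumptions chosen for illustration.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count newline bytes by scanning the file in fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count
```

Note that this counts newline characters, so a final line that lacks a trailing newline is not included in the total.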

  • 2020-11-22 04:19

    I provided this answer because Keith's, while succinct, doesn't close the file explicitly:

    with open("log.txt") as infile:
        for line in infile:
            do_something_with(line)
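
In the snippet above, do_something_with is a placeholder. As one possible concrete version, here is a minimal sketch that counts the lines containing a given substring. The function name count_matching_lines, the needle parameter and the UTF-8 encoding are assumptions made for the example.

```python
def count_matching_lines(path, needle):
    """Count lines containing needle, reading one line at a time."""
    matches = 0
    # The with block closes the file even if an exception is raised.
    with open(path, encoding="utf-8") as infile:
        for line in infile:  # the file object yields lines lazily
            if needle in line:
                matches += 1
    return matches
```

A call such as count_matching_lines("log.txt", "ERROR") would then scan the file while keeping only the current line and Python's internal read buffer in memory.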
    