I need to read a large file, line by line. Let's say the file is more than 5 GB and I need to read each line, but obviously I do not want to use readlines() because it would build the whole file as a huge list in memory.
Here's code for loading text files of any size without causing memory issues. It supports files that are gigabytes in size.
https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d
Download the file data_loading_utils.py and import it into your code.
Usage:
import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000

def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data; data is one single line of the file
        pass
    else:
        # end of file reached
        pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
The process_lines method is the callback function. It will be called for every line, with the parameter data representing one single line of the file at a time.
You can configure CHUNK_SIZE depending on your machine's hardware configuration.
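If you would rather not depend on the gist, here is a minimal sketch of how a chunk-based reader with a per-line callback could look. It mirrors the usage above, but it is a rough stand-in, not the gist's actual implementation.

def read_lines_from_file_as_data_chunks(file_name, chunk_size, callback):
    # Read the file in chunks and invoke callback(data, eof, file_name)
    # once per complete line, then once more with eof=True at the end.
    with open(file_name, 'r') as f:
        leftover = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # end of file: hand back whatever is left (possibly empty)
                callback(leftover, True, file_name)
                break
            lines = (leftover + chunk).split('\n')
            leftover = lines.pop()  # last piece may be an incomplete line
            for line in lines:
                callback(line, False, file_name)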
Thank you! I have recently converted to Python 3 and have been frustrated by using readlines() to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I take to mean it was still in binary (bytes) format. Using decode('utf-8') converted it to ASCII text.
Then I had to remove an "=\n" in the middle of each line.
Then I split the lines at the newline character.
b_data = fh.read(ele[1])  # one chunk of data, still in binary (bytes) format
a_data = binascii.b2a_qp(b_data).decode('utf-8')  # chunk as quoted-printable ASCII text
data_chunk = a_data.replace('=\n', '').strip()  # splitting ('=\n') characters removed
data_list = data_chunk.split('\n')  # list containing the lines in this chunk
# print(data_list, '\n')
# time.sleep(1)
for j in range(len(data_list)):  # iterate through data_list to get each item
    i += 1
    line_of_data = data_list[j]
    print(line_of_data)
The code above starts just above "print data" in Arohi's code.
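For anyone who wants a self-contained version of those steps, here is a rough sketch. The file name and chunk size are placeholders, and note that a chunk boundary can still split a line, just as in the fragment above.

import binascii

CHUNK_SIZE = 1024 * 1024  # arbitrary chunk size for this example

with open('file_name.ext', 'rb') as fh:  # read the file in binary mode
    b_data = fh.read(CHUNK_SIZE)  # one chunk of raw bytes
    a_data = binascii.b2a_qp(b_data).decode('utf-8')  # quoted-printable ASCII text
    data_chunk = a_data.replace('=\n', '').strip()  # drop the '=\n' soft line breaks
    for line_of_data in data_chunk.split('\n'):  # one line at a time
        print(line_of_data)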
This might be useful when you want to work in parallel and read only chunks of data, while keeping each chunk aligned to line boundaries.
def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # extend the chunk one character at a time until it ends on a newline
        while data[-1:] != '\n':
            extra = fileObj.read(1)
            if not extra:
                break  # end of file reached without a trailing newline
            data += extra
        yield data
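A quick usage sketch (the file name and the per-line work are placeholders):

with open('big_file.txt', 'rt') as f:
    for chunk in readInChunks(f, chunkSize=1024 * 1024):
        for line in chunk.splitlines():
            print(line)  # replace with real per-line processing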
An old-school approach:
fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
I demonstrated a parallel, byte-level random-access approach in this other question:
Getting number of lines in a text file without readlines
Some of the answers already provided are nice and concise. I like some of them. But it really depends on what you want to do with the data that's in the file. In my case I just wanted to count lines, as fast as possible, on big text files. My code can be modified to do other things too, of course, like any code.
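The linked answer has the full parallel version; as a rough single-process sketch of the byte-level idea (the chunk size here is arbitrary), counting newline bytes in fixed-size binary chunks avoids building any per-line objects:

def count_lines(path, chunk_size=1024 * 1024):
    # Count '\n' bytes by reading the file in fixed-size binary chunks,
    # so no per-line strings or lists are ever built.
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count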
I provided this answer because Keith's, while succinct, doesn't close the file explicitly:
with open("log.txt") as infile:
for line in infile:
do_something_with(line)