I need to read a large file, line by line. Let's say the file is more than 5 GB and I need to read each line, but obviously I do not want to use readlines() because it would build the whole file as a huge list in memory.
Here's code for loading text files of any size without causing memory issues. It supports files that are gigabytes in size.
https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d
Download the file data_loading_utils.py and import it into your code.
Usage:
import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000

def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data; data is one single line of the file
        pass
    else:
        # end of file reached
        pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
The process_lines method is the callback function. It will be called for every line, with the parameter data representing one single line of the file at a time.
You can configure CHUNK_SIZE depending on your machine's hardware configuration.
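If you would rather not depend on the gist, here is a minimal sketch of how a chunk-based reader with a per-line callback could look. It mirrors the usage above, but it is a rough stand-in, not the gist's actual implementation.

def read_lines_from_file_as_data_chunks(file_name, chunk_size, callback):
    # Read the file in chunks and invoke callback(data, eof, file_name)
    # once per complete line, then once more with eof=True at the end.
    with open(file_name, 'r') as f:
        leftover = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # end of file: hand back whatever is left (possibly empty)
                callback(leftover, True, file_name)
                break
            lines = (leftover + chunk).split('\n')
            leftover = lines.pop()  # last piece may be an incomplete line
            for line in lines:
                callback(line, False, file_name)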
Thank you! I have recently converted to Python 3 and have been frustrated by using readlines() to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I take to mean it was still in binary (bytes) format. Using decode('utf-8') converted it to ASCII text.
Then I had to remove an "=\n" in the middle of each line.
Then I split the lines at the newline character.
b_data = fh.read(ele[1])  # one chunk of data, still in binary (bytes) format
a_data = binascii.b2a_qp(b_data).decode('utf-8')  # chunk as quoted-printable ASCII text
data_chunk = a_data.replace('=\n', '').strip()  # splitting ('=\n') characters removed
data_list = data_chunk.split('\n')  # list containing the lines in this chunk
# print(data_list, '\n')
# time.sleep(1)
for j in range(len(data_list)):  # iterate through data_list to get each item
    i += 1
    line_of_data = data_list[j]
    print(line_of_data)
The code above starts just above "print data" in Arohi's code.
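For anyone who wants a self-contained version of those steps, here is a rough sketch. The file name and chunk size are placeholders, and note that a chunk boundary can still split a line, just as in the fragment above.

import binascii

CHUNK_SIZE = 1024 * 1024  # arbitrary chunk size for this example

with open('file_name.ext', 'rb') as fh:  # read the file in binary mode
    b_data = fh.read(CHUNK_SIZE)  # one chunk of raw bytes
    a_data = binascii.b2a_qp(b_data).decode('utf-8')  # quoted-printable ASCII text
    data_chunk = a_data.replace('=\n', '').strip()  # drop the '=\n' soft line breaks
    for line_of_data in data_chunk.split('\n'):  # one line at a time
        print(line_of_data)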
This might be useful when you want to work in parallel and read only chunks of data, while keeping each chunk aligned to line boundaries.
def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # extend the chunk one character at a time until it ends on a newline
        while data[-1:] != '\n':
            extra = fileObj.read(1)
            if not extra:
                break  # end of file reached without a trailing newline
            data += extra
        yield data
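A quick usage sketch (the file name and the per-line work are placeholders):

with open('big_file.txt', 'rt') as f:
    for chunk in readInChunks(f, chunkSize=1024 * 1024):
        for line in chunk.splitlines():
            print(line)  # replace with real per-line processing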
An old-school approach:
fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
I demonstrated a parallel, byte-level random-access approach in this other question:
Getting number of lines in a text file without readlines
Some of the answers already provided are nice and concise. I like some of them. But it really depends on what you want to do with the data that's in the file. In my case I just wanted to count lines, as fast as possible, on big text files. My code can be modified to do other things too, of course, like any code.
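The linked answer has the full parallel version; as a rough single-process sketch of the byte-level idea (the chunk size here is arbitrary), counting newline bytes in fixed-size binary chunks avoids building any per-line objects:

def count_lines(path, chunk_size=1024 * 1024):
    # Count '\n' bytes by reading the file in fixed-size binary chunks,
    # so no per-line strings or lists are ever built.
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count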
I provided this answer because Keith's, while succinct, doesn't close the file explicitly:
with open("log.txt") as infile:
for line in infile:
do_something_with(line)