How can I read large text files in Python, line by line, without loading them into memory?

I need to read a large file, line by line. Let's say the file is larger than 5 GB and I need to read each line, but obviously I do not want to use readlines() bec

15 answers

囚心锁ツ

2020-11-22 04:00
You are better off using an iterator instead. Relevant: http://docs.python.org/library/fileinput.html

From the docs:
```
import fileinput
for line in fileinput.input("filename"):
    process(line)
```
This will avoid copying the whole file into memory at once.
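If several files need to be read as one stream, the same module can be used through its FileInput class. The following is only a sketch: the helper name iter_numbered_lines and its paths argument are made up for illustration, and it assumes the files are text in the default encoding.

```python
import fileinput


def iter_numbered_lines(paths):
    """Yield (filename, line_number, line) for every line in the given files.

    Only one line is held in memory at a time. `paths` is assumed to be a
    non-empty list of file paths (hypothetical argument name).
    """
    # FileInput is the class behind fileinput.input(); using it directly
    # avoids the module-level global state.
    with fileinput.FileInput(files=paths) as stream:
        for line in stream:
            # filelineno() is the line number within the current file
            yield stream.filename(), stream.filelineno(), line
```

Because this is a generator, nothing is read until you loop over it, for example with `for name, number, line in iter_numbered_lines(["a.log", "b.log"]):`.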

粉色の甜心

2020-11-22 04:01

How about this? Divide your file into chunks and process it one chunk at a time. When you read a file, your operating system caches the data that follows the part you asked for, so reading strictly line by line does not make efficient use of the cached information.

Instead, divide the file into chunks, load a whole chunk into memory, and then do your processing.

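```
import os

def chunks(fh, size=1024):
    """Yield (start, length) pairs for chunks of about `size` bytes that end on a line boundary."""
    while True:
        start = fh.tell()        # current position, counted from the start of the file
        fh.seek(size, 1)         # move `size` bytes forward from the current position (whence=1)
        tail = fh.readline()     # extend the chunk to the end of the current line
        yield start, fh.tell() - start  # only the boundaries are yielded, not the data
        if not tail:             # nothing could be read after the seek: this was the last chunk
            break

fname = 'filename'  # placeholder path
if os.path.isfile(fname):
    try:
        # binary mode is required for relative seeks in Python 3
        with open(fname, 'rb') as fh:
            for start, length in chunks(fh):
                fh.seek(start)
                data = fh.read(length)
                print(data)
    except IOError as e:  # e.g. permission denied
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    except Exception as e1:  # handle other exceptions such as attribute errors
        print("Unexpected error: {0}".format(e1))
```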
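Another way to get the same effect, without seeking back and forth, is to read fixed-size blocks and carry the incomplete last line over to the next block. This is a rough sketch, not the method above: the function name iter_lines_in_chunks and the 1 MiB default are arbitrary choices, and it assumes lines end with "\n" (with "\r\n" endings, the "\r" stays on each line).

```python
def iter_lines_in_chunks(path, chunk_size=1024 * 1024):
    """Yield lists of complete lines (bytes, without the trailing newline).

    The file is read about `chunk_size` bytes at a time, so memory use is
    bounded by the chunk size plus the longest line.
    """
    leftover = b""
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk_size)
            if not block:
                break
            lines = (leftover + block).split(b"\n")
            # The last piece may be an incomplete line; keep it for the next block.
            leftover = lines.pop()
            if lines:
                yield lines
    # A final line that has no trailing newline
    if leftover:
        yield [leftover]
```

Each yielded list can be processed in one go, for example `for batch in iter_lines_in_chunks("filename"): handle(batch)`, where handle is whatever per-chunk processing you need.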

既然无缘

2020-11-22 04:04
Please try this:
```
with open('filename','r',buffering=100000) as f:
    for line in f:
        print(line)
```
一整个雨季

2020-11-22 04:07
The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.

dask.dataframe takes care of chunking internally, supports many parallelisable operations and allows you to export slices back to pandas easily for in-memory operations.
```
import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field=='XYZ'].compute()
```
滥情空心

2020-11-22 04:08
All you need to do is use the file object as an iterator.
```
for line in open("log.txt"):
    do_something_with(line)
```
Even better is to use a context manager, which is available in recent Python versions.
```
with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)
```
This will automatically close the file as well.
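Since the file object yields one line at a time, it can also be fed straight into a generator expression, so an aggregate over a huge file never holds more than one line. A minimal sketch, where count_matching, needle and the UTF-8 default are illustrative assumptions rather than anything from the answer:

```python
def count_matching(path, needle, encoding="utf-8"):
    """Count the lines in `path` that contain the substring `needle`.

    The generator expression pulls lines from the file object one by one,
    so the whole file is never loaded into memory.
    """
    with open(path, encoding=encoding) as fileobject:
        return sum(1 for line in fileobject if needle in line)
```

For example, `count_matching("log.txt", "ERROR")` would return the number of lines in log.txt that mention ERROR.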
北恋

2020-11-22 04:10
I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So, I recreated the cp command using line by line reading and writing. It's CRAZY FAST.
```
#!/usr/bin/env python3.6

import sys

with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)
```
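One caveat with copying in text mode is that the data is decoded and newlines may be translated, so the copy is not guaranteed to be byte-identical. If that matters, a different technique is to copy in binary mode with shutil.copyfileobj from the standard library. This is a sketch only: copy_file and the 1 MiB buffer size are arbitrary choices for illustration.

```python
import shutil


def copy_file(src, dst, buffer_size=1024 * 1024):
    """Copy `src` to `dst` byte for byte, reading `buffer_size` bytes at a time."""
    # Binary mode: no decoding and no newline translation
    with open(src, "rb") as infile, open(dst, "wb") as outfile:
        shutil.copyfileobj(infile, outfile, buffer_size)
```

Like the line-by-line version, this never holds more than one buffer of data in memory.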