I have a very big file (4 GB) and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece store the processed piece into another file before reading the next piece. Is there a lazy method to yield these pieces?
To write a lazy function, just use yield:
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
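The question mentions writing each processed piece to another file. A minimal sketch of that workflow, assuming a binary file, a hypothetical output name processed.dat, and a placeholder process_data that just transforms the bytes:

def process_data(piece):
    # placeholder transformation; replace with the real processing
    return piece.upper()

with open('really_big_file.dat', 'rb') as f_in, open('processed.dat', 'wb') as f_out:
    for piece in read_in_chunks(f_in, chunk_size=4096):
        f_out.write(process_data(piece))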
Another option would be to use iter and a helper function:
f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
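If the file were opened in binary mode, read() returns bytes and the sentinel would have to be b'' instead of ''; functools.partial also avoids the throwaway helper. A sketch, using the same placeholder file name:

from functools import partial

with open('really_big_file.dat', 'rb') as f:
    for piece in iter(partial(f.read, 1024), b''):
        process_data(piece)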
If the file is line-based, the file object is already a lazy generator of lines:
for line in open('really_big_file.dat'):
    process_data(line)
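Wrapping that in a with block makes sure the file is closed when the loop finishes (same placeholder file name):

with open('really_big_file.dat') as f:
    for line in f:
        process_data(line)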
I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:
def get_line():
    with open('4gb_file') as file:
        for i in file:
            yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]
Update: Thanks nosklo. Here's what I meant. It almost works, except that it loses a line 'between' chunks.
chunk = [next(gen) for i in range(lines_required)]
This does the trick without losing any lines, but it doesn't look very nice.
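For what it's worth, itertools.islice expresses the same idea more cleanly; this is a sketch under the same assumptions (100 lines per chunk, the hypothetical file name from above), not part of the original answer:

from itertools import islice

def get_chunks(file_name, lines_required):
    with open(file_name) as f:
        while True:
            chunk = list(islice(f, lines_required))
            if not chunk:
                break
            yield chunk

for chunk in get_chunks('4gb_file', 100):
    process_data(chunk)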
If your computer, OS, and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:
import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b'Hello Python!\n'
    # read content via slice notation
    print(mm[:5])  # prints b'Hello'
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b'Hello  world!\n'
    # close the map
    mm.close()
If your computer, OS, or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
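For read-only access you would typically map with ACCESS_READ and walk the mapping in slices; a rough sketch, assuming the placeholder file name from above and an arbitrary 64 kB chunk size:

import mmap

with open('really_big_file.dat', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for offset in range(0, len(mm), 65536):
        process_data(mm[offset:offset + 65536])
    mm.close()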
I am not allowed to comment due to my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint])
Python file methods
Edit: SilentGhost is right, but this should be better than:
s = ""
for i in xrange(100):
s += file.next()
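For reference, a readlines(sizehint) loop would look roughly like this; the hint is in bytes and only approximate, and the file name is the usual placeholder:

with open('really_big_file.dat') as f:
    while True:
        lines = f.readlines(100000)  # read roughly 100 kB worth of lines
        if not lines:
            break
        for line in lines:
            process_data(line)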
To process line by line, this is an elegant solution:
def stream_lines(file_name):
    file = open(file_name)
    while True:
        line = file.readline()
        if not line:
            file.close()
            break
        yield line
Blank lines are not a problem here: readline() returns '\n' for a blank line and only returns an empty string at the end of the file.
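Usage is then the usual generator loop (placeholder file name again):

for line in stream_lines('really_big_file.dat'):
    process_data(line)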
f = ...  # file-like object, i.e. supporting read(size) function and
         # returning empty string '' when there is nothing to read

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    process_data(data)  # process the data
UPDATE: The approach is best explained in https://stackoverflow.com/a/4566523/38592