Python MemoryError when reading large files, need ideas to apply multiprocessing in the case below?

守給你的承諾、 Submitted on 2020-07-16 04:22:46

Question


I have a file which stores data in the following format:

TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]
TIME[03.26_12:28:30.753664]ID[ROLL:2341987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.26_12:29:30.853664]ID[ROLL:201978623]MARKS[PHY:0|MATH:0|CHEM:40]
TIME[04.27_12:29:30.553664]ID[ROLL:2034287623]MARKS[PHY:100|MATH:200|CHEM:400]

I found the method below to fulfill the need described in this question; please refer to this link for clarification:

import re
from itertools import groupby

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def func1(arg) -> bool:
    # keep only lines that match the expected record format
    return bool(regex.match(arg))


def func2(arg) -> str:
    # sort key: the timestamp string
    match = regex.match(arg)
    if match:
        return match.group(1)
    return ""


def func3(arg) -> int:
    # sort/group key: the roll number
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0


with open(your_input_file) as fr:  # your_input_file: path to the input file
    collection = filter(func1, fr)
    # Two stable sorts: by timestamp first, then by roll, so the lines end
    # up grouped by roll with timestamps ascending within each roll.
    collection = sorted(collection, key=func2)
    collection = sorted(collection, key=func3)
    for key, group in groupby(collection, key=func3):
        with open(f"ROLL_{key}", mode="w") as fw:
            fw.writelines(group)
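For the sample data above, this produces one file per roll number; ROLL_201987623, for example, ends up with that roll's two lines ordered by timestamp.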

The above code creates the files just as I want: it sorts the file contents according to the timestamps, and I get the correct output. But when I tried it on a large file of about 1.7 GB, it raised a MemoryError, so I tried the following method.

Failed attempt:

from functools import partial

with open('my_file.txt') as fr:
    part_read = partial(fr.read, 1024 * 1024)
    # BUG: the file is opened in text mode, so part_read returns str and
    # never equals the bytes sentinel b'' -- this loop never terminates.
    iterator = iter(part_read, b'')
    for index, fra in enumerate(iterator, start=1):
        # BUG: fra is a string, so filter iterates over single characters,
        # none of which match the regex -- hence no files are created.
        collection = filter(func1, fra)
        collection = sorted(collection, key=func2)
        collection = sorted(collection, key=func3)
        for key, group in groupby(collection, key=func3):
            with open(f'ROLL_{key}.txt', mode='a') as fw:
                fw.writelines(group)

This attempt didn't give me any results: no files were created at all, and it took an unexpectedly huge amount of time. Many answers suggest reading the file line by line, but how will I sort it then? Please suggest improvements to this code, or any new idea. Do I need to use multiprocessing here to make processing faster, and if so, how do I use it?

And one main constraint on my side: I can't store the data in any in-memory data structure, since the file can be huge.
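For the multiprocessing part of the question, below is a minimal sketch of one possible approach, assuming the regex matching is the CPU-bound step: the parent process reads the file in bounded batches, and a multiprocessing.Pool extracts the (roll, time) keys in parallel. The helper names extract_key and batched, the batch sizes, and the file name my_file.txt are illustrative assumptions, not code from the original post.

import re
from itertools import groupby
from multiprocessing import Pool

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def extract_key(line):
    """Return (roll, time, line) for matching lines, else None (hypothetical helper)."""
    match = regex.match(line)
    if match:
        return int(match.group(2)), match.group(1), line
    return None

def batched(iterable, size):
    """Yield lists of at most `size` items so only one batch is in memory."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

if __name__ == "__main__":
    with Pool() as pool, open("my_file.txt") as fr:
        for batch in batched(fr, 100_000):
            keyed = [k for k in pool.imap(extract_key, batch, chunksize=1_000) if k]
            keyed.sort()  # (roll, time) order within this batch only
            for roll, group in groupby(keyed, key=lambda t: t[0]):
                with open(f"ROLL_{roll}.txt", mode="a") as fw:
                    fw.writelines(line for _, _, line in group)

Note that each output file is appended batch by batch, so lines are time-sorted only within a batch; a final per-file sort (or an external merge, sketched below) is still needed for a global order. The speedup from the pool may also be modest, since the work per line is small relative to the inter-process pickling overhead.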


Answer 1:


If you want to read the file chunk by chunk, use this:

import re
from itertools import groupby
from typing import Tuple

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def func1(arg) -> bool:
    return bool(regex.match(arg))


def func2(arg) -> Tuple[int, str]:
    # composite key (roll, time): a single sort groups lines by roll and
    # orders them by timestamp within each roll
    match = regex.match(arg)
    if match:
        return int(match.group(2)), match.group(1)
    return 0, ""


def func3(arg) -> int:
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0


def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # Carry any incomplete trailing line over to the next chunk so that
    # splitlines() below never sees a record cut in half.
    tail = ""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        head, sep, tail = (tail + data).rpartition("\n")
        if sep:
            yield head + sep
    if tail:
        yield tail


with open('b.txt') as fr:
    for chunk in read_in_chunks(fr):
        collection = filter(func1, chunk.splitlines(keepends=True))
        collection = sorted(collection, key=func2)
        for key, group in groupby(collection, key=func3):
            # append, because the same roll can appear in several chunks
            with open(f"ROLL_{key}", mode="a") as fw:
                fw.writelines(group)
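The chunked version above sorts only within each chunk, so a roll that spans several chunks ends up with several independently sorted runs in its output file. A standard fix is an external merge sort: sort each chunk into a temporary run file, then stream-merge the runs with heapq.merge so at most one chunk is in memory at a time. This is a hedged sketch, not code from the answer; external_sort, sort_key, and the temp-file handling are illustrative.

import heapq
import os
import re
import tempfile
from itertools import groupby

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def sort_key(line):
    # (roll, time): group by roll first, order by timestamp within a roll
    match = regex.match(line)
    return (int(match.group(2)), match.group(1))

def external_sort(path, chunk_size=1024 * 1024):
    # Phase 1: sort chunks of complete lines into temporary run files.
    run_paths = []
    with open(path) as fr:
        while True:
            lines = fr.readlines(chunk_size)  # ~chunk_size bytes of whole lines
            if not lines:
                break
            lines = [l for l in lines if regex.match(l)]
            if not lines:
                continue
            if not lines[-1].endswith("\n"):
                lines[-1] += "\n"  # keep every run newline-terminated
            lines.sort(key=sort_key)
            fd, run_path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as fw:
                fw.writelines(lines)
            run_paths.append(run_path)
    # Phase 2: stream-merge the sorted runs and split by roll number.
    runs = [open(p) for p in run_paths]
    try:
        merged = heapq.merge(*runs, key=sort_key)
        for roll, group in groupby(merged, key=lambda l: sort_key(l)[0]):
            with open(f"ROLL_{roll}", mode="w") as fw:
                fw.writelines(group)
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)

external_sort('b.txt')

With chunk_size tuned so each run fits comfortably in memory, peak memory use stays around one chunk regardless of the total file size.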



Answer 2:


Let's try sorting everything in one pass:

import re
from itertools import groupby
from typing import Tuple

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def func1(arg) -> bool:
    return bool(regex.match(arg))


def func2(arg) -> Tuple[int, str]:
    # composite key (roll, time): one sort groups by roll and orders by
    # timestamp within each roll
    match = regex.match(arg)
    if match:
        return int(match.group(2)), match.group(1)
    return 0, ""


def func3(arg) -> int:
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0


with open('b.txt') as fr:
    collection = filter(func1, fr)
    collection = sorted(collection, key=func2)
    for key, group in groupby(collection, key=func3):
        with open(f"ROLL_{key}", mode="w") as fw:
            fw.writelines(group)
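Note that this single-pass variant still materializes every matching line in memory inside sorted(), so on a 1.7 GB file it runs into the same MemoryError the question describes; it is only viable when the file fits in RAM. For larger inputs, the external merge approach sketched under Answer 1 keeps memory bounded.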


Source: https://stackoverflow.com/questions/62702224/python-memory-error-when-reading-large-files-need-ideas-to-apply-mutiprocessin
