How do I make a progress bar for loading pandas DataFrame from a large xlsx file?

我只是一个虾纸丫 提交于 2020-06-08 17:06:31

问题


from https://pypi.org/project/tqdm/:

import pandas as pd
import numpy as np
from tqdm import tqdm

df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
tqdm.pandas(desc="my bar!")p`
df.progress_apply(lambda x: x**2)

I took this code and edited it so that I create a DataFrame from load_excel rather than using random numbers:

import pandas as pd
from tqdm import tqdm
import numpy as np

filename="huge_file.xlsx"
df = pd.DataFrame(pd.read_excel(filename))
tqdm.pandas()
df.progress_apply(lambda x: x**2)

This gave me an error, so I changed df.progress_apply to this:

df.progress_apply(lambda x: x)

Here is the final code:

import pandas as pd
from tqdm import tqdm
import numpy as np

filename="huge_file.xlsx"
df = pd.DataFrame(pd.read_excel(filename))
tqdm.pandas()
df.progress_apply(lambda x: x)

This results in a progress bar, but it doesn't actually show any progress, rather it loads the bar, and when the operation is done it jumps to 100%, defeating the purpose.

My question is this: How do I make this progress bar work?
What does the function inside of progress_apply actually do?
Is there a better approach? Maybe an alternative to tqdm?

Any help is greatly appreciated.


回答1:


Will not work. pd.read_excel blocks until the file is read, and there is no way to get information from this function about its progress during execution.

It would work for read operations which you can do chunk wise, like

chunks = []
for chunk in pd.read_csv(..., chunksize=1000):
    update_progressbar()
    chunks.append(chunk)

But as far as I understand tqdm also needs the number of chunks in advance, so for a propper progress report you would need to read the full file first....




回答2:


DISCLAIMER: This works only with xlrd engine and is not thoroughly tested!

How it works? We monkey-patch xlrd.xlsx.X12Sheet.own_process_stream method that is responsible to load sheets from file-like stream. We supply own stream, that contains our progress bar. Each sheet has it's own progress bar.

When we want the progress bar, we use load_with_progressbar() context manager and then do pd.read_excel('<FILE.xlsx>').

import xlrd
from tqdm import tqdm
from io import RawIOBase
from contextlib import contextmanager


class progress_reader(RawIOBase):
    def __init__(self, zf, bar):
        self.bar = bar
        self.zf = zf

    def readinto(self, b):
        n = self.zf.readinto(b)
        self.bar.update(n=n)
        return n


@contextmanager
def load_with_progressbar():

    def my_get_sheet(self, zf, *other, **kwargs):
        with tqdm(total=zf._orig_file_size) as bar:
            sheet = _tmp(self, progress_reader(zf, bar), **kwargs)
        return sheet

    _tmp = xlrd.xlsx.X12Sheet.own_process_stream

    try:
        xlrd.xlsx.X12Sheet.own_process_stream = my_get_sheet
        yield
    finally:
        xlrd.xlsx.X12Sheet.own_process_stream = _tmp


import pandas as pd

with load_with_progressbar():
    df = pd.read_excel('sample2.xlsx')

print(df)

Screenshot of progress bar:



来源:https://stackoverflow.com/questions/52209290/how-do-i-make-a-progress-bar-for-loading-pandas-dataframe-from-a-large-xlsx-file

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!