Pandas and Multiprocessing Memory Management: Splitting a DataFrame into Multiple Chunks

难免孤独 2021-02-06 06:24

I have to process a huge pandas.DataFrame (several tens of GB) on a row-by-row basis, where each row operation is quite lengthy (a couple of tens of milliseconds).

2 Answers
  •  梦毁少年i
    2021-02-06 06:55

    A better implementation is to use pandas' built-in chunked reading as a generator and feed it into the "pool.imap" function: pd.read_csv('.csv', chunksize=...) https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

    Benefit: It doesn't read the whole df into your main process (saving memory). Each child process only receives the chunk it needs --> this solves the child memory issue.

    Overhead: It requires you to save your df as a CSV first and read it back in with pd.read_csv --> I/O time.

    Note: chunksize is not available for pd.read_pickle or other loading methods that compress the data on storage.
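    To make the "generator" point concrete, here is a minimal illustration (assuming the DataFrame has already been written to the placeholder path '.csv'): with chunksize set, pd.read_csv returns an iterator that yields DataFrames of at most chunksize rows instead of one big frame.

    import pandas as pd

    reader = pd.read_csv('.csv', chunksize=100)  # an iterator over chunks, not a DataFrame
    first_chunk = next(reader)                   # a DataFrame with at most 100 rows
    print(len(first_chunk))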

    import multiprocessing as mp

    import numpy as np
    import pandas as pd


    def just_wait_and_print_len_and_idx(chunk):
        # Minimal stand-in for the worker from the question: the lengthy
        # per-row processing would go here; we just report the chunk size
        # and its first index.
        print('chunk with %d rows, starting at index %s' % (len(chunk), chunk.index[0]))


    def main():
        # Job parameters
        n_jobs = 4            # Pool size
        size = (10000, 1000)  # Size of DataFrame
        chunksize = 100       # Maximum number of rows per chunk

        # Preparation: build the DataFrame and save it to CSV so the chunks
        # can be re-read lazily ('.csv' is the placeholder path from above)
        df = pd.DataFrame(np.random.rand(*size))
        df.to_csv('.csv', index=False)
        pool = mp.Pool(n_jobs)

        print('Starting MP')

        # Read the CSV back as an iterator of DataFrame chunks and feed it
        # to the pool; iterate over the results as the chunks are processed
        df_chunked = pd.read_csv('.csv', chunksize=chunksize)  # modified
        for _ in pool.imap(just_wait_and_print_len_and_idx, df_chunked):  # modified
            pass

        pool.close()
        pool.join()

        print('DONE')

    if __name__ == '__main__':
        main()
    

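    For comparison, if the DataFrame already fits in the parent process's memory and you want to avoid the CSV round trip, chunking can also be done directly on the in-memory frame. The sketch below is not from the original answer (the helper names work_on_chunk and iter_chunks are made up for illustration); it slices the df with iloc and sends each slice to a worker.

    import multiprocessing as mp

    import numpy as np
    import pandas as pd


    def work_on_chunk(chunk):
        # Hypothetical worker: the lengthy per-row processing would go here
        return len(chunk)


    def iter_chunks(df, chunksize):
        # Yield successive row slices of the in-memory DataFrame
        for start in range(0, len(df), chunksize):
            yield df.iloc[start:start + chunksize]


    if __name__ == '__main__':
        df = pd.DataFrame(np.random.rand(10000, 1000))

        with mp.Pool(4) as pool:
            # The whole df stays in the parent process and each slice is
            # pickled to a worker, so this trades the CSV I/O cost for
            # higher parent-side memory use.
            results = pool.imap(work_on_chunk, iter_chunks(df, 100))
            print(sum(results))  # total number of rows processed

    The trade-off is the mirror image of the CSV approach above: no disk I/O, but the parent must hold the full DataFrame and serialize each chunk for its child process.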