Question
I have read numerous threads on similar topics on the forum. However, I believe what I am asking here is not a duplicate question.
I am reading a very large dataset (22 GB) in CSV format with 350 million rows. I am trying to read the dataset in chunks, based on the solution provided in that link.
My current code is as follows:
import pandas as pd

def Group_ID_Company(chunk_of_dataset):
    return chunk_of_dataset.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()

chunk_size = 9000000
chunk_skip = 1

# First chunk: write the aggregated result with a header
transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv')

# Remaining chunks: append the aggregated results without headers
for i in range(0, 38):
    chunk_skip += chunk_size
    transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
    Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv', mode='a', header=False)
There is no issue with the code; it runs fine. But the statement groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum() only runs over 9,000,000 rows at a time, which is the declared chunk_size, whereas I need it to run over the entire dataset, not chunk by chunk.
The reason is that when the aggregation runs chunk by chunk, only the rows within one chunk get summed together, while many other rows belonging to the same (id, company) groups are scattered across the dataset and end up in other chunks.
A possible solution is to run the code again on the newly generated "Group_ID_Company.csv". That way, the code would go through the new dataset once more and sum() the required columns. However, I am thinking there may be another (better) way of achieving that.
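For illustration, that second pass could look roughly like the sketch below. It assumes the intermediate file Group_ID_Company.csv fits comfortably in memory (each chunk contributes at most one row per (id, company) pair, so it is far smaller than the raw 22 GB file); Group_ID_Company_final.csv is just an example output name.

import pandas as pd

# Second pass over the per-chunk partial sums: groups that were split across
# chunks appear as several rows in Group_ID_Company.csv, so grouping and
# summing them again gives the totals over the whole dataset.
partial_sums = pd.read_csv('Group_ID_Company.csv')
final_sums = partial_sums.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
final_sums.to_csv('Group_ID_Company_final.csv')  # example output name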
Answer 1:
The solution to your problem is probably Dask. You may watch the introductory video, read the examples, and try them online in a live session (in JupyterLab).
Answer 2:
The answer from MarianD worked perfectly; I am answering to share the solution code here.
Moreover, Dask is able to utilize all cores, whereas Pandas was using only one core at 100%. That is another benefit of Dask over Pandas that I have noticed.
import dask.dataframe as dd

# Dask reads the CSV lazily and partitions it automatically
transactions_dataset_DF = dd.read_csv('transactions.csv')

# .compute() runs the group-by in parallel and returns a pandas DataFrame
Group_ID_Company_DF = transactions_dataset_DF.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum().compute()
Group_ID_Company_DF.to_csv('Group_ID_Company.csv')

# to clear the memory
transactions_dataset_DF = None
Group_ID_Company_DF = None
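A side note on the core utilization mentioned above: Dask's default local scheduler already uses all available cores. If explicit control over the number of workers is wanted, one option (purely optional, not part of MarianD's answer) is the distributed scheduler from the optional distributed package; the worker counts below are only example values.

import dask.dataframe as dd
from dask.distributed import Client  # requires: pip install "dask[distributed]"

# Example only: a local cluster with an explicit worker layout
client = Client(n_workers=4, threads_per_worker=2)

transactions_dataset_DF = dd.read_csv('transactions.csv')
Group_ID_Company_DF = transactions_dataset_DF.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum().compute()
Group_ID_Company_DF.to_csv('Group_ID_Company.csv')

client.close()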
Dask was able to read all 350 million rows (20 GB of data) at once, which I had not been able to achieve with Pandas: there I had to create 37 chunks to process the entire dataset, and the processing took almost 2 hours.
With Dask, it took only around 20 minutes to read, process, and save the new dataset in one go.
Source: https://stackoverflow.com/questions/65130143/how-to-processes-the-extremely-large-dataset-into-chunks-in-python-pandas-whi