The documentation for Dask talks about repartitioning to reduce overhead here.
They do, however, seem to indicate that you need some knowledge of what your dataframe will look like beforehand.
As of Dask 2.0.0 you may call .repartition(partition_size="100MB").
This method performs an object-aware (.memory_usage(deep=True)) breakdown of partition size. It will join smaller partitions, or split partitions that have grown too large.
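For example, a minimal sketch of size-based repartitioning (the CSV glob is a hypothetical data source):

```python
import dask.dataframe as dd

# Hypothetical data source; substitute your own.
df = dd.read_csv("data/*.csv")

# Rebalance partitions to roughly 100MB each, measured with
# .memory_usage(deep=True) on each partition.
df = df.repartition(partition_size="100MB")

print(df.npartitions)
```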
Dask's Documentation also outlines the usage.
After discussion with mrocklin, a decent strategy for partitioning is to aim for 100MB partition sizes, guided by df.memory_usage().sum().compute(). With datasets that fit in RAM, the additional work this might involve can be mitigated by placing df.persist() at relevant points.
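A minimal sketch of that strategy, assuming a hypothetical CSV source, might look like:

```python
import dask.dataframe as dd

df = dd.read_csv("data/*.csv")  # hypothetical data source

# Total in-memory size of the dataframe, in bytes.
total_bytes = int(df.memory_usage().sum().compute())

# Aim for roughly 100MB per partition.
target = 100_000_000
df = df.repartition(npartitions=max(1, total_bytes // target))

# If the data fits in RAM, persist so the work done so far
# (including the memory_usage pass) is not recomputed downstream.
df = df.persist()
```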
Just to add to Samantha Hughes' answer:
memory_usage() by default ignores memory consumption of object dtype columns. For the datasets I have been working with recently, this leads to an underestimate of memory usage of about 10x. Unless you are sure there are no object dtype columns, I would suggest specifying deep=True, that is, repartition using:
df.repartition(npartitions=1 + df.memory_usage(deep=True).sum().compute() // n)
Where n is your target partition size in bytes. Adding 1 ensures the number of partitions is always at least 1 (// performs floor division).
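Putting it together, a minimal sketch (the data source and the 100MB target are illustrative assumptions):

```python
import dask.dataframe as dd

df = dd.read_csv("data/*.csv")  # hypothetical data source

n = 100_000_000  # target partition size in bytes (~100MB)

shallow = int(df.memory_usage().sum().compute())
deep = int(df.memory_usage(deep=True).sum().compute())
print(f"shallow estimate: {shallow:,} bytes; deep estimate: {deep:,} bytes")

# Use the deep (object-aware) estimate to choose the partition count.
df = df.repartition(npartitions=int(1 + deep // n))
```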