What is the role of npartitions in a Dask dataframe?

挽巷 2021-02-19 11:35

I see the parameter npartitions in many functions, but I don't understand what it is good for / used for.

http://dask.pydata.org/en/latest/dataframe-api.html

1 answer
  •  不要未来只要你来
    2021-02-19 12:11

    The npartitions property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways.

    1. If you don't have enough partitions, you may not be able to use all of your cores effectively. For example, if your dask.dataframe has only one partition, then only one core can operate at a time.
    2. If you have too many partitions, the scheduler may incur a lot of overhead deciding where to compute each task.

    Generally you want a few times more partitions than you have cores. Every task takes up a few hundred microseconds in the scheduler.
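
As a rough illustration of that rule of thumb, here is a minimal sketch. The helper names, the factor of 3, and the 200 microsecond per-task cost are assumptions picked for the example (stand-ins for "a few times" and "a few hundred microseconds"), not values taken from Dask itself.

```python
import os


def suggest_npartitions(n_cores, factor=3):
    """Hypothetical helper: aim for a few partitions per core."""
    return max(1, n_cores * factor)


def scheduler_overhead_seconds(n_tasks, per_task_us=200):
    """Hypothetical helper: total scheduler time if every task costs
    roughly per_task_us microseconds of bookkeeping."""
    return n_tasks * per_task_us / 1_000_000


cores = os.cpu_count() or 1
print("suggested partitions:", suggest_npartitions(cores))

# With the assumed 200 us per task, a million tasks would spend
# minutes in the scheduler alone, before any real work is done.
print("overhead (s):", scheduler_overhead_seconds(1_000_000))
```

The point is only the order of magnitude: a handful of partitions per core keeps every core busy, while millions of tiny partitions let scheduling overhead dominate.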

    You can set the number of partitions either at data ingestion time, using parameters like blocksize= in read_csv(...), or afterwards, using the .repartition(...) method.
