I see the parameter npartitions in many functions, but I don't understand what it is good for or used for.
http://dask.pydata.org/en/latest/dataframe-api.html
The npartitions property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways: with too few partitions, Dask cannot keep all of your cores busy; with too many, scheduling overhead adds up, because every task takes up a few hundred microseconds in the scheduler. Generally you want a few times more partitions than you have cores.
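To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in plain Python. The 200 microseconds per task and the factor of 3 partitions per core are assumptions picked for illustration from "a few hundred microseconds" and "a few times more"; they are not measured values, and the helper names are made up for this example.

```python
import os

# Assumed figures, for illustration only (not measured, not Dask constants).
OVERHEAD_PER_TASK_S = 200e-6   # "a few hundred microseconds" per task
PARTITIONS_PER_CORE = 3        # "a few times more partitions than cores"


def suggested_npartitions(n_cores, per_core=PARTITIONS_PER_CORE):
    """Rule-of-thumb partition count: a few partitions per core."""
    return max(1, n_cores * per_core)


def scheduler_overhead_s(n_tasks, per_task=OVERHEAD_PER_TASK_S):
    """Rough total scheduler cost, in seconds, of running n_tasks tasks."""
    return n_tasks * per_task


n_cores = os.cpu_count() or 1
print("suggested npartitions:", suggested_npartitions(n_cores))
for n_tasks in (10, 10_000, 1_000_000):
    print(n_tasks, "tasks: about", scheduler_overhead_s(n_tasks), "s of overhead")
```

Under these assumptions a handful of tasks costs next to nothing, while a million tasks costs minutes of pure scheduling, which is why very small partitions hurt.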
You can determine the number of partitions either at data ingestion time, using parameters like blocksize= in read_csv(...), or afterwards by using the .repartition(...) method.