I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel.
Say I have some sales data keyed by a customerKey column.
You can set your column to be the index
df = df.set_index('customerKey')
This will sort your data by that column and track which ranges of values land in which partition. As you note, this is likely to be an expensive operation, so you'll probably want to save the result somewhere
Either in memory
df = df.persist()
or on disk
df.to_parquet('...')
df = dd.read_parquet('...')  # dd is the dask.dataframe module; read_parquet is not a DataFrame method
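
Putting it together, here is a minimal sketch of the whole round trip. The sample values, npartitions=2, the 'sales_by_customer.parquet' path, and the .loc[2] lookup are all made-up placeholders for illustration; swap in your real data and paths.

import pandas as pd
import dask.dataframe as dd

# Hypothetical sales data; replace with however you actually load yours.
pdf = pd.DataFrame({
    'customerKey': [1, 1, 2, 2, 3],
    'amount': [10.0, 15.0, 7.5, 3.0, 22.0],
})
df = dd.from_pandas(pdf, npartitions=2)

# Sort by customerKey and record which key ranges live in which partition.
df = df.set_index('customerKey')

# Keep the shuffled result in memory so the expensive sort isn't repeated.
df = df.persist()

# With customerKey as the index and known divisions, pulling one customer's
# rows only touches the partition holding that key range.
print(df.loc[2].compute())

# Or save the sorted result to disk and load it back in a later session.
df.to_parquet('sales_by_customer.parquet')
df = dd.read_parquet('sales_by_customer.parquet')

Once the index is in place, per-customer selections and similar operations can be scheduled partition by partition in parallel instead of scanning the whole dataset.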