Dask dataframe split partitions based on a column or function

情话喂你 2021-02-14 14:37

I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel.

Say I have some sales data like this:

    cu
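The sample above is cut off; for concreteness, here is a hypothetical stand-in with the customerKey column the answers below rely on (all values invented):

    import pandas as pd
    import dask.dataframe as dd

    # Invented sales rows; only customerKey is taken from the answers,
    # everything else is a placeholder.
    pdf = pd.DataFrame({
        'customerKey':  [1, 1, 2, 2, 3],
        'salesDollars': [100.0, 250.0, 300.0, 80.0, 120.0],
    })
    df = dd.from_pandas(pdf, npartitions=2)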
2 Answers
  • 2021-02-14 15:05

    You can set your column to be the index

    df = df.set_index('customerKey')
    

    This will sort your data by that column and track which ranges of values are in which partition. As you note, this is likely to be an expensive operation, so you'll probably want to save the result somewhere

    Either in memory

    df = df.persist()
    

    or on disk

    import dask.dataframe as dd

    df.to_parquet('...')
    df = dd.read_parquet('...')
    
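    Once the index is set, Dask records the partition boundaries in df.divisions and can route index lookups to a single partition. A small sketch, assuming integer customer keys (42 is an invented key):

    print(df.divisions)  # partition boundaries along customerKey

    # Goes straight to the one partition whose range covers key 42,
    # rather than scanning every partition.
    one_customer = df.loc[42].compute()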
  • 2021-02-14 15:07

    Setting the index to the required column and then using map_partitions is much more efficient than a groupby; a sketch of the pattern is below.
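
    A minimal sketch of that pattern, assuming the customerKey index from the first answer and a hypothetical salesDollars column:

    def per_customer_summary(part):
        # part is a plain pandas DataFrame. After set_index('customerKey'),
        # every row for a given customer lives in exactly one partition, so
        # a groupby inside each partition sees all rows for its customers.
        return part.groupby('customerKey').salesDollars.agg(['sum', 'count'])

    result = df.map_partitions(per_customer_summary).compute()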
