Dask dataframe split partitions based on a column or function

情话喂你 2021-02-14 14:37

I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel.

Say I have some sales data like this:

cu         


        
2 Answers
  •  后悔当初
    2021-02-14 15:05

    You can set your column to be the index

    df = df.set_index('customerKey')
    

    This will sort your data by that column and track which ranges of values are in which partition. As you note, this is likely to be an expensive operation, so you'll probably want to save the result somewhere.
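
    A minimal sketch of what this looks like, using a made-up sales table (the `customerKey` column name comes from the question; the data values are invented for illustration). After `set_index`, the dataframe's `divisions` record the index boundaries of each partition:

    ```python
    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical sales data standing in for the truncated sample above
    pdf = pd.DataFrame({
        "customerKey": [3, 1, 4, 1, 5, 9, 2, 6],
        "amount": [10.0, 20.0, 5.0, 7.5, 3.0, 12.0, 8.0, 1.0],
    })
    df = dd.from_pandas(pdf, npartitions=2)

    # set_index shuffles and sorts by the column, and records
    # the min/max index value held by each partition
    df = df.set_index("customerKey")

    print(df.known_divisions)  # True: Dask now knows the partition boundaries
    print(df.divisions)        # tuple of index values marking partition edges
    ```

    Because the divisions are known, later operations that align with the index (lookups, joins, per-customer groupbys) can target the right partition instead of shuffling everything again.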

    Either in memory

    df = df.persist()
    

    or on disk

    df.to_parquet('...')
    df = dd.read_parquet('...')
    

    (where `dd` is the usual `import dask.dataframe as dd` — `read_parquet` is a module-level function, not a dataframe method)
    
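
    Putting the pieces together, here is a hedged end-to-end sketch (the data, the temporary directory, and the per-customer sum are all illustrative, not from the question; it assumes a Parquet engine such as pyarrow is installed). The sorted index survives the Parquet round trip, so per-customer work afterwards is cheap:

    ```python
    import tempfile
    import pandas as pd
    import dask.dataframe as dd

    # Invented sales data; 'customerKey' is the column from the question
    pdf = pd.DataFrame({
        "customerKey": [2, 1, 3, 1, 2, 3],
        "amount": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    })

    # Pay the expensive sort once
    df = dd.from_pandas(pdf, npartitions=2).set_index("customerKey")

    with tempfile.TemporaryDirectory() as path:
        # Save the sorted layout to disk, then reload it
        df.to_parquet(path)
        df2 = dd.read_parquet(path)

        # Per-customer aggregation on the index column
        totals = df2.groupby("customerKey")["amount"].sum().compute()

    print(totals)
    ```

    `persist()` keeps the sorted dataframe in distributed memory instead, which is preferable when it fits and you will reuse it within the same session.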
