Sorting in Dask

≯℡__Kan透↙ submitted on 2019-12-05 13:38:46

So far, Dask does not seem to support sorting by multiple columns. However, creating a new column that concatenates the values of the columns you want to sort by may be a usable workaround.

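# Build a composite string key from the two sort columns.
d['new_column'] = d.apply(lambda r: str([r.col1, r.col2]), axis=1)
# set_index sorts the data by the new key across partitions.
d = d.set_index('new_column')
# Make sure each partition is also sorted internally.
d = d.map_partitions(lambda x: x.sort_index())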
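As a rough illustration of what such a string key does to ordering, here is a minimal pure-Python sketch (no Dask involved; the sample rows and the helper name string_key are made up for this example). It suggests that the string key compares character by character, so it should not be relied on for numeric columns:

```python
# Hypothetical sample rows standing in for (col1, col2) pairs.
rows = [(2, 1), (10, 1), (1, 5)]


def string_key(row):
    """Mimic str([r.col1, r.col2]) from the snippet above."""
    return str(list(row))


# Sorting by the string key compares text, not numbers.
by_string_key = sorted(rows, key=string_key)

# Sorting the tuples directly gives the numeric multi-column order.
by_tuple = sorted(rows)

print(by_string_key)
print(by_tuple)
```

With integers, "[10, 1]" sorts before "[2, 1]" because "1" precedes "2" as a character, so the two orderings differ.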

Edit: The string-key approach above works if you want to sort by two string columns. For other types, I recommend creating integer (or bytes) columns and then using struct.pack to build a new composite bytes column. For example, if col1_dt is a datetime and col2 is an integer:

import struct
import numpy as np

# Create a time delta with one-second resolution.
# I know this resolution is correct for my data.
d['col1_int'] = ((d['col1_dt'] -
                  d['col1_dt'].min())/np.timedelta64(1,'s')
                ).astype(int)

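# Pack big-endian (">") so that byte-wise order matches numeric order;
# this only holds for non-negative values. axis=1 applies the function row by row.
d['new_column'] = d.apply(lambda r: struct.pack(">qq", int(r.col1_int), int(r.col2)), axis=1)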
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())
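
One detail worth checking when packing integers into a bytes key is the byte order, because bytes are compared from the first byte onward. The following is a minimal standard-library sketch (the sample pairs and helper names are invented for illustration) comparing big-endian and little-endian packing of non-negative integers:

```python
import struct

# Hypothetical (col1_int, col2) pairs; all values are non-negative.
pairs = [(256, 0), (1, 7), (1, 2), (0, 300)]


def big_endian_key(pair):
    """Pack two 64-bit integers with the most significant byte first."""
    return struct.pack(">qq", *pair)


def little_endian_key(pair):
    """Pack two 64-bit integers with the least significant byte first."""
    return struct.pack("<qq", *pair)


# Bytes compare lexicographically, so byte order decides the sort result.
sorted_big = sorted(pairs, key=big_endian_key)
sorted_little = sorted(pairs, key=little_endian_key)

print(sorted_big)
print(sorted_little)
```

In this sketch only the big-endian key reproduces the ordinary numeric ordering of the pairs. A format without an explicit prefix, such as "ll", uses the machine's native byte order, which is little-endian on common x86 hardware. Negative values would need extra handling (for example an offset), since their two's-complement bytes sort after those of positive values.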