I have the following problem:
I have a set of several hdf5 files containing similar data frames, which I want to sort globally based on multiple columns.
My input is the file names and an ordered list of columns I want to use for sorting. The output should be a single hdf5 file containing all the sorted data.
Each file can contain millions of rows. I can afford loading a single file in memory but not the entire dataset.
Naively, I would first copy all the data into a single hdf5 file (which is not difficult) and then find a way to sort this huge file in memory.
Is there a quick way to do an in-memory sort, on multiple columns, of a pandas data structure stored in an hdf5 file?
I have already looked at ptrepack, but it seems to allow sorting on only a single column.
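
To make the goal concrete, here is a minimal sketch of an external merge sort, which is one possible approach and not something pandas or ptrepack is known to provide directly. It assumes each file can be sorted on its own in memory, and it uses plain lists of dicts as hypothetical stand-ins for the rows of each hdf5 file. The names `sort_run`, `merge_sorted_runs`, `file_a`, `file_b` and the column names are made up for illustration.

```python
import heapq
from operator import itemgetter


def sort_run(rows, sort_columns):
    """Sort one file's worth of rows in memory (assumed to fit in RAM)."""
    return sorted(rows, key=itemgetter(*sort_columns))


def merge_sorted_runs(runs, sort_columns):
    """Lazily merge already-sorted runs into one globally sorted stream."""
    return heapq.merge(*runs, key=itemgetter(*sort_columns))


# Hypothetical stand-ins for the contents of two hdf5 files.
file_a = [{"run": 2, "event": 1, "x": 0.5}, {"run": 1, "event": 3, "x": 0.1}]
file_b = [{"run": 1, "event": 2, "x": 0.9}, {"run": 2, "event": 0, "x": 0.3}]

columns = ["run", "event"]  # ordered list of sort columns

# Step 1: sort each file independently. Step 2: k-way merge of the sorted runs.
runs = [sort_run(rows, columns) for rows in (file_a, file_b)]
merged = list(merge_sorted_runs(runs, columns))
keys = [(row["run"], row["event"]) for row in merged]
```

With real data, each sorted run would presumably be read back in chunks (for example through `HDFStore.select` with `chunksize`) and the merged rows appended to the output store, so that only a small window of every file is in memory at once.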
Source: https://stackoverflow.com/questions/24526254/pandas-in-memory-sorting-hdf5-files