Question
Got a big data set that I want to shuffle. The entire set won't fit into RAM, so it would be good if I could open several files (e.g. hdf5, numpy) simultaneously, loop through my data chronologically, and randomly assign each data point to one of the piles (then afterwards shuffle each pile).
I'm really inexperienced with working with data in Python, so I'm not sure if it's possible to write to a file without holding the rest of its contents in RAM (I've been using np.save and savez with little success).
Is this possible in h5py or numpy and, if so, how could I do it?
Answer 1:
Memory-mapped files will allow for what you want. They create a numpy array that leaves the data on disk, only loading data as needed. The complete manual page is here. However, the easiest way to use them is by passing the argument mmap_mode='r+' or mmap_mode='w+' in the call to np.load, which leaves the file on disk (see here).
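A minimal sketch of the memory-mapped approach described above (the file name data.npy is just an illustration):

```python
import numpy as np

# Save an array to disk (small here; the same call works for huge arrays).
np.save("data.npy", np.arange(10, dtype=np.float64))

# Load it memory-mapped: the data stays on disk, and only the pages you
# actually touch are read into RAM.
mm = np.load("data.npy", mmap_mode="r")

print(type(mm))   # a numpy.memmap, which behaves like an ndarray
print(mm[3])      # reading one element pulls in only that region of the file
```

With mmap_mode="r" the array is read-only; "r+" lets you modify the existing file in place, and "w+" creates or overwrites it.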
I'd suggest using advanced indexing. If you have data in a one-dimensional array arr, you can index it with a list, so arr[[0, 3, 5]] will give you the 0th, 3rd, and 5th elements of arr. That will make selecting the shuffled versions much easier. Since this would otherwise overwrite the data, you'll need to open the files on disk read-only and create copies (using mmap_mode='w+') to put the shuffled data in.
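The shuffle-into-a-copy idea above can be sketched as follows. The file names and the chunk size are illustrative assumptions; np.lib.format.open_memmap is used to create the writable on-disk copy because it writes a proper .npy header for a memmap:

```python
import numpy as np

# Stand-in for the large on-disk data set (hypothetical file name).
np.save("data.npy", np.arange(10, dtype=np.float64))

# Open the original read-only so it cannot be clobbered.
src = np.load("data.npy", mmap_mode="r")
n = src.shape[0]

# Create an on-disk copy to receive the shuffled data.
dst = np.lib.format.open_memmap(
    "shuffled.npy", mode="w+", dtype=src.dtype, shape=src.shape
)

perm = np.random.permutation(n)   # the shuffle order
chunk = 4                         # tiny chunk size, just for illustration
for start in range(0, n, chunk):
    idx = perm[start:start + chunk]
    # Advanced indexing with a list/array of positions, as described above.
    dst[start:start + chunk] = src[idx]

dst.flush()                       # ensure everything is written to disk
```

Writing in chunks keeps peak RAM usage at roughly one chunk of data, regardless of the total file size.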
Source: https://stackoverflow.com/questions/56955283/python-can-i-write-to-a-file-without-loading-its-contents-in-ram