Python: Can I write to a file without loading its contents in RAM?

回眸只為那壹抹淺笑 提交于 2021-01-29 08:03:56

问题


Got a big data-set that I want to shuffle. The entire set won't fit into RAM so it would be good if I could open several files (e.g. hdf5, numpy) simultaneously, loop through my data chronologically and randomly assign each data-point to one of the piles (then afterwards shuffle each pile).

I'm really inexperienced with working with data in python so I'm not sure if it's possible to write to files without holding the rest of its contents in RAM (been using np.save and savez with little success).

Is this possible and in h5py or numpy and, if so, how could I do it?


回答1:


Memmory mapped files will allow for what you want. They create a numpy array which leaves the data on disk, only loading data as needed. The complete manual page is here. However, the easiest way to use them is by passing the argument mmap_mode=r+ or mmap_mode=w+ in the call to np.load leaves the file on disk (see here).

I'd suggest using advanced indexing. If you have data in a one dimensional array arr, you can index it using a list. So arr[ [0,3,5]] will give you the 0th, 3rd, and 5th elements of arr. That will make selecting the shuffled versions much easier. Since this will overwrite the data you'll need to open the files on disk read only, and create copies (using mmap_mode=w+) to put the shuffled data in.



来源:https://stackoverflow.com/questions/56955283/python-can-i-write-to-a-file-without-loading-its-contents-in-ram

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!