Question
I have a large file, almost 20 GB, with more than 20 million lines, each of which is a separate serialized JSON object. Reading the file line by line in a regular loop and manipulating the data from each line takes a lot of time.

Is there any state-of-the-art approach or best practice for reading a large file in parallel, in smaller chunks, in order to make processing faster?

I'm using Python 3.6.x.
Answer 1:
Unfortunately, no. Reading in files and operating on the lines read (such as JSON parsing or computation) is a CPU-bound operation, so there are no clever asyncio tactics to speed it up. In theory one could use multiprocessing and multiple cores to read and process in parallel, but having multiple threads reading the same file is bound to cause major problems. Because your file is so large, storing it all in memory and then parallelizing the computation is also going to be difficult.
Your best bet would be to head this problem off at the pass by partitioning the data (if possible) into multiple files, which could then open up safer doors to parallelism with multiple cores. Sorry there isn't a better answer AFAIK.
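If partitioning is an option, a minimal sketch of that idea might look like the following, using only the standard library. The file name, part size, and the per-record work are placeholders; the point is that each part file can then be handled by an independent process.

```python
import json
from multiprocessing import Pool

LINES_PER_PART = 1_000_000  # placeholder; tune to taste

def split_file(path):
    """Split a JSON-lines file into numbered part files and return their paths."""
    parts = []
    out, count, part_no = None, 0, 0
    with open(path, encoding='utf-8') as src:
        for line in src:
            if out is None:
                part_path = '{}.part{}'.format(path, part_no)
                out = open(part_path, 'w', encoding='utf-8')
                parts.append(part_path)
            out.write(line)
            count += 1
            if count >= LINES_PER_PART:
                out.close()
                out, count, part_no = None, 0, part_no + 1
    if out is not None:
        out.close()
    return parts

def process_part(part_path):
    """Parse one part file; replace the loop body with the real per-record work."""
    total = 0
    with open(part_path, encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)  # do something with record here
            total += 1
    return total

if __name__ == '__main__':
    parts = split_file('big.jsonl')    # one slow, sequential pass
    with Pool() as pool:               # then process the parts in parallel
        results = pool.map(process_part, parts)
    print(sum(results))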
Answer 2:
There are several possibilities, but first profile your code to find the bottlenecks. Maybe your processing does something slow that can be sped up, which would be vastly preferable to multiprocessing. If that does not help, you could try:
Use another file format. Parsing serialized JSON from text is not the fastest operation in the world, so you could store your data in a different format (for example HDF5), which could speed up processing (a conversion sketch is shown below).
Implement multiple worker processes, each reading a portion of the file (worker 1 reads lines 0 to 1 million, worker 2 reads lines 1 million to 2 million, etc.). You can orchestrate that with joblib or celery, depending on your needs. Integrating the results is the challenge; there you have to see what your needs are (map-reduce style?). Because of the GIL there is no real threading in Python, which makes this harder than in other languages, so maybe you could switch languages for that part. A sketch of the worker idea, using only the standard-library multiprocessing module, is also shown below.
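For the file-format suggestion, here is a hedged sketch of converting the JSON-lines file into an HDF5 table with pandas. It assumes pandas and PyTables are installed, that the records share a flat and reasonably consistent schema, and that the file names and column name are placeholders.

```python
import pandas as pd

# Read the JSON-lines file in chunks so it never has to fit in memory.
reader = pd.read_json('big.jsonl', lines=True, chunksize=500_000)
for chunk in reader:
    # format='table' allows appending chunk by chunk; string columns with
    # very variable lengths may need a min_itemsize hint.
    chunk.to_hdf('big.h5', key='records', mode='a',
                 format='table', append=True)

# Later reads can pull back only the columns that are actually needed.
df = pd.read_hdf('big.h5', key='records', columns=['some_column'])
```

The one-off conversion is still slow, but every subsequent read is far cheaper than re-parsing 20 million JSON strings.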
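For the worker-process suggestion, a minimal sketch using the standard-library multiprocessing module (rather than joblib or celery) follows. Each worker gets a disjoint byte range of the file, aligns itself to the next line boundary, and parses only the lines that start inside its range; the file name and process_line() are placeholders.

```python
import json
import os
from multiprocessing import Pool

PATH = 'big.jsonl'  # placeholder

def process_line(record):
    """Placeholder for the real per-record work."""
    return 1

def worker(byte_range):
    start, end = byte_range
    result = 0
    with open(PATH, 'rb') as f:
        f.seek(start)
        if start > 0:
            f.readline()  # skip the partial line; the previous worker owns it
        # Process every line that starts inside (or exactly at the end of)
        # this worker's byte range.
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break
            result += process_line(json.loads(line.decode('utf-8')))
    return result

if __name__ == '__main__':
    size = os.path.getsize(PATH)
    n_workers = os.cpu_count() or 4
    step = size // n_workers + 1
    ranges = [(i * step, min((i + 1) * step, size)) for i in range(n_workers)]
    with Pool(n_workers) as pool:
        print(sum(pool.map(worker, ranges)))  # combine per-worker results
```

How the per-worker results are combined is exactly the map-reduce question mentioned above: a sum is trivial, while order-dependent output needs more care.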
Source: https://stackoverflow.com/questions/50636059/how-to-read-process-large-files-in-parallel-with-python