Distributing a Python module - Spark vs Process Pools
Question

I've made a Python module that extracts handwritten text from PDFs. The extraction can sometimes be quite slow (20-30 seconds per file). I have around 100,000 PDFs (some with lots of pages) and I want to run the text extraction on all of them. Essentially something like this:

```python
fileNameList = ['file1.pdf', 'file2.pdf', ..., 'file100000.pdf']

for pdf in fileNameList:
    text = myModule.extractText(pdf)  # Distribute this function
    # Do stuff with text
```

We used Spark once before (a coworker, not me) to
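For a single machine, one way to distribute that loop is a process pool from the standard library. This is a minimal sketch, not the asker's actual setup: `extract_text` below is a hypothetical stand-in for `myModule.extractText`, and the worker count and chunk size are illustrative:

```python
from concurrent.futures import ProcessPoolExecutor

def extract_text(pdf_path):
    # Hypothetical placeholder for myModule.extractText(pdf_path);
    # swap in the real import when using this.
    return f"text from {pdf_path}"

def run_all(file_names, max_workers=4):
    # Each worker process pulls PDFs from the list; pool.map returns
    # results in the same order as file_names.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_text, file_names, chunksize=16))

if __name__ == "__main__":
    fileNameList = [f"file{i}.pdf" for i in range(1, 9)]
    texts = run_all(fileNameList)
```

Because the work is CPU-bound (OCR-style extraction), processes are preferable to threads, which would be serialized by the GIL; the functions passed to the pool must be defined at module top level so they can be pickled for the worker processes.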