Distributing Python module - Spark vs Process Pools


Question


I've made a Python module that extracts handwritten text from PDFs. The extraction can sometimes be quite slow (20-30 seconds per file). I have around 100,000 PDFs (some with lots of pages) and I want to run the text extraction on all of them. Essentially something like this:

fileNameList = ['file1.pdf','file2.pdf',...,'file100000.pdf']

for pdf in fileNameList:
    text = myModule.extractText(pdf) # Distribute this function
    # Do stuff with text

We used Spark once before (a coworker, not me) to distribute the indexing of a few million files from a SQL DB into Solr across a few servers. However, from what I've researched, Spark seems to be more for parallelizing large data sets, not so much for distributing a single task. For that, it looks like Python's built-in 'Process Pools' module would be better, and I could just run that on a single server with, say, 4 CPU cores.
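For reference, here is a rough sketch of what I mean by the Process Pools approach (assuming myModule.extractText just takes a file path, as above, and that myModule is importable by the worker processes):

from multiprocessing import Pool
import glob

import myModule  # the extraction module from the question; assumed importable by workers

fileNameList = sorted(glob.glob('*.pdf'))  # or the explicit list from above

def process(pdf):
    # Runs in a worker process; return the filename alongside the extracted text
    return pdf, myModule.extractText(pdf)

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # one worker per CPU core on this server
        for pdf, text in pool.imap_unordered(process, fileNameList):
            # Do stuff with text
            pass

imap_unordered yields results as each worker finishes, so a handful of very slow PDFs won't hold up the rest.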

I know SO is more for specific problems, but I just wanted some advice before I go down the entirely wrong road. For my use case, should I stick to a single server with Process Pools, or split it across multiple servers with Spark?


Answer 1:


This is a perfectly reasonable use case for Spark, since you can distribute the text-extraction task across multiple executors by placing the files on distributed storage. This would let you scale out your compute to process the files and write the results back out efficiently and easily with PySpark. You could even reuse your existing Python text-extraction code:

rdd = sc.binaryFiles("/path/to/files")  # yields (filename, file_bytes) pairs
processed = rdd.map(lambda pair: (pair[0], myModule.extract(pair[1])))

As your data volume increases, or you want more throughput, you can simply add additional nodes.
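A slightly fuller sketch of that approach (the paths are placeholders, "spark" is a SparkSession you create yourself, and myModule.extract is assumed to accept raw PDF bytes):

from pyspark.sql import SparkSession

import myModule  # must be available on the executors, e.g. shipped via --py-files

spark = SparkSession.builder.appName("pdf-text-extraction").getOrCreate()
sc = spark.sparkContext

# Each record is (filename, file_bytes); extraction runs in parallel across the executors
rdd = sc.binaryFiles("hdfs:///path/to/pdfs")
extracted = rdd.map(lambda pair: (pair[0], myModule.extract(pair[1])))

# Write the results back out, e.g. as Parquet, for downstream processing
spark.createDataFrame(extracted, ["filename", "text"]) \
    .write.mode("overwrite").parquet("hdfs:///path/to/output")

Note that the extraction module and any dependencies it needs must be installed on, or shipped to, every worker node.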



Source: https://stackoverflow.com/questions/48142859/distributing-python-module-spark-vs-process-pools
