Does it make sense to use Google DataFlow/Apache Beam to parallelize image processing or crawling tasks?


Question


I am considering Google DataFlow as an option for running a pipeline that involves steps like:

  1. Downloading images from the web;
  2. Processing images.

I like that DataFlow manages the lifetime of the VMs required to complete the job, so I don't need to start or stop them myself, but all the examples I have come across use it for data-mining-style tasks. I wonder whether it is a viable option for other batch tasks like image processing and crawling.


Answer 1:


This use case is a possible application for Dataflow/Beam.

If you want to do this in a streaming fashion, you could have a crawler generate URLs and add them to a PubSub or Kafka queue, and then write a Beam pipeline that does the following (a sketch is given after the list):

  1. Read from PubSub
  2. Download the website content in a ParDo
  3. Parse image URLs from the website in another ParDo*
  4. Download each image and process it, again with a ParDo
  5. Store the results in GCS, BigQuery, or elsewhere, depending on what information you want from the images.
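
A minimal sketch of these five steps with the Beam Python SDK is below. The PubSub topic, BigQuery table, and the helpers fetch_page, extract_image_urls and process_image are placeholder names invented for this example rather than anything prescribed by Beam; beam.Map and beam.FlatMap are just shorthand for simple ParDo steps.

    # Streaming crawl-and-process pipeline sketch (Beam Python SDK).
    # Topic, table and helper implementations are placeholders; swap in
    # whatever HTTP client and image library you prefer.
    import re
    import urllib.request

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


    def fetch_page(url):
        """Step 2: download the page body for one URL."""
        return url, urllib.request.urlopen(url, timeout=30).read()


    def extract_image_urls(page):
        """Step 3: pull <img src=...> links out of the page, one output per image."""
        _, html = page
        for src in re.findall(rb'<img[^>]+src="([^"]+)"', html):
            yield src.decode('utf-8', errors='ignore')


    def process_image(img_url):
        """Step 4: download the image and compute whatever you need from it."""
        data = urllib.request.urlopen(img_url, timeout=30).read()
        return {'url': img_url, 'size_bytes': len(data)}


    options = PipelineOptions()  # pass --runner=DataflowRunner, project, region, etc.
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        urls = (p
                | 'ReadUrls' >> beam.io.ReadFromPubSub(
                    topic='projects/my-project/topics/crawl-urls')     # step 1
                | 'DecodeUrls' >> beam.Map(lambda msg: msg.decode('utf-8')))
        pages = urls | 'FetchPage' >> beam.Map(fetch_page)             # step 2
        image_urls = pages | 'ExtractImageUrls' >> beam.FlatMap(extract_image_urls)  # step 3
        results = image_urls | 'ProcessImage' >> beam.Map(process_image)             # step 4
        results | 'WriteResults' >> beam.io.WriteToBigQuery(           # step 5
            'my-project:crawl.image_results',
            schema='url:STRING,size_bytes:INTEGER')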

You can do the same as a batch job by changing only the source you read the URLs from (see the snippet below).
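
For example, reusing the names from the sketch above, only the first read changes; the bucket path is a placeholder and the streaming option is left at its default:

    # Batch variant: read a bounded list of seed URLs instead of PubSub.
    urls = (p
            | 'ReadUrlList' >> beam.io.ReadFromText('gs://my-bucket/seed-urls.txt'))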

*After parsing those image URLs, you may also want to reshuffle your data to gain some parallelism, as sketched below.
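
Building on the same sketch, the reshuffle would sit between the parse step and the image-processing step. beam.Reshuffle() breaks fusion between the two ParDos, so the runner can rebalance the per-image work across workers instead of keeping it fused to the page that produced it:

    # Insert a fusion break after parsing so per-image work is redistributed.
    image_urls = (pages
                  | 'ExtractImageUrls' >> beam.FlatMap(extract_image_urls)
                  | 'BreakFusion' >> beam.Reshuffle())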



Source: https://stackoverflow.com/questions/44621488/does-it-make-sense-to-use-google-dataflow-apache-beam-to-parallelize-image-proce
