Dataflow Pipeline Slow

Submitted by 此生再无相见时 on 2019-12-24 10:12:50

Question


My Dataflow pipeline is running extremely slowly. It's processing approximately 4 elements/s with 30 worker threads. A single local machine running the same operations (but outside the Dataflow framework) is able to process 7 elements/s. The script is written in Python. Data is read from BigQuery.

The workers are n1-standard machines, and all appear to be at 100% CPU utilization.

The operations contained within the combine are:

  1. tokenize the record and apply stop-word filtering (nltk)
  2. stem each word (nltk)
  3. look up the word in a dictionary
  4. increment the count of that word in the dictionary
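The per-record work above can be sketched as follows. This is a minimal stand-in, not the asker's actual code: `STOP_WORDS` and `crude_stem` are hypothetical substitutes for `nltk.corpus.stopwords` and an nltk stemmer such as `PorterStemmer`, used here so the sketch runs without downloading nltk data.

```python
from collections import Counter

# Hypothetical stop-word set; a real pipeline would use
# nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def crude_stem(word):
    # Very rough suffix stripping, standing in for a real nltk stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def count_words(record, counts):
    # 1. tokenize the record and filter stop words
    tokens = [t.lower() for t in record.split() if t.lower() not in STOP_WORDS]
    # 2. stem each token
    stems = [crude_stem(t) for t in tokens]
    # 3-4. look up each stem and increment its count
    for s in stems:
        counts[s] += 1
    return counts

counts = count_words("the walking walker walked", Counter())
```

In a Beam pipeline these steps would typically live inside a `CombineFn`, with `counts` as the accumulator that gets merged across workers.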

Each record is approximately 30-70 KB. The total is ~9,000,000 records from BigQuery (the log shows all records were exported successfully).

With 30 worker threads, I expect throughput to be much higher than my single local machine, and certainly not half as fast.

What could the problem be?


Answer 1:


After some performance profiling and testing on datasets of multiple sizes, it appears that the problem is probably the huge size of the dictionaries. That is, the pipeline works fine for thousands of records (with throughput closer to 40 elements/s) but breaks down at millions. I'm closing this topic, as the rabbit hole goes deeper.

Since this is a problem with a specific use case, I thought it would not be relevant to continue on this thread. If you want to follow me on my adventure, the follow-up questions reside here



Source: https://stackoverflow.com/questions/48680416/dataflow-pipeline-slow
