Question
I am running a Dataflow job that reads from BigQuery, scans around 8 GB of data, and produces more than 50,000,000 records.
At the group-by step I want to group on a key and concatenate one column. After concatenation, that column grows to more than 100 MB, which is why the group-by has to be done in the Dataflow job: it cannot be done at the BigQuery level because of BigQuery's 100 MB row size limit.
The Dataflow job scales well when reading from BigQuery but gets stuck at the group-by step. I have two versions of the Dataflow code, and both get stuck there. The Stackdriver logs show messages similar to "processing stuck at lull for more than 1010 seconds" and "Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f618b406358>".
I expect the group-by step to complete within 20 minutes, but it stays stuck for more than an hour and never finishes.
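For reference, a minimal sketch of what such a pipeline might look like; the table, column, and key names are hypothetical and not taken from the original question:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions()  # project, region, runner, etc. supplied via CLI flags
    with beam.Pipeline(options=options) as p:
        (
            p
            # Hypothetical query; the real job scans ~8 GB and yields 50M+ rows.
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
                query="SELECT key_col, text_col FROM `my_project.my_dataset.my_table`",
                use_standard_sql=True)
            | "ToKV" >> beam.Map(lambda row: (row["key_col"], row["text_col"]))
            # The step that gets stuck: group all values per key...
            | "GroupByKey" >> beam.GroupByKey()
            # ...then concatenate them into one (possibly >100 MB) string per key.
            | "Concat" >> beam.MapTuple(lambda key, values: (key, ",".join(values)))
            | "Log" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()
```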
Answer 1:
I figured it out myself. Below are the 2 changes I made in my pipeline (a sketch follows this list):
1. I added a Combine function just after the Group by Key (see screenshot).
2. The Group by Key, when running on multiple workers, generates a lot of network traffic between workers, and by default the network we use does not allow inter-worker communication, so I had to create a firewall rule allowing traffic from one worker to another, i.e. allowing the workers' IP range as a source for internal traffic.
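The answer does not show the code for change #1. Below is a minimal sketch, assuming the concatenation is expressed as a `CombineFn` used with `CombinePerKey`, which lets the runner pre-combine values on each worker before the shuffle; the class and variable names are illustrative, not the author's actual code:

```python
import apache_beam as beam


class ConcatFn(beam.CombineFn):
    """Concatenates string values per key; accumulators are lists of parts."""

    def create_accumulator(self):
        return []

    def add_input(self, acc, value):
        acc.append(value)
        return acc

    def merge_accumulators(self, accumulators):
        merged = []
        for acc in accumulators:
            merged.extend(acc)
        return merged

    def extract_output(self, acc):
        return ",".join(acc)


# Usage: replaces the plain GroupByKey + concatenation Map with a single
# combining step, so partial concatenation happens worker-side before shuffle.
# concatenated = kv_pcoll | "ConcatPerKey" >> beam.CombinePerKey(ConcatFn())
```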
Source: https://stackoverflow.com/questions/57545513/dataflow-apache-beam-python-job-stuck-at-group-by-step