Apache Beam Dataflow runner throwing setup error

Submitted by 吃可爱长大的小学妹 on 2021-01-03 18:31:28

Question


We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:

A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.

But we could not find the detailed worker-startup logs.

We tried increasing the memory size, worker count, etc., but we are still getting the same error.

Here is the command we use:

python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--requirements_file=requirements.txt \
--worker_machine_type n1-standard-8 \
--num_workers 2

Pipeline snippet:

import apache_beam as beam

# Read rows from BigQuery
data = pipeline | "load data" >> beam.io.Read(
    beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100")
)

# Keep only rows whose column matches the expected value
data | "filter data" >> beam.Filter(lambda x: x.get('column_name') == value)

The above pipeline just loads data from BigQuery and filters it on a column value. It works like a charm with the DirectRunner but fails on Dataflow.

Are we making any obvious setup mistake? Is anyone else getting the same error? We could use some help resolving the issue.

Update:

Our pipeline code is spread across multiple files, so we created a Python package. We solved the setup error by passing the --setup_file argument instead of --requirements_file.
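
For reference, a sketch of the command that worked for us; it is identical to the one above except that --requirements_file is replaced with --setup_file (the setup.py path is a placeholder):

python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--setup_file=./setup.py \
--worker_machine_type n1-standard-8 \
--num_workers 2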


Answer 1:


We resolved this setup error by passing a different set of arguments to Dataflow. Our code is spread across multiple files, so we had to create a package for it. If you use --requirements_file, the job will start but eventually fail, because the workers can't find the package. The Beam Python SDK sometimes doesn't throw an explicit error message for this; instead, it retries the job and then fails. To get your code running as a package, pass the --setup_file argument, pointing to a setup.py that lists your dependencies. Make sure the package created by the python setup.py sdist command includes all the files your pipeline code requires.
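
To double-check what ends up in the package, you can build the source distribution locally and list its contents; the archive name below is a placeholder for whatever name and version your setup.py produces:

$ python setup.py sdist
$ tar -tzf dist/your_package-0.1.0.tar.gz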

If you have a privately hosted Python package dependency, pass --extra_package with the path to the package.tar.gz file. A better approach is to store the package in a GCS bucket and pass that path here.
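
For example, assuming the archive has already been uploaded to a bucket (the bucket and file names below are placeholders):

python run.py --setup_file=./setup.py --extra_package=gs://your-bucket/packages/private_package-0.1.0.tar.gz ...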

I have written an example project to help you get started with the Apache Beam Python SDK on Dataflow: https://github.com/RajeshHegde/apache-beam-example

Read about it here: https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366




Answer 2:


I'm building a prediction pipeline using Apache Beam/Dataflow and need to include model files in the dependencies available to the remote workers. The Dataflow job failed with the same error:

Error message from worker: A setup error was detected in beamapp-xxx-xxxxxxxxxx-xxxxxxxx-xxxx-harness-xxxx. Please refer to the worker-startup log for detailed information.

However, this error message didn't say where to find the worker-startup log. Eventually I found a way to get at the worker log and solve the problem.

Dataflow creates Compute Engine instances to run jobs and saves logs on them, so we can access a worker VM to see its logs. We can connect to the VM used by our Dataflow job from the GCP console via SSH, then check the boot-json.log file located in /var/log/dataflow/taskrunner/harness:

$ cd /var/log/dataflow/taskrunner/harness
$ cat boot-json.log

One thing to pay attention to: when running in batch mode, the VMs created by Dataflow are ephemeral and are shut down when the job fails. Once a VM is shut down, we can't access it anymore. However, a failing work item is retried up to 4 times, so normally there is enough time to open boot-json.log and see what is going on.
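
If you prefer the command line to the console, the same worker VM can usually be listed and reached with gcloud while it still exists (the zone below is a placeholder; the worker instance name starts with the job name):

$ gcloud compute instances list | grep beamapp
$ gcloud compute ssh beamapp-xxx-xxxxxxxxxx-xxxxxxxx-xxxx-harness-xxxx --zone=us-central1-a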

Finally, here is my Python setup solution; it may help someone else:

main.py

...
model_path = os.path.dirname(os.path.abspath(__file__)) + '/models/net.pd'
# pipeline code
...

MANIFEST.in

include models/*.*

setup.py complete example

import setuptools

# Runtime dependencies needed on the workers
REQUIRED_PACKAGES = [...]

setuptools.setup(
    ...
    include_package_data=True,
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages()
)

Run the Dataflow pipeline:

$ python main.py --setup_file=/absolute/path/to/setup.py ...


Source: https://stackoverflow.com/questions/49442702/apache-beam-dataflow-runner-throwing-setup-error
