Google Cloud DataFlow job throws alert after few hours

Submitted 2021-01-28 05:43:41

Question


Running a Dataflow streaming job using the 2.11.0 release, I get the following authentication error after a few hours:

File "streaming_twitter.py", line 188, in <lambda>
File "streaming_twitter.py", line 102, in estimate
File "streaming_twitter.py", line 84, in estimate_aiplatform
File "streaming_twitter.py", line 42, in get_service
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery.py", line 227, in build
    credentials=credentials)
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery.py", line 363, in build_from_document
    credentials = _auth.default_credentials()
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/_auth.py", line 42, in default_credentials
    credentials, _ = google.auth.default()
File "/usr/local/lib/python2.7/dist-packages/google/auth/_default.py", line 306, in default
    raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)
DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application.

This Dataflow job makes API requests to AI Platform Prediction, and it appears the authentication token is expiring.

Code snippet:

def get_service():
    # If it hasn't been instantiated yet: do it now
    return discovery.build('ml', 'v1',
                           discoveryServiceUrl=DISCOVERY_SERVICE,
                           cache_discovery=True)

I tried adding the following lines to the service function:

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/tmp/key.json"

But I get:

DefaultCredentialsError: File "/tmp/key.json" was not found. [while running 'generatedPtransform-930']

I assume that is because the file does not exist on the Dataflow worker machines. Another option is to use the developerKey parameter of the build method, but it doesn't seem to be supported by AI Platform Prediction; I get this error:

Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project. [while running 'generatedPtransform-22624']

I'm looking to understand how to fix this and what the best practice is.

Any suggestions?

  • Complete logs here
  • Complete code here

Answer 1:


Setting os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/tmp/key.json' only works locally with the DirectRunner. Once deploying to a distributed runner like Dataflow, each worker won't be able to find the local file /tmp/key.json.
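The error can be reproduced locally to see why the in-code workaround fails on a worker. The following is a minimal sketch, assuming the google-auth package is installed; the nonexistent key path is a stand-in for the file missing on the worker:

```python
import os

import google.auth
from google.auth.exceptions import DefaultCredentialsError

# Mimic a Dataflow worker where the key file set in code does not exist.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/tmp/nonexistent-key.json"

try:
    # googleapiclient's build() ends up calling google.auth.default()
    # when no explicit credentials are passed.
    credentials, project = google.auth.default()
except DefaultCredentialsError as err:
    print("Auth failed:", err)
```

This raises the same DefaultCredentialsError seen in the job logs, because Application Default Credentials checks the GOOGLE_APPLICATION_CREDENTIALS path first and fails when the file is absent.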

If you want each worker to use a specific service account, you can tell Beam which service account to use to identify workers.

First, grant the roles/dataflow.worker role to the service account you want your workers to use. There is no need to download the service account key file :)

Then if you're letting PipelineOptions parse your command line arguments, you can simply use the service_account_email option, and specify it like --service_account_email your-email@your-project.iam.gserviceaccount.com when running your pipeline.

The service account pointed to by your GOOGLE_APPLICATION_CREDENTIALS is only used to start the job; each worker uses the service account specified by service_account_email. If service_account_email is not passed, it defaults to the email from your GOOGLE_APPLICATION_CREDENTIALS file.



Source: https://stackoverflow.com/questions/58723809/google-cloud-dataflow-job-throws-alert-after-few-hours
