How do I install Python libraries automatically on Dataproc cluster startup?

Posted by 烂漫一生 on 2019-12-01 05:01:27

Initialization actions are the best way to do this. They are shell scripts that run on each node when the cluster is created, which lets you customize the cluster, for example by installing Python libraries. The scripts must be stored in Google Cloud Storage, and you can specify them when creating a cluster via the Google Cloud SDK or the Google Developers Console.
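As a rough sketch of that workflow with the Google Cloud SDK: the script is first copied to a Cloud Storage bucket and then passed to `gcloud dataproc clusters create` through `--initialization-actions`. The bucket, cluster, region, and file names below are hypothetical placeholders, and the `run` helper is only an illustrative dry-run wrapper that prints each command and executes nothing unless `DRY_RUN=0`.

```shell
#!/bin/bash
# Hypothetical names: replace with your own bucket, cluster, region, and script.
BUCKET="gs://my-bucket"
CLUSTER="my-cluster"
REGION="us-central1"
SCRIPT="install-pandas.sh"

# Print each command; actually execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

deploy_init_action() {
  # Upload the initialization action to Cloud Storage.
  run gsutil cp "${SCRIPT}" "${BUCKET}/${SCRIPT}"
  # Create the cluster and point it at the uploaded script.
  run gcloud dataproc clusters create "${CLUSTER}" \
    --region "${REGION}" \
    --initialization-actions "${BUCKET}/${SCRIPT}"
}
```

Calling `deploy_init_action` previews the two commands; `DRY_RUN=0 deploy_init_action` would run them against your project.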

Here is a sample initialization action that installs the Python pandas library at cluster creation, on the master node only:

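#!/bin/bash
# The [[ ... ]] test below is a bash feature, so the script must run under bash, not sh.
# Read this node's role from the instance metadata.
# Note: newer Dataproc images expose this key as attributes/dataproc-role.
ROLE=$(/usr/share/google/get_metadata_value attributes/role)
# Install pandas only on the master node.
if [[ "${ROLE}" == 'Master' ]]; then
  apt-get install -y python-pandas
fi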

As this script shows, you can determine a node's role with /usr/share/google/get_metadata_value attributes/role and then perform actions only on the master (or only on the worker) nodes.
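The same role check can be extended to install different packages on master and worker nodes. The following is a minimal sketch, not part of the original answer: `packages_for_role` and `install_for_role` are hypothetical helper names, the `Worker` role value is assumed to be what the metadata helper returns on worker nodes, and the worker package is purely illustrative.

```shell
#!/bin/bash
# Map a node role to the apt packages to install on it.
# The worker package is an illustrative assumption.
packages_for_role() {
  case "$1" in
    Master) echo "python-pandas" ;;
    Worker) echo "python-numpy" ;;
    *)      echo "" ;;
  esac
}

# Install the packages for the given role, if there are any.
install_for_role() {
  local pkgs
  pkgs="$(packages_for_role "$1")"
  if [ -n "${pkgs}" ]; then
    # Left unquoted on purpose so a space-separated list expands to several arguments.
    apt-get install -y ${pkgs}
  fi
}

# On a real cluster node, the role would come from the metadata helper:
#   install_for_role "$(/usr/share/google/get_metadata_value attributes/role)"
```

Keeping the role-to-package mapping in its own function makes it easy to check the logic locally before uploading the script.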

See the Google Cloud Dataproc documentation for more details.
