How do I install Python libraries automatically on Dataproc cluster startup?

风流意气都作罢 submitted on 2019-12-19 07:04:13

Question


How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of logging into the master and/or worker nodes to install the libraries I need by hand.

It would also be great to know whether this automated installation can install things only on the master and not on the workers.


Answer 1:


Initialization actions are the best way to do this. They are shell scripts that run on each node when the cluster is created, which lets you customize the cluster, for example by installing Python libraries. The scripts must be stored in Google Cloud Storage and can be specified when creating clusters via the Google Cloud SDK or the Google Developers Console.
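As a rough sketch of the Cloud SDK route, the script below prints the two commands involved: uploading the script to Cloud Storage and creating a cluster that references it. The cluster name, region, bucket, and script file name are all hypothetical placeholders, and the commands are only echoed for review rather than executed.

```shell
#!/bin/bash
# Hypothetical names for illustration; substitute your own.
CLUSTER_NAME="my-cluster"
REGION="us-central1"
BUCKET="gs://my-bucket"
SCRIPT="install-pandas.sh"

# Print the upload command: copies the local script into the bucket.
upload_cmd() {
  echo "gsutil cp ${SCRIPT} ${BUCKET}/${SCRIPT}"
}

# Print the create command: points the new cluster at the uploaded script.
create_cmd() {
  echo "gcloud dataproc clusters create ${CLUSTER_NAME} --region=${REGION} --initialization-actions=${BUCKET}/${SCRIPT}"
}

# The commands are printed, not run, so they can be checked first.
upload_cmd
create_cmd
```

Once the printed commands look right, they can be pasted into a shell where the Cloud SDK is installed and authenticated.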

Here is a sample initialization action that installs the Python pandas library at cluster creation, only on the master node.

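#!/bin/bash
# [[ ... ]] is a bash feature, so the script must run under bash, not sh.
# Read this node's role from the instance metadata. Current Dataproc
# documentation names the key attributes/dataproc-role; try that if this is empty.
ROLE=$(/usr/share/google/get_metadata_value attributes/role)
if [[ "${ROLE}" == 'Master' ]]; then
  # Install pandas on the master node only.
  apt-get install -y python-pandas
fi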

As you can see from this script, it is possible to discern the role of a node with /usr/share/google/get_metadata_value attributes/role and then perform actions specifically on the master (or worker) node.
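The same role check can be extended to give masters and workers different packages. The following is a minimal sketch, not a definitive implementation: the helper function packages_for_role is hypothetical, the package lists are purely illustrative, and it assumes the worker role is reported as 'Worker'. The install step is skipped when the metadata helper is absent, so the script does nothing outside a cluster node.

```shell
#!/bin/bash
# Sketch only: the package lists below are illustrative, not a recommendation.

# Print the space-separated packages to install for a given role.
# (Hypothetical helper; 'Worker' as the worker role value is an assumption.)
packages_for_role() {
  case "$1" in
    Master) echo "python-pandas" ;;
    Worker) echo "python-numpy" ;;
    *)      echo "" ;;
  esac
}

METADATA_HELPER=/usr/share/google/get_metadata_value

# Only attempt the install where the metadata helper exists (on a cluster node).
if [ -x "${METADATA_HELPER}" ]; then
  ROLE=$("${METADATA_HELPER}" attributes/role)
  PACKAGES=$(packages_for_role "${ROLE}")
  if [ -n "${PACKAGES}" ]; then
    # Word splitting is intended so each package becomes its own argument.
    apt-get install -y ${PACKAGES}
  fi
fi
```

Keeping the role-to-package mapping in one function means the list can be changed without touching the install logic.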

See the Google Cloud Dataproc documentation for more details.



Source: https://stackoverflow.com/questions/32745868/how-do-i-install-python-libraries-automatically-on-dataproc-cluster-startup
