How do I connect to a dataproc cluster with Jupyter notebooks from cloud shell

Posted by 生来就可爱ヽ(ⅴ<●) on 2019-12-08 04:05:51

Question


I have seen the instructions at https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook for setting up Jupyter notebooks with Dataproc, but I can't figure out how to alter the process in order to use Cloud Shell instead of creating an SSH tunnel locally. I have been able to connect to a Datalab notebook by running

datalab connect vmname 

from Cloud Shell and then using the preview function. I would like to do something similar, but with Jupyter notebooks and a Dataproc cluster.


Answer 1:


In theory, you can mostly follow the same instructions found at https://cloud.google.com/shell/docs/features#web_preview to use local port forwarding and reach your Jupyter notebooks on Dataproc via Cloud Shell's "web preview" feature. Run something like the following in your Cloud Shell (8123 being the port the Jupyter initialization action serves the notebook on):

gcloud compute ssh my-cluster-m -- -L 8080:my-cluster-m:8123

However, there are two issues which prevent this from working:

  1. You need to modify the Jupyter config to add the following to the bottom of /root/.jupyter/jupyter_notebook_config.py:

    c.NotebookApp.allow_origin = '*'
    
  2. Cloud Shell's web preview needs to add support for websockets.

If you don't do (1), you'll get popup errors when trying to create a notebook, because Jupyter refuses the Cloud Shell proxy domain. Unfortunately, (2) requires deeper support from Cloud Shell itself; it manifests as errors like "A connection to the notebook server could not be established".
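One way to apply (1) is to SSH to the master and append the setting there. A minimal sketch, assuming your master is named my-cluster-m (substitute your own cluster name) and that you restart Jupyter afterwards however your image supervises it:

    # SSH to the master node (cluster name is an assumption):
    gcloud compute ssh my-cluster-m

    # Then, on the master, append the CORS setting to the Jupyter config:
    echo "c.NotebookApp.allow_origin = '*'" | \
        sudo tee -a /root/.jupyter/jupyter_notebook_config.py
    # Restart the notebook server so the setting takes effect (how Jupyter
    # is supervised depends on the initialization action you used).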

Another possible option, without waiting for (2), is to run your own nginx proxy as part of the Jupyter initialization action on a Dataproc cluster, if you can get it to proxy websockets suitably. See this thread for a similar situation: https://github.com/jupyter/notebook/issues/1311
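For reference, the key part of such an nginx config is forwarding the websocket upgrade headers to the notebook server; a minimal sketch (the backend port 8123 and the single catch-all location are assumptions, not a tested config):

    # nginx reverse proxy for Jupyter with websocket pass-through (sketch)
    location / {
        proxy_pass http://localhost:8123;
        proxy_http_version 1.1;                  # websockets require HTTP/1.1
        proxy_set_header Upgrade $http_upgrade;  # forward the upgrade request
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }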

Generally, this type of broken websocket support in proxy layers is a common problem, since websockets are still relatively new; over time, more and more things will support them out of the box.

Alternatively:

Dataproc also supports a Datalab initialization action, which is set up so that websocket proxying is already taken care of. If you're not too attached to Jupyter specifically, the following works in Cloud Shell:

gcloud dataproc clusters create my-datalab-cluster \
    --initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh
gcloud compute ssh my-datalab-cluster-m -- -L 8080:my-datalab-cluster-m:8080

Then select the usual "Web Preview" on port 8080. Alternatively, you can bind to another Cloud Shell-supported local port, like:

gcloud compute ssh my-datalab-cluster-m -- -L 8082:my-datalab-cluster-m:8080

In which case you'd select 8082 as the web preview port.
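Before opening the web preview, you can sanity-check that the tunnel is up by probing the local port from another Cloud Shell tab; a sketch, with the port matching the local binding chosen above:

    # Should print an HTTP status code (e.g. 200 or 302) if the tunnel works:
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8082/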




Answer 2:


You can't connect to Dataproc through a Datalab installed on a standalone VM (a GCE instance).

As the documentation you mentioned explains, you must launch Dataproc with a Datalab initialization action.

Moreover, the datalab connect command is only available if you created the Datalab instance with the datalab create command.

You must instead create an SSH tunnel to your master node ("vmname-m" if your cluster name is "vmname") with:

gcloud compute ssh --zone YOUR-ZONE --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" "vmname-m"
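That command opens a SOCKS proxy on local port 1080 rather than forwarding a single port, so the next step is to point a browser at the proxy, following the pattern from the Dataproc web-interfaces docs. A sketch (the browser binary, temp profile dir, and Jupyter port 8123 are assumptions), run on a machine with a browser rather than in Cloud Shell itself:

    # Launch a browser that routes through the SOCKS tunnel, so hostnames
    # like vmname-m resolve inside the cluster's network:
    google-chrome \
        --proxy-server="socks5://localhost:1080" \
        --user-data-dir="/tmp/vmname-m" \
        "http://vmname-m:8123"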


Source: https://stackoverflow.com/questions/43402138/how-do-i-connect-to-a-dataproc-cluster-with-jupyter-notebooks-from-cloud-shell
