Question
I have seen the instructions at https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook for setting up Jupyter notebooks with Dataproc, but I can't figure out how to alter the process to use Cloud Shell instead of creating an SSH tunnel locally. I have been able to connect to a Datalab notebook by running
datalab connect vmname
from Cloud Shell and then using the web preview function. I would like to do something similar, but with Jupyter notebooks and a Dataproc cluster.
Answer 1:
In theory, you can mostly follow the instructions at https://cloud.google.com/shell/docs/features#web_preview and use local port forwarding to access your Jupyter notebooks on Dataproc through Cloud Shell's "web preview" feature. Something like the following in your Cloud Shell:
gcloud compute ssh my-cluster-m -- -L 8080:my-cluster-m:8123
However, there are two issues which prevent this from working:
1. You need to modify the Jupyter config by adding the following to the bottom of /root/.jupyter/jupyter_notebook_config.py:
c.NotebookApp.allow_origin = '*'
2. Cloud Shell's web preview needs to add support for websockets.
If you don't do (1), you'll get popup errors when trying to create a notebook, because Jupyter refuses the Cloud Shell proxy domain. Unfortunately, (2) requires deeper support from Cloud Shell itself; it manifests as errors like "A connection to the notebook server could not be established."
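As a rough sketch of step (1), the config change could be scripted so that it is safe to run more than once. The function name allow_any_origin is hypothetical (it is not part of Jupyter or Dataproc), and the default path is simply the one named above; this assumes you run it as root on the master node.

```shell
# Hypothetical helper: append the allow_origin setting to a Jupyter config
# file unless that exact line is already present.
# $1 (optional): path to the config file; defaults to the path named above.
allow_any_origin() {
  local config="${1:-/root/.jupyter/jupyter_notebook_config.py}"
  local setting="c.NotebookApp.allow_origin = '*'"
  mkdir -p "$(dirname "$config")"
  touch "$config"
  # -x matches whole lines only, -F treats the setting as a literal string.
  if ! grep -qxF "$setting" "$config"; then
    # If the file is non-empty and lacks a trailing newline, add one first
    # so the setting does not get glued onto the previous line.
    if [ -n "$(tail -c1 "$config")" ]; then
      printf '\n' >> "$config"
    fi
    printf '%s\n' "$setting" >> "$config"
  fi
}
```

Jupyter reads this file at startup, so the notebook server would most likely need to be restarted before the change takes effect. Note that this only addresses (1); the websocket limitation in (2) remains.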
Another possible option without waiting for (2) is to run your own nginx proxy as part of the Jupyter initialization action on a Dataproc cluster, if you can get it to proxy websockets suitably. See this thread for a similar situation: https://github.com/jupyter/notebook/issues/1311
Generally, broken websocket support in proxy layers is a common problem because websockets are still relatively new; over time, more and more things will start to support them out of the box.
Alternatively:
Dataproc also supports a Datalab initialization action, which is set up so that websocket proxying is already taken care of. So if you don't depend on Jupyter specifically, the following works in Cloud Shell:
gcloud dataproc clusters create my-datalab-cluster \
--initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh
gcloud compute ssh my-datalab-cluster-m -- -L 8080:my-datalab-cluster-m:8080
Then select the usual "Web Preview" on port 8080. You can also bind a different port that Cloud Shell supports on the local side, for example:
gcloud compute ssh my-datalab-cluster-m -- -L 8082:my-datalab-cluster-m:8080
In which case you'd select 8082 as the web preview port.
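Since the two commands above differ only in the local port, the pattern can be captured in a small helper. This is just a sketch: tunnel_command is a hypothetical name, it assumes the master node is named after the cluster with an "-m" suffix as in the examples above, and it only prints the command so you can inspect it before running it.

```shell
# Hypothetical helper: print the port-forwarding command for a cluster's
# master node instead of running it.
# $1: cluster name
# $2: local port to select in Web Preview (default 8080)
# $3: port the notebook server listens on at the master (default 8080)
tunnel_command() {
  local cluster="$1" local_port="${2:-8080}" remote_port="${3:-8080}"
  local master="${cluster}-m"
  printf 'gcloud compute ssh %s -- -L %s:%s:%s\n' \
    "$master" "$local_port" "$master" "$remote_port"
}
```

For example, tunnel_command my-datalab-cluster 8082 prints the second command shown above.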
Answer 2:
You can't connect to Dataproc through a Datalab instance installed on a separate VM (on GCE).
As the documentation you mentioned says, you must launch a Dataproc cluster with a Datalab initialization action.
Moreover, the datalab connect command is only available if you created the Datalab instance with the datalab create command.
You must create an SSH tunnel to your master node ("vmname-m" if your cluster name is "vmname"). The following opens a SOCKS proxy on local port 1080 (-D 1080) without running a remote command (-N):
gcloud compute ssh --zone YOUR-ZONE --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" "vmname-m"
Source: https://stackoverflow.com/questions/43402138/how-do-i-connect-to-a-dataproc-cluster-with-jupyter-notebooks-from-cloud-shell