How do I run pyspark with jupyter notebook?

Submitted by 做~自己de王妃 on 2020-01-21 05:47:06

Question


I am trying to fire up a Jupyter notebook when I run the command pyspark in the console. Right now it only starts an interactive shell in the console, which is not convenient for typing long blocks of code. Is there a way to connect a Jupyter notebook to the pyspark shell? Thanks.


Answer 1:


Assuming you have Spark installed wherever you are going to run Jupyter, I'd recommend you use findspark. Once you pip install findspark, you can just

import findspark
findspark.init()  # locate the Spark installation and put it on sys.path

import pyspark
sc = pyspark.SparkContext(appName="myAppName")  # create the SparkContext as usual

... and go
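
To sanity-check that the context works, here is a minimal sketch (the dataset and numbers are just an illustrative example):

rdd = sc.parallelize(range(100))  # distribute a small dataset
print(rdd.sum())                  # sums 0..99, so this should print 4950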




Answer 2:


I'm assuming you already have Spark and Jupyter notebooks installed and that they work flawlessly independently of each other.

If that is the case, then follow the steps below and you should be able to fire up a jupyter notebook with a (py)spark backend.

  1. Go to your Spark installation folder; there should be a bin directory in it: /path/to/spark/bin

  2. Create a file there; let's call it start_pyspark.sh

  3. Open start_pyspark.sh and write something like:

        #!/bin/bash

        # Python executable the Spark workers will use
        export PYSPARK_PYTHON=/path/to/anaconda3/bin/python

        # Run the driver through Jupyter instead of the plain interactive shell
        export PYSPARK_DRIVER_PYTHON=/path/to/anaconda3/bin/jupyter
        export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"

        # Forward any extra arguments (e.g. --packages) straight to pyspark
        pyspark "$@"


Replace the /path/to ... placeholders with the paths where your python and jupyter binaries are installed, respectively.

  4. Most probably this step is already done, but just in case:
     modify your ~/.bashrc file by adding the following lines

        # Spark
        export PATH="/path/to/spark/bin:/path/to/spark/sbin:$PATH"
        export SPARK_HOME="/path/to/spark"
        export SPARK_CONF_DIR="/path/to/spark/conf"
    

Run source ~/.bashrc and you are set.

Go ahead and try start_pyspark.sh; see the snippet below if the shell complains about permissions.
You can also pass arguments to the script, e.g. start_pyspark.sh --packages dibbhatt:kafka-spark-consumer:1.0.14.
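
One detail the steps above gloss over: the script needs execute permission before you can run it directly. A minimal sketch:

    chmod +x start_pyspark.sh    # one-time: make the script executable
    ./start_pyspark.sh           # launches Jupyter with a pyspark backend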

Hope it works out for you.




Answer 3:


cd project-folder/
pip install virtualenv   # install the virtualenv tool
virtualenv venv          # create an isolated environment in ./venv

This should create a folder "venv/" inside your project folder.

Activate the virtualenv and install Jupyter into it by typing

source venv/bin/activate   # activate the environment
pip install jupyter        # install Jupyter inside it

This should activate your virtualenv. Then open ~/.bash_profile and add

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Then run source ~/.bash_profile in the console. You should be good to go after this: if you type pyspark in the console, a Jupyter notebook will fire up.

You can also check that the shell's predefined sqlContext object is available in your notebook by typing sqlContext in a cell and executing it.
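
As a further sanity check in a notebook cell (a minimal sketch, assuming Spark 2.x, where the shell also predefines spark and sc):

print(sc)                      # the SparkContext created by the pyspark shell
print(spark.range(5).count())  # runs a tiny job; should print 5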




Answer 4:


Save yourself a lot of configuration headaches and just run a Docker container: https://hub.docker.com/r/jupyter/all-spark-notebook/
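
For example (a minimal sketch; the port mapping and image name follow the image's own documentation):

docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook
# then open the URL with the token that Jupyter prints on startup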




Answer 5:


Download Spark from the website; I downloaded spark-2.2.0-bin-hadoop2.7 and have jupyter-notebook installed.

mak@mak-Aspire-A515-51G:~$ chmod -R 777 spark-2.2.0-bin-hadoop2.7
mak@mak-Aspire-A515-51G:~$ export SPARK_HOME='/home/mak/spark-2.2.0-bin-hadoop2.7'
mak@mak-Aspire-A515-51G:~$ export PATH=$SPARK_HOME/bin:$PATH
mak@mak-Aspire-A515-51G:~$ export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
mak@mak-Aspire-A515-51G:~$ export PYSPARK_DRIVER_PYTHON="jupyter"
mak@mak-Aspire-A515-51G:~$ export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
mak@mak-Aspire-A515-51G:~$ export PYSPARK_PYTHON=python3

Go to the Spark python directory, start python3, and import pyspark; it will succeed:

 mak@mak-Aspire-A515-51G:~/spark-2.2.0-bin-hadoop2.7/python$ python3
 Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) 
 [GCC 7.2.0] on linux
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import pyspark

Likewise, launch jupyter-notebook from the same directory, and import pyspark then works in a notebook cell:

mak@mak-Aspire-A515-51G:~/spark-2.2.0-bin-hadoop2.7/python$ jupyter-notebook
import pyspark

If you want to open Jupyter from outside the Spark directory, follow the steps below.

mak@mak-Aspire-A515-51G:~$ pip3 install findspark

mak@mak-Aspire-A515-51G:~$ python
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyspark'
>>> import findspark
>>> findspark.init('/home/mak/spark-2.2.0-bin-hadoop2.7')
>>> import pyspark


mak@mak-Aspire-A515-51G:~$ jupyter-notebook

Then, in a notebook cell:

import findspark
findspark.init('/home/mak/spark-2.2.0-bin-hadoop2.7')
import pyspark
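
Once pyspark imports cleanly, a session can be created the usual way (a minimal sketch; "notebook" is an arbitrary app name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook").getOrCreate()  # entry point for DataFrame work
spark.range(5).show()  # small smoke test: displays ids 0..4
spark.stop()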


Source: https://stackoverflow.com/questions/48915274/how-do-i-run-pyspark-with-jupyter-notebook
