Question
I am trying to fire up a Jupyter notebook when I run the command pyspark in the console. When I type it now, it only starts an interactive shell in the console. However, this is not convenient for typing long lines of code. Is there a way to connect the Jupyter notebook to the pyspark shell? Thanks.
Answer 1:
Assuming you have Spark installed wherever you are going to run Jupyter, I'd recommend you use findspark. Once you pip install findspark, you can just
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
... and go
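Once the context is up, a quick sanity check (a minimal sketch using the sc created above; the numbers are arbitrary) confirms the notebook can reach Spark:
print(sc.version)                  # Spark version the context is running against
rdd = sc.parallelize(range(100))   # small local RDD
print(rdd.count())                 # expect 100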
Answer 2:
I'm assuming you already have Spark and Jupyter notebooks installed and that they work flawlessly independently of each other.
If that is the case, then follow the steps below and you should be able to fire up a Jupyter notebook with a (py)spark backend.
Go to your Spark installation folder; there should be a bin directory there: /path/to/spark/bin
Create a file; let's call it start_pyspark.sh
Open start_pyspark.sh and write something like:
#!/bin/bash
export PYSPARK_PYTHON=/path/to/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/path/to/anaconda3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
pyspark "$@"
Replace the /path/to ... with the paths where you have installed your python and jupyter binaries respectively.
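One extra step the answer does not mention but which is usually needed (an assumption; adjust the path to wherever you saved the script): make it executable.
chmod +x /path/to/spark/bin/start_pyspark.sh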
Most probably this step is already done, but just in case: modify your ~/.bashrc file by adding the following lines:
# Spark
export PATH="/path/to/spark/bin:/path/to/spark/sbin:$PATH"
export SPARK_HOME="/path/to/spark"
export SPARK_CONF_DIR="/path/to/spark/conf"
Run source ~/.bashrc and you are set.
Go ahead and try start_pyspark.sh.
You could also give arguments to the script, something like start_pyspark.sh --packages dibbhatt:kafka-spark-consumer:1.0.14.
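Once the notebook is up, a quick way to confirm the (py)spark backend is attached (a minimal sketch; launching through pyspark normally predefines sc, and spark on Spark 2.x, in the kernel):
print(sc.master)        # e.g. local[*] unless you passed a --master argument
spark.range(5).show()   # Spark 2.x only, where the SparkSession 'spark' is predefined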
Hope it works out for you.
Answer 3:
cd project-folder/
pip install virtualenv
virtualenv venv
This should create a folder "venv/" inside your project folder.
Activate the virtualenv and install Jupyter inside it by typing
source venv/bin/activate
pip install jupyter
Then go to ~/.bash_profile and add the following lines:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Then type source ~/.bash_profile in the console.
You should be good to go after this. If you type pyspark in the console, a Jupyter notebook will fire up.
You can also check that the object sqlConnector is available in your notebook by typing sqlConnector and executing the notebook cell.
Answer 4:
Save yourself a lot of configuration headaches and just run a Docker container: https://hub.docker.com/r/jupyter/all-spark-notebook/
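For example (the port mapping and the working-directory mount are assumptions; adjust them to your setup):
docker run -it --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/all-spark-notebook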
Answer 5:
Download Spark from the website; I have downloaded spark-2.2.0-bin-hadoop2.7 and jupyter-notebook.
mak@mak-Aspire-A515-51G:~$ chmod -R 777 spark-2.2.0-bin-hadoop2.7
mak@mak-Aspire-A515-51G:~$ export SPARK_HOME='/home/mak/spark-2.2.0-bin-hadoop2.7'
mak@mak-Aspire-A515-51G:~$ export PATH=$SPARK_HOME:$PATH
mak@mak-Aspire-A515-51G:~$ export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
mak@mak-Aspire-A515-51G:~$ export PYSPARK_DRIVER_PYTHON="jupyter"
mak@mak-Aspire-A515-51G:~$ export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
mak@mak-Aspire-A515-51G:~$ export PYSPARK_PYTHON=python3
Go to the Spark directory, open python3, and import pyspark; the import will succeed.
mak@mak-Aspire-A515-51G:~/spark-2.2.0-bin-hadoop2.7/python$ python3
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
mak@mak-Aspire-A515-51G:~/spark-2.2.0-bin-hadoop2.7/python$ jupyter-notebook
Then, in a notebook cell:
import pyspark
If you want to open Jupyter from outside the Spark directory, you need to follow the steps below:
mak@mak-Aspire-A515-51G:~$ pip3 install findspark
mak@mak-Aspire-A515-51G:~$ python
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyspark'
>>> import findspark
>>> findspark.init('/home/mak/spark-2.2.0-bin-hadoop2.7')
>>> import pyspark
mak@mak-Aspire-A515-51G:~$ jupyter-notebook
Then, in a notebook cell:
import findspark
findspark.init('/home/mak/spark-2.2.0-bin-hadoop2.7')
import pyspark
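From there you can build a session and run a small job to confirm the install above is picked up (a minimal sketch; the app name is arbitrary):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sanity-check").getOrCreate()
spark.range(10).show()   # prints ids 0 through 9
spark.stop()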
Source: https://stackoverflow.com/questions/48915274/how-do-i-run-pyspark-with-jupyter-notebook