I have a script which conveniently allows me to use Spark in a Jupyter Notebook. This is great, except when I run Spark commands in a second notebook (for instance to test o
By default, Spark's Hive support keeps its metastore (the catalog of databases and tables) in Derby, a lightweight embedded database. An embedded Derby database can be opened by only one process at a time, so when you start a second notebook and run Spark commands, it crashes.
To get around this you can connect Spark's Hive installation to Postgres instead of Derby.
Install Postgres with brew install postgres, if you do not have it installed already.
Then download postgresql-9.4.1212.jar (assuming you are running Java 1.8, aka Java 8) from https://jdbc.postgresql.org/download.html
Move this .jar file to the libexec/jars/ directory of your Spark installation. For example, if Spark is installed at /usr/local/Cellar/apache-spark/2.0.1/, the target is /usr/local/Cellar/apache-spark/2.0.1/libexec/jars/
(On a Mac you can find where Spark is installed by typing brew info apache-spark in the command line.)
Next, create hive-site.xml in the libexec/conf directory of your Spark installation.
ex: /usr/local/Cellar/apache-spark/2.0.1/libexec/conf
This can be done in any text editor; just save the file with a '.xml' extension.
hive-site.xml should contain the following text:
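<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name> <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name> <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name> <value>mypassword</value>
  </property>
</configuration>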
'hive' and 'mypassword' can be replaced with whatever makes sense to you, but they must match the user and password you create in the next step.
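If you would rather generate the file than type it by hand, here is a minimal Python sketch that builds the same four properties using only the standard library. The function name build_hive_site is my own invention for illustration, not part of Spark or Hive, and the values passed in are the example ones from above.

```python
import xml.etree.ElementTree as ET


def build_hive_site(url, driver, user, password):
    """Return hive-site.xml text holding the four JDBC metastore properties.

    Hypothetical helper for illustration; not part of Spark or Hive.
    """
    settings = {
        "javax.jdo.option.ConnectionURL": url,
        "javax.jdo.option.ConnectionDriverName": driver,
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent only exists on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_hive_site(
        "jdbc:postgresql://localhost:5432/hive_metastore",
        "org.postgresql.Driver",
        "hive",
        "mypassword",
    ))
```

Redirect the printed output into hive-site.xml in the libexec/conf directory. Keep in mind the password ends up in the file as plain text either way.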
Finally, create a user and password in Postgres. In the command line, run the following commands:
psql
CREATE USER hive;
ALTER ROLE hive WITH PASSWORD 'mypassword';
CREATE DATABASE hive_metastore;
GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive;
\q
That's it, you're done. Spark should now run in multiple Jupyter Notebooks simultaneously.
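If a second notebook still crashes, it is worth checking that the files from the steps above ended up where Spark looks for them. The following is a rough Python sketch of such a check, assuming the Homebrew layout used in this post (libexec/jars and libexec/conf under the Spark installation). The function name check_metastore_setup is hypothetical, and it only inspects files on disk; it does not connect to Postgres or start Spark.

```python
import glob
import os
import xml.etree.ElementTree as ET


def check_metastore_setup(spark_home):
    """Return a list of problems found under spark_home (empty if none).

    Hypothetical helper; assumes the libexec/jars and libexec/conf layout.
    """
    problems = []

    # The JDBC driver jar should sit in libexec/jars.
    jar_pattern = os.path.join(spark_home, "libexec", "jars", "postgresql-*.jar")
    if not glob.glob(jar_pattern):
        problems.append("no PostgreSQL JDBC jar in libexec/jars")

    # hive-site.xml should sit in libexec/conf.
    site_path = os.path.join(spark_home, "libexec", "conf", "hive-site.xml")
    if not os.path.isfile(site_path):
        problems.append("hive-site.xml not found in libexec/conf")
        return problems

    # The connection URL should point at Postgres rather than Derby.
    root = ET.parse(site_path).getroot()
    props = {p.findtext("name"): p.findtext("value")
             for p in root.findall("property")}
    url = props.get("javax.jdo.option.ConnectionURL") or ""
    if not url.startswith("jdbc:postgresql:"):
        problems.append("ConnectionURL does not point at PostgreSQL")
    return problems


if __name__ == "__main__":
    # Example path from this post; replace with your own installation.
    for problem in check_metastore_setup("/usr/local/Cellar/apache-spark/2.0.1"):
        print(problem)
```

An empty result only means the files are in place. A malformed hive-site.xml will raise a parse error here, which is itself a useful hint.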