How to run multiple instances of Spark 2.0 at once (in multiple Jupyter Notebooks)?


I have a script which conveniently allows me to use Spark in a Jupyter Notebook. This is great, except when I run spark commands in a second notebook (for instance to test o…

1 Answer

    By default Spark's Hive integration stores its metastore (the metadata describing your databases and tables) in Derby, a lightweight embedded database. Derby can only serve one Spark instance at a time, so when you start a second notebook and begin running Spark commands, that instance crashes.

    To get around this, you can point Spark's Hive metastore at Postgres instead of Derby.

    Install Postgres with Homebrew (brew install postgres), if you do not have it installed already.
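
    A minimal sketch of that step, assuming Homebrew on macOS; the exact formula name depends on your Homebrew version:

    # Install Postgres via Homebrew and start it as a background service
    # (formula name assumed; newer Homebrew may use a versioned postgresql@XX)
    brew install postgresql
    brew services start postgresql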

    Then download postgresql-9.4.1212.jar (assuming you are running Java 1.8, a.k.a. Java 8) from https://jdbc.postgresql.org/download.html

    Move this .jar file to the /libexec/jars/ directory for your Spark installation.

    ex: /usr/local/Cellar/apache-spark/2.0.1/

    (On a Mac you can find where Spark is installed by running brew info apache-spark on the command line.)
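
    A hedged sketch of the download-and-copy step; the direct download URL and the Homebrew version path (2.0.1 here) are assumptions to adjust for your own installation:

    # Fetch the Postgres JDBC driver (check https://jdbc.postgresql.org/download.html
    # for the build that matches your Java version)
    curl -O https://jdbc.postgresql.org/download/postgresql-9.4.1212.jar

    # Copy it into Spark's jars directory (path assumes Homebrew's apache-spark 2.0.1)
    cp postgresql-9.4.1212.jar /usr/local/Cellar/apache-spark/2.0.1/libexec/jars/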

    Next create hive-site.xml in the /libexec/conf directory for your Spark installation.

    ex: /usr/local/Cellar/apache-spark/2.0.1/libexec/conf

    This can be done through a text editor - just save the file with a '.xml' extension.

    hive-site.xml should contain the following text:

    <?xml version="1.0"?>
    <configuration>

      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>mypassword</value>
      </property>

    </configuration>

    'hive' and 'mypassword' can be replaced with whatever makes sense to you, but they must match the values used in the next step.

    Finally, create a matching user and password in Postgres: on the command line, run the following commands:

    psql
    CREATE USER hive;
    ALTER ROLE hive WITH PASSWORD 'mypassword';
    CREATE DATABASE hive_metastore;
    GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive;
    \q
    

    That's it, you're done. Spark should now run in multiple Jupyter Notebooks simultaneously.
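
    If you want to sanity-check the new metastore, one optional sketch (assuming the database and user names above, and that local psql authentication allows the connection) is to run a Spark SQL command in a notebook and then confirm that Hive's metastore tables were created in Postgres:

    # List the tables Hive created in the Postgres metastore database
    psql -U hive -d hive_metastore -c '\dt'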
