How to run multiple instances of Spark 2.0 at once (in multiple Jupyter Notebooks)?


I have a script which conveniently allows me to use Spark in a Jupyter Notebook. This is great, except when I run spark commands in a second notebook (for instance to test o…

1 Answer

    By default Spark's Hive integration stores its metastore (the metadata describing your databases and tables) in Derby, a lightweight embedded database. Derby can only serve one Spark instance at a time, so when you start a second notebook and begin running Spark commands, that instance crashes.

    To get around this, you can point Spark's Hive metastore at Postgres instead of Derby.

    Install Postgres with Homebrew (brew install postgres), if you do not have it installed already.
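
    A minimal sketch of that step, assuming Homebrew on macOS; the exact formula name depends on your Homebrew version:

    # Install Postgres via Homebrew and start it as a background service
    # (formula name assumed; newer Homebrew may use a versioned postgresql@XX)
    brew install postgresql
    brew services start postgresql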

    Then download postgresql-9.4.1212.jar (assuming you are running Java 1.8, a.k.a. Java 8) from https://jdbc.postgresql.org/download.html

    Move this .jar file to the /libexec/jars/ directory for your Spark installation.

    ex: /usr/local/Cellar/apache-spark/2.0.1/

    (On a Mac you can find where Spark is installed by running brew info apache-spark on the command line.)
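
    A hedged sketch of the download-and-copy step; the direct download URL and the Homebrew version path (2.0.1 here) are assumptions to adjust for your own installation:

    # Fetch the Postgres JDBC driver (check https://jdbc.postgresql.org/download.html
    # for the build that matches your Java version)
    curl -O https://jdbc.postgresql.org/download/postgresql-9.4.1212.jar

    # Copy it into Spark's jars directory (path assumes Homebrew's apache-spark 2.0.1)
    cp postgresql-9.4.1212.jar /usr/local/Cellar/apache-spark/2.0.1/libexec/jars/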

    Next create hive-site.xml in the /libexec/conf directory for your Spark installation.

    ex: /usr/local/Cellar/apache-spark/2.0.1/libexec/conf

    This can be done through a text editor - just save the file with a '.xml' extension.

    hive-site.xml should contain the following text:

    <?xml version="1.0"?>
    <configuration>

      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>mypassword</value>
      </property>

    </configuration>

    'hive' and 'mypassword' can be replaced with whatever makes sense to you, but they must match the values used in the next step.

    Finally, create a matching user and password in Postgres: on the command line, run the following commands:

    psql
    CREATE USER hive;
    ALTER ROLE hive WITH PASSWORD 'mypassword';
    CREATE DATABASE hive_metastore;
    GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive;
    \q
    

    That's it, you're done. Spark should now run in multiple Jupyter Notebooks simultaneously.
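
    If you want to sanity-check the new metastore, one optional sketch (assuming the database and user names above, and that local psql authentication allows the connection) is to run a Spark SQL command in a notebook and then confirm that Hive's metastore tables were created in Postgres:

    # List the tables Hive created in the Postgres metastore database
    psql -U hive -d hive_metastore -c '\dt'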
