I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our RS cluster. I found some very spartan documentation here for the capability…
You first need to download the Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/
You can either define the SPARK_CLASSPATH environment variable in .bashrc, conf/spark-env.sh, or a similar file, or set it in the script before you run your IPython notebook.
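For example, if you create the SparkContext yourself inside the notebook, a rough sketch of the env-var approach from Python looks like this (the jar path is a placeholder; note that SPARK_CLASSPATH is deprecated in newer Spark releases in favor of spark.driver.extraClassPath, shown below):

import os
from pyspark import SparkContext

# Hypothetical jar location; must be set before the SparkContext (and its JVM) is started
os.environ["SPARK_CLASSPATH"] = "/path/to/file/postgresql-9.4-1201.jdbc41.jar"

sc = SparkContext(appName="redshift-s3-join")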
You can also define it in your conf/spark-defaults.conf in the following way:
spark.driver.extraClassPath /path/to/file/postgresql-9.4-1201.jdbc41.jar
Make sure it is reflected in the Environment tab of your Spark WebUI.
You will also need to set the appropriate AWS credentials so Spark can read from S3, in the following way:
sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")