I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our Redshift cluster. I found only some very spartan documentation for this capability.
The simplest way to make a JDBC-style connection to Redshift from Python is as follows:
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession

jdbc_url = "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx"
jdbc_user = "xxx"
jdbc_password = "xxx"
# Data source format provided by the spark-redshift library (not a plain JDBC driver class)
redshift_format = "com.databricks.spark.redshift"

spark = SparkSession.builder.master("yarn") \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport().getOrCreate()

# Read data from a query; spark-redshift stages the data through S3,
# so a tempdir (placeholder path below) is required
df = spark.read \
    .format(redshift_format) \
    .option("url", jdbc_url + "?user=" + jdbc_user + "&password=" + jdbc_password) \
    .option("tempdir", "s3n://your-bucket/tmp/") \
    .option("query", "your query") \
    .load()
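Since the original goal was to join data on S3 with data in Redshift, the resulting DataFrame can simply be joined with files read straight from S3. A minimal sketch, assuming a hypothetical bucket path and join column:
# Hypothetical S3 location and join key, purely for illustration
s3_df = spark.read.parquet("s3://your-bucket/some-prefix/")
joined = df.join(s3_df, on="join_key", how="inner")
joined.show(10)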
This worked for me in Scala in AWS Glue with Spark 2.4:
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

// args comes from GlueArgParser.getResolvedOptions, as usual in a Glue job
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
Job.init(args("JOB_NAME"), glueContext, args.asJava)

// Plain JDBC read; Redshift accepts connections through its PostgreSQL-compatible endpoint
val sqlContext = new org.apache.spark.sql.SQLContext(spark)
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql://HOST:PORT/DBNAME?user=USERNAME&password=PASSWORD",
      "dbtable" -> "(SELECT a.row_name FROM schema_name.table_name a) AS from_redshift")).load()

// back to a DynamicFrame
val datasource0 = DynamicFrame(jdbcDF, glueContext)
Works with any SQL query.
It turns out you only need a username/password to access Redshift in Spark, and it is done as follows (using the Python API):
from pyspark.sql import SQLContext

# sc is the already-running SparkContext
sqlContext = SQLContext(sc)
df = sqlContext.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret") \
    .option("dbtable", "schema.table") \
    .load()
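Equivalently, the credentials can be passed as connection properties rather than in the URL. A small sketch; the host, table, and driver class are placeholders:
props = {
    "user": "yourusername",
    "password": "secret",
    "driver": "org.postgresql.Driver",  # PostgreSQL JDBC driver class
}
df = sqlContext.read.jdbc(
    url="jdbc:postgresql://host:port/dbserver",
    table="schema.table",
    properties=props,
)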
Hope this helps someone!
Although this is a very old post, for anyone still looking for an answer, the steps below worked for me!
Start the shell with the JDBC driver jar on the classpath:
bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
Create a DataFrame with the appropriate connection details:
# With the PostgreSQL driver jar, use the jdbc:postgresql:// URL scheme;
# the jdbc:redshift:// scheme requires the Amazon Redshift JDBC driver jar instead
myDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/db_name") \
    .option("dbtable", "table_name") \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()
Spark Version: 2.2
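If only a subset of the table is needed, a subquery can be used in place of the table name so the filtering happens on the Redshift side. A sketch with assumed column and table names:
# Hypothetical pushdown query; columns and table are placeholders
pushdown_query = "(SELECT col_a, col_b FROM schema_name.table_name WHERE col_a > 0) AS subq"
filtered_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/db_name") \
    .option("dbtable", pushdown_query) \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()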
You first need to download the Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/
You can either define the SPARK_CLASSPATH environment variable in .bashrc, conf/spark-env.sh, or a similar file, or specify it in the script before you run your IPython notebook.
You can also define it in your conf/spark-defaults.conf in the following way:
spark.driver.extraClassPath /path/to/file/postgresql-9.4-1201.jdbc41.jar
Make sure it is reflected in the Environment tab of your Spark WebUI.
You will also need to set appropriate AWS credentials in the following way:
sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")
If you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel.
If you still want to use JDBC, check out the new built-in JDBC data source in Spark 1.4+.
Disclosure: I'm one of the authors of spark-redshift.
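For reference, a minimal sketch of the spark-redshift usage pattern; the URL, table names, and tempdir below are placeholders:
# Read a Redshift table; the data is unloaded in parallel through the S3 tempdir
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://host:5439/dbname?user=xxx&password=xxx") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3n://your-bucket/tmp/") \
    .load()

# Write a DataFrame back to Redshift through the same staging directory
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://host:5439/dbname?user=xxx&password=xxx") \
    .option("dbtable", "my_table_copy") \
    .option("tempdir", "s3n://your-bucket/tmp/") \
    .mode("error") \
    .save()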