I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our Redshift cluster. I found only some very spartan documentation for this capability.
The simplest way to make a JDBC-style connection to Redshift from Python is as follows:
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession

jdbc_url = "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx"
jdbc_user = "xxx"
jdbc_password = "xxx"
# Data source format provided by the spark-redshift library (not a plain JDBC driver class)
redshift_format = "com.databricks.spark.redshift"

spark = SparkSession.builder.master("yarn") \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport().getOrCreate()

# Read data from a query; spark-redshift stages the data through S3,
# so a tempdir (placeholder path below) is required
df = spark.read \
    .format(redshift_format) \
    .option("url", jdbc_url + "?user=" + jdbc_user + "&password=" + jdbc_password) \
    .option("tempdir", "s3n://your-bucket/tmp/") \
    .option("query", "your query") \
    .load()
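Since the original goal was to join data on S3 with data in Redshift, the resulting DataFrame can simply be joined with files read straight from S3. A minimal sketch, assuming a hypothetical bucket path and join column:
# Hypothetical S3 location and join key, purely for illustration
s3_df = spark.read.parquet("s3://your-bucket/some-prefix/")
joined = df.join(s3_df, on="join_key", how="inner")
joined.show(10)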
This worked for me in Scala in AWS Glue with Spark 2.4:
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

// args comes from GlueArgParser.getResolvedOptions, as usual in a Glue job
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
Job.init(args("JOB_NAME"), glueContext, args.asJava)

// Plain JDBC read; Redshift accepts connections through its PostgreSQL-compatible endpoint
val sqlContext = new org.apache.spark.sql.SQLContext(spark)
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql://HOST:PORT/DBNAME?user=USERNAME&password=PASSWORD",
      "dbtable" -> "(SELECT a.row_name FROM schema_name.table_name a) AS from_redshift")).load()

// back to a DynamicFrame
val datasource0 = DynamicFrame(jdbcDF, glueContext)
Works with any SQL query.
It turns out you only need a username/password to access Redshift in Spark, and it is done as follows (using the Python API):
from pyspark.sql import SQLContext

# sc is the already-running SparkContext
sqlContext = SQLContext(sc)
df = sqlContext.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret") \
    .option("dbtable", "schema.table") \
    .load()
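Equivalently, the credentials can be passed as connection properties rather than in the URL. A small sketch; the host, table, and driver class are placeholders:
props = {
    "user": "yourusername",
    "password": "secret",
    "driver": "org.postgresql.Driver",  # PostgreSQL JDBC driver class
}
df = sqlContext.read.jdbc(
    url="jdbc:postgresql://host:port/dbserver",
    table="schema.table",
    properties=props,
)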
Hope this helps someone!
Although this is a very old post, for anyone still looking for an answer, the steps below worked for me!
Start the shell with the JDBC driver jar on the classpath:
bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
Create a DataFrame with the appropriate connection details:
# With the PostgreSQL driver jar, use the jdbc:postgresql:// URL scheme;
# the jdbc:redshift:// scheme requires the Amazon Redshift JDBC driver jar instead
myDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/db_name") \
    .option("dbtable", "table_name") \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()
Spark Version: 2.2
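If only a subset of the table is needed, a subquery can be used in place of the table name so the filtering happens on the Redshift side. A sketch with assumed column and table names:
# Hypothetical pushdown query; columns and table are placeholders
pushdown_query = "(SELECT col_a, col_b FROM schema_name.table_name WHERE col_a > 0) AS subq"
filtered_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/db_name") \
    .option("dbtable", pushdown_query) \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()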
You first need to download the Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/
You can either define the SPARK_CLASSPATH environment variable in .bashrc, conf/spark-env.sh, or a similar file, or specify it in the script before you run your IPython notebook.
You can also define it in your conf/spark-defaults.conf in the following way:
spark.driver.extraClassPath /path/to/file/postgresql-9.4-1201.jdbc41.jar
Make sure it is reflected in the Environment tab of your Spark WebUI.
You will also need to set appropriate AWS credentials in the following way:
sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")
If you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel.
If you still want to use JDBC, check out the new built-in JDBC data source in Spark 1.4+.
Disclosure: I'm one of the authors of spark-redshift.
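For reference, a minimal sketch of the spark-redshift usage pattern; the URL, table names, and tempdir below are placeholders:
# Read a Redshift table; the data is unloaded in parallel through the S3 tempdir
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://host:5439/dbname?user=xxx&password=xxx") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3n://your-bucket/tmp/") \
    .load()

# Write a DataFrame back to Redshift through the same staging directory
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://host:5439/dbname?user=xxx&password=xxx") \
    .option("dbtable", "my_table_copy") \
    .option("tempdir", "s3n://your-bucket/tmp/") \
    .mode("error") \
    .save()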