I'm trying to connect Spark with Amazon Redshift, but I'm getting this error:
My code is as follows:
from pyspark.sql import SQLContext
...
I think you need to add .format("com.databricks.spark.redshift")
to your sql_context.read
call; my hunch is that Spark can't infer the format for this data source, so you need to explicitly tell it to use the spark-redshift
connector.
For more detail on this error, see https://github.com/databricks/spark-redshift/issues/230
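A minimal sketch of what that read call might look like (the host, database, table, and bucket names are placeholders):
# explicitly name the spark-redshift data source so Spark does not try to infer it
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://HOST:5439/DATABASE?user=USER&password=PASSWORD") \
    .option("dbtable", "TABLE_NAME") \
    .option("tempdir", "s3n://YOUR_BUCKET/tmp/") \
    .load()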
If you are using Databricks, I think you don't have to create a new SQLContext, because Databricks does that for you; just use the provided sqlContext. Try this code:
from pyspark.sql import SQLContext
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
df = sqlContext.read \
    ...
Maybe the S3 bucket is not mounted:
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
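After mounting, a quick sanity check (assuming the same MOUNT_NAME variable as above) is to list the mount point from the notebook:
# list the mounted bucket to confirm the mount succeeded
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))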
If you are using Spark 2.0.4 and running your code on an AWS EMR cluster, then please follow the steps below:
1) Download the Redshift JDBC jar using the command below:
wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar
Reference: AWS documentation
2) Copy the code below into a Python file, then replace the required values with your own AWS resources:
import pyspark
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# S3 credentials for the temporary directory used by the Redshift connector
spark._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "access key")
spark._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "secret access key")
sqlCon = SQLContext(spark.sparkContext)
df = sqlCon.createDataFrame([
(1, "A", "X1"),
(2, "B", "X2"),
(3, "B", "X3"),
(1, "B", "X3"),
(2, "C", "X2"),
(3, "C", "X2"),
(1, "C", "X1"),
(1, "B", "X1"),
], ["ID", "TYPE", "CODE"])
df.write \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://HOST_URL:5439/DATABASE_NAME?user=USERID&password=PASSWORD") \
.option("dbtable", "TABLE_NAME") \
.option("aws_region", "us-west-1") \
.option("tempdir", "s3://BUCKET_NAME/PATH/") \
.mode("error") \
.save()
3) Run the spark-submit command below:
spark-submit --name "App Name" --jars RedshiftJDBC4-no-awssdk-1.2.20.1043.jar --packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4 --py-files python_script.py python_script.py
Note:
1) The public IP address of the EMR node (on which the spark-submit job will run) must be allowed in the inbound rules of the Redshift cluster's security group.
2) The Redshift cluster and the S3 location used for "tempdir" must be in the same region. In the example above, both resources are in us-west-1.
3) If the data is sensitive, make sure to secure all the channels. To make the connections secure, follow the steps described under the Configuration section of the spark-redshift documentation.
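For example, a minimal sketch of securing the JDBC channel itself is to require SSL in the connection URL (the non-validating SSL factory skips certificate verification; use a validating factory if you need it):
.option("url", "jdbc:redshift://HOST_URL:5439/DATABASE_NAME?user=USERID&password=PASSWORD&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory") \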
The error is due to missing dependencies.
Verify that you have these jar files in the Spark home directory:
spark-redshift_2.10-3.0.0-preview1.jar
RedshiftJDBC41-1.1.10.1010.jar
hadoop-aws-2.7.1.jar
aws-java-sdk-s3-1.11.60.jar
aws-java-sdk-1.7.4.jar
Put these jar files in $SPARK_HOME/jars/ and then start Spark:
pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar
(For a Homebrew install, SPARK_HOME should be "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)
This will run Spark with all the necessary dependencies. Note that you also need to set the authentication option 'forward_spark_s3_credentials' to True if you are using AWS access keys:
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://example.coyf2i236wts.eu-central- 1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
.option("dbtable", "table_name") \
.option("forward_spark_s3_credentials", True) \
.option("tempdir", "s3n://bucket") \
.load()
A common error afterwards is an SSL-related connection failure; it can be fixed by requiring SSL in the JDBC URL:
.option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
Here is a step-by-step process for connecting to Redshift.
1) Download the Redshift JDBC driver:
wget "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.2.1.1001.jar"
2) Put the following code in a Python file (test.py in this example):
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

# initialize the spark session
spark = SparkSession.builder.master("yarn").appName("Connect to redshift").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)

# S3 credentials for the temporary directory used by the connector
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSKEYID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEYSECRET>")
taxonomyDf = sqlContext.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
.option("dbtable", "table_name") \
.option("tempdir", "s3://mybucket/") \
.load()
3) Run the spark-submit command with the connector package and the JDBC driver:
spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar test.py
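If you also need to write a DataFrame back to Redshift with the same connector, the pattern is symmetrical; a sketch with placeholder host, table, and bucket names:
taxonomyDf.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://HOST:5439/DATABASE?user=USER&password=PASSWORD") \
    .option("dbtable", "output_table_name") \
    .option("tempdir", "s3://mybucket/") \
    .mode("error") \
    .save()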
I think the s3n:// URL style has been deprecated and/or removed. Try defining your keys as "fs.s3.awsAccessKeyId" instead.
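A sketch of that change, keeping the Hadoop configuration keys and the tempdir scheme consistent (the bucket name is a placeholder):
# use the fs.s3 keys instead of fs.s3n
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEY>")
# ...and a matching scheme for the temporary directory:
# .option("tempdir", "s3://bucket")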