Spark Redshift with Python

梦如初夏 2021-01-03 10:39

I'm trying to connect Spark with Amazon Redshift, but I'm getting this error:

My code is as follows:

    from pyspark.sql import SQLContext

6 Answers
  • 2021-01-03 11:23

    I think you need to add .format("com.databricks.spark.redshift") to your sql_context.read call; my hunch is that Spark can't infer the format for this data source, so you need to explicitly specify that the spark-redshift connector should be used.

    For more detail on this error, see https://github.com/databricks/spark-redshift/issues/230
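A minimal sketch of the suggested fix; every value below is a placeholder, not taken from the question:

```python
# Options for the spark-redshift reader; all values are placeholders.
read_options = {
    "url": "jdbc:redshift://HOST:5439/DB?user=USER&password=PASSWORD",
    "dbtable": "my_table",
    "tempdir": "s3n://my-bucket/tmp/",
}

# With a live SQLContext, the read would look like this; .format() is
# the piece the question's code was presumably missing:
# df = (sql_context.read
#       .format("com.databricks.spark.redshift")
#       .options(**read_options)
#       .load())
```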

  • 2021-01-03 11:25

    If you are using Databricks, I think you don't have to create a new SQLContext, because one is created for you; just use sqlContext. Try this code:

    from pyspark.sql import SQLContext

    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

    df = sqlContext.read \ .......
    

    If that still fails, maybe the bucket is not mounted:

    dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
    
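The ENCODED_SECRET_KEY above matters when the secret key contains a "/", which would otherwise break the s3a URL; a sketch of that encoding step using the standard library (all key values are made up):

```python
from urllib.parse import quote

ACCESS_KEY = "AKIAEXAMPLE"     # placeholder
SECRET_KEY = "abc/def+ghi"     # placeholder secret containing "/"

# Percent-encode the secret so "/" cannot split the s3a URL
ENCODED_SECRET_KEY = quote(SECRET_KEY, safe="")

source = "s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, "my-bucket")
```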
  • 2021-01-03 11:33

    If you are using Spark 2.0.4 and running your code on an AWS EMR cluster, follow these steps:

    1) Download the Redshift JDBC jar with the command below:

    wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar
    

    Reference: AWS documentation

    2) Copy the code below into a Python file, replacing the placeholder values with your own AWS resources:

    import pyspark
    from pyspark.sql import SQLContext
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    spark._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "access key")
    spark._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "secret access key")
    
    sqlCon = SQLContext(spark.sparkContext)
    df = sqlCon.createDataFrame([
        (1, "A", "X1"),
        (2, "B", "X2"),
        (3, "B", "X3"),
        (1, "B", "X3"),
        (2, "C", "X2"),
        (3, "C", "X2"),
        (1, "C", "X1"),
        (1, "B", "X1"),
    ], ["ID", "TYPE", "CODE"])
    
    df.write \
      .format("com.databricks.spark.redshift") \
      .option("url", "jdbc:redshift://HOST_URL:5439/DATABASE_NAME?user=USERID&password=PASSWORD") \
      .option("dbtable", "TABLE_NAME") \
      .option("aws_region", "us-west-1") \
      .option("tempdir", "s3://BUCKET_NAME/PATH/") \
      .mode("error") \
      .save()
    

    3) Run the following spark-submit command:

    spark-submit --name "App Name" --jars RedshiftJDBC4-no-awssdk-1.2.20.1043.jar --packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4 --py-files python_script.py python_script.py
    

    Notes:

    1) The public IP address of the EMR node (on which the spark-submit job runs) must be allowed in the inbound rules of the Redshift cluster's security group.

    2) The Redshift cluster and the S3 location used for "tempdir" must be in the same region. In the example above, both resources are in us-west-1.

    3) If the data is sensitive, make sure all channels are secured. To secure the connections, follow the steps described under Configuration in the connector's documentation.
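The region match in note 2 can be sanity-checked from the cluster endpoint itself; a small hypothetical helper, assuming the standard Redshift endpoint naming scheme:

```python
def redshift_region(host):
    """Extract the region from a standard Redshift endpoint hostname,
    e.g. mycluster.abc123.us-west-1.redshift.amazonaws.com -> us-west-1.
    Standard form: <cluster>.<id>.<region>.redshift.amazonaws.com"""
    parts = host.split(".")
    if len(parts) < 6 or parts[3] != "redshift":
        raise ValueError("not a standard Redshift endpoint: %r" % host)
    return parts[2]
```

Compare the result against the region of the "tempdir" bucket before running the job.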

  • 2021-01-03 11:35

    The error is due to missing dependencies.

    Verify that you have these jar files in the spark home directory:

    1. spark-redshift_2.10-3.0.0-preview1.jar
    2. RedshiftJDBC41-1.1.10.1010.jar
    3. hadoop-aws-2.7.1.jar
    4. aws-java-sdk-1.7.4.jar
    5. aws-java-sdk-s3-1.11.60.jar (a newer version, but not everything worked with it)

    Put these jar files in $SPARK_HOME/jars/ and then start Spark:

    pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar
    

    (On macOS with Homebrew, SPARK_HOME is typically "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)

    This will run Spark with all necessary dependencies. Note that you also need to set the authentication option 'forward_spark_s3_credentials'=True if you are using AWS access keys.

    from pyspark.sql import SQLContext
    from pyspark import SparkContext
    
    sc = SparkContext(appName="Connect Spark with Redshift")
    sql_context = SQLContext(sc)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)
    
    df = sql_context.read \
         .format("com.databricks.spark.redshift") \
         .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
         .option("dbtable", "table_name") \
         .option('forward_spark_s3_credentials',True) \
         .option("tempdir", "s3n://bucket") \
         .load()
    

    Common errors afterwards are:

    • Redshift connection error: "SSL off"
      • Solution: append the ssl parameters to the JDBC URL with "&", not a second "?": .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
    • S3 error: when unloading the data, e.g. after df.show(), you get the message "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
      • Solution: the bucket and the cluster must be in the same region.
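Assembling the JDBC URL with the standard library avoids the easy mistake of introducing a second "?"; a sketch with a hypothetical helper (host, database, and credentials are placeholders):

```python
from urllib.parse import urlencode

def build_redshift_url(host, port, db, **params):
    """Assemble a Redshift JDBC URL; keyword args become query parameters,
    joined with a single "?" and "&" separators."""
    return "jdbc:redshift://%s:%d/%s?%s" % (host, port, db, urlencode(params))

url = build_redshift_url(
    "example.coyf2i236wts.eu-central-1.redshift.amazonaws.com", 5439, "agcdb",
    user="user", password="pwd", ssl="true",
    sslfactory="org.postgresql.ssl.NonValidatingFactory")
```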
  • 2021-01-03 11:37

    Here is a step-by-step process for connecting to Redshift.

    • Download the Redshift connector jar with the command below:
    wget "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.2.1.1001.jar"
    
    • Save the code below in a Python file (the .py you want to run) and replace the credentials accordingly.
    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession, HiveContext

    # initialize the spark session
    spark = SparkSession.builder.master("yarn").appName("Connect to redshift").enableHiveSupport().getOrCreate()
    sc = spark.sparkContext
    sqlContext = HiveContext(sc)
    
    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSKEYID>")
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEYSECTRET>")
    
    
    taxonomyDf = sqlContext.read \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
        .option("dbtable", "table_name") \
        .option("tempdir", "s3://mybucket/") \
        .load() 
    
    • Run spark-submit as below:
    spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar test.py
    
  • 2021-01-03 11:39

    I think the s3n:// URL style has been deprecated and/or removed.

    Try defining your keys as "fs.s3.awsAccessKeyId".
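Following that suggestion, the credential config keys must match the URL scheme actually used in tempdir; a small hypothetical helper makes the pairing explicit:

```python
def s3_credential_keys(scheme):
    """Return the Hadoop credential config keys for an S3 filesystem
    scheme ("s3", "s3n", or "s3a")."""
    if scheme not in ("s3", "s3n", "s3a"):
        raise ValueError("unknown S3 scheme: %r" % scheme)
    return ("fs.%s.awsAccessKeyId" % scheme,
            "fs.%s.awsSecretAccessKey" % scheme)

# e.g. for tempdir "s3://bucket/path", set each returned key via
# sc._jsc.hadoopConfiguration().set(key, value)
```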
