Spark Redshift with Python

梦如初夏 2021-01-03 10:39

I'm trying to connect Spark with Amazon Redshift, but I'm getting this error:

My code is as follows:

    from pyspark.sql import SQLContext

6 Answers
  • 2021-01-03 11:23

    I think you need to add .format("com.databricks.spark.redshift") to your sql_context.read call; my hunch is that Spark can't infer the format for this data source, so you need to explicitly specify that the spark-redshift connector should be used.

    For more detail on this error, see https://github.com/databricks/spark-redshift/issues/230
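A minimal sketch of the suggested fix; every value below is a placeholder, not taken from the question:

```python
# Options for the spark-redshift reader; all values are placeholders.
read_options = {
    "url": "jdbc:redshift://HOST:5439/DB?user=USER&password=PASSWORD",
    "dbtable": "my_table",
    "tempdir": "s3n://my-bucket/tmp/",
}

# With a live SQLContext, the read would look like this; .format() is
# the piece the question's code was presumably missing:
# df = (sql_context.read
#       .format("com.databricks.spark.redshift")
#       .options(**read_options)
#       .load())
```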

  • 2021-01-03 11:25

    If you are using Databricks, I think you don't have to create a new SQLContext, because one is created for you; just use sqlContext. Try this code:

    from pyspark.sql import SQLContext

    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

    df = sqlContext.read \ .......
    

    If that still fails, maybe the bucket is not mounted:

    dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
    
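The ENCODED_SECRET_KEY above matters when the secret key contains a "/", which would otherwise break the s3a URL; a sketch of that encoding step using the standard library (all key values are made up):

```python
from urllib.parse import quote

ACCESS_KEY = "AKIAEXAMPLE"     # placeholder
SECRET_KEY = "abc/def+ghi"     # placeholder secret containing "/"

# Percent-encode the secret so "/" cannot split the s3a URL
ENCODED_SECRET_KEY = quote(SECRET_KEY, safe="")

source = "s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, "my-bucket")
```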
  • 2021-01-03 11:33

    If you are using Spark 2.0.4 and running your code on an AWS EMR cluster, follow these steps:

    1) Download the Redshift JDBC jar with the command below:

    wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar
    

    Reference: AWS documentation

    2) Copy the code below into a Python file, replacing the placeholder values with your own AWS resources:

    import pyspark
    from pyspark.sql import SQLContext
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    spark._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "access key")
    spark._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "secret access key")
    
    sqlCon = SQLContext(spark.sparkContext)
    df = sqlCon.createDataFrame([
        (1, "A", "X1"),
        (2, "B", "X2"),
        (3, "B", "X3"),
        (1, "B", "X3"),
        (2, "C", "X2"),
        (3, "C", "X2"),
        (1, "C", "X1"),
        (1, "B", "X1"),
    ], ["ID", "TYPE", "CODE"])
    
    df.write \
      .format("com.databricks.spark.redshift") \
      .option("url", "jdbc:redshift://HOST_URL:5439/DATABASE_NAME?user=USERID&password=PASSWORD") \
      .option("dbtable", "TABLE_NAME") \
      .option("aws_region", "us-west-1") \
      .option("tempdir", "s3://BUCKET_NAME/PATH/") \
      .mode("error") \
      .save()
    

    3) Run the following spark-submit command:

    spark-submit --name "App Name" --jars RedshiftJDBC4-no-awssdk-1.2.20.1043.jar --packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4 --py-files python_script.py python_script.py
    

    Notes:

    1) The public IP address of the EMR node (on which the spark-submit job runs) must be allowed in the inbound rules of the Redshift cluster's security group.

    2) The Redshift cluster and the S3 location used for "tempdir" must be in the same region. In the example above, both resources are in us-west-1.

    3) If the data is sensitive, make sure all channels are secured. To secure the connections, follow the steps described under Configuration in the connector's documentation.
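The region match in note 2 can be sanity-checked from the cluster endpoint itself; a small hypothetical helper, assuming the standard Redshift endpoint naming scheme:

```python
def redshift_region(host):
    """Extract the region from a standard Redshift endpoint hostname,
    e.g. mycluster.abc123.us-west-1.redshift.amazonaws.com -> us-west-1.
    Standard form: <cluster>.<id>.<region>.redshift.amazonaws.com"""
    parts = host.split(".")
    if len(parts) < 6 or parts[3] != "redshift":
        raise ValueError("not a standard Redshift endpoint: %r" % host)
    return parts[2]
```

Compare the result against the region of the "tempdir" bucket before running the job.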

  • 2021-01-03 11:35

    The error is due to missing dependencies.

    Verify that you have these jar files in the spark home directory:

    1. spark-redshift_2.10-3.0.0-preview1.jar
    2. RedshiftJDBC41-1.1.10.1010.jar
    3. hadoop-aws-2.7.1.jar
    4. aws-java-sdk-1.7.4.jar
    5. aws-java-sdk-s3-1.11.60.jar (a newer version, but not everything worked with it)

    Put these jar files in $SPARK_HOME/jars/ and then start Spark:

    pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar
    

    (On macOS with Homebrew, SPARK_HOME is typically "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)

    This will run Spark with all necessary dependencies. Note that you also need to set the authentication option 'forward_spark_s3_credentials'=True if you are using AWS access keys.

    from pyspark.sql import SQLContext
    from pyspark import SparkContext
    
    sc = SparkContext(appName="Connect Spark with Redshift")
    sql_context = SQLContext(sc)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)
    
    df = sql_context.read \
         .format("com.databricks.spark.redshift") \
         .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
         .option("dbtable", "table_name") \
         .option('forward_spark_s3_credentials',True) \
         .option("tempdir", "s3n://bucket") \
         .load()
    

    Common errors afterwards are:

    • Redshift connection error: "SSL off"
      • Solution: append the ssl parameters to the JDBC URL with "&", not a second "?": .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
    • S3 error: when unloading the data, e.g. after df.show(), you get the message "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
      • Solution: the bucket and the cluster must be in the same region.
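Assembling the JDBC URL with the standard library avoids the easy mistake of introducing a second "?"; a sketch with a hypothetical helper (host, database, and credentials are placeholders):

```python
from urllib.parse import urlencode

def build_redshift_url(host, port, db, **params):
    """Assemble a Redshift JDBC URL; keyword args become query parameters,
    joined with a single "?" and "&" separators."""
    return "jdbc:redshift://%s:%d/%s?%s" % (host, port, db, urlencode(params))

url = build_redshift_url(
    "example.coyf2i236wts.eu-central-1.redshift.amazonaws.com", 5439, "agcdb",
    user="user", password="pwd", ssl="true",
    sslfactory="org.postgresql.ssl.NonValidatingFactory")
```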
  • 2021-01-03 11:37

    Here is a step-by-step process for connecting to Redshift.

    • Download the Redshift connector jar with the command below:
    wget "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.2.1.1001.jar"
    
    • Save the code below in a Python file (the .py you want to run) and replace the credentials accordingly.
    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession, HiveContext

    # initialize the spark session
    spark = SparkSession.builder.master("yarn").appName("Connect to redshift").enableHiveSupport().getOrCreate()
    sc = spark.sparkContext
    sqlContext = HiveContext(sc)
    
    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSKEYID>")
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEYSECTRET>")
    
    
    taxonomyDf = sqlContext.read \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
        .option("dbtable", "table_name") \
        .option("tempdir", "s3://mybucket/") \
        .load() 
    
    • Run spark-submit as below:
    spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar test.py
    
  • 2021-01-03 11:39

    I think the s3n:// URL style has been deprecated and/or removed.

    Try defining your keys as "fs.s3.awsAccessKeyId".
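Following that suggestion, the credential config keys must match the URL scheme actually used in tempdir; a small hypothetical helper makes the pairing explicit:

```python
def s3_credential_keys(scheme):
    """Return the Hadoop credential config keys for an S3 filesystem
    scheme ("s3", "s3n", or "s3a")."""
    if scheme not in ("s3", "s3n", "s3a"):
        raise ValueError("unknown S3 scheme: %r" % scheme)
    return ("fs.%s.awsAccessKeyId" % scheme,
            "fs.%s.awsSecretAccessKey" % scheme)

# e.g. for tempdir "s3://bucket/path", set each returned key via
# sc._jsc.hadoopConfiguration().set(key, value)
```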
