Spark Redshift with Python

Asked by 梦如初夏 on 2021-01-03 10:39

I'm trying to connect Spark with Amazon Redshift but I'm getting this error:

My code is as follows:

from pyspark.sql import SQLContext
6 Answers
  •  执笔经年
    2021-01-03 11:37

    Here is a step-by-step process for connecting to Redshift.

    • Download the Redshift JDBC driver with the command below:
    wget "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.2.1.1001.jar"
    
    • Save the code below in a Python file (the .py you want to run) and fill in your credentials:
    from pyspark.sql import SparkSession, HiveContext
    
    # initialize the Spark session
    spark = SparkSession.builder.master("yarn").appName("Connect to redshift").enableHiveSupport().getOrCreate()
    sc = spark.sparkContext
    sqlContext = HiveContext(sc)
    
    # AWS credentials the connector uses to stage data in the S3 tempdir
    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "")
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "")
    
    taxonomyDf = sqlContext.read \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
        .option("dbtable", "table_name") \
        .option("tempdir", "s3://mybucket/") \
        .load()
    
    • Run spark-submit as below, passing the connector package and the JDBC jar you downloaded:
    spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar test.py
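    To make the connector's required options explicit, here is a small sketch (the helper name `redshift_read_options` is hypothetical, not part of any library) that collects the three options the read above passes: the JDBC URL, the source table, and the S3 staging directory. Applying them to the reader is shown commented out, since that part needs a live Spark session:

    ```python
    # Hypothetical helper mirroring the options used in the snippet above.
    def redshift_read_options(jdbc_url, table, temp_s3_dir):
        """Return the option map for a com.databricks.spark.redshift read."""
        return {
            "url": jdbc_url,        # jdbc:postgresql://<endpoint>:<port>/<db>?user=..&password=..
            "dbtable": table,       # source table; a "query" option can be used instead
            "tempdir": temp_s3_dir, # S3 directory Redshift UNLOADs into and Spark reads from
        }

    opts = redshift_read_options(
        "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "table_name",
        "s3://mybucket/",
    )

    # With a live session, the options would be applied like this:
    # reader = sqlContext.read.format("com.databricks.spark.redshift")
    # for key, value in opts.items():
    #     reader = reader.option(key, value)
    # taxonomyDf = reader.load()
    ```

    Keeping the options in one place makes it easier to swap `dbtable` for `query`, or to point `tempdir` at a different bucket, without touching the reader chain itself.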
    
