Using pyspark to connect to PostgreSQL

2020-12-01 04:50

I am trying to connect to a database with pyspark and I am using the following code:

sqlctx = SQLContext(sc)
df = sqlctx.load(
    url = "jdbc:postgresql         


        

10 Answers
  • 2020-12-01 05:19

    I had trouble getting a connection to the PostgreSQL database with the jars I had on my machine. This code solved my problem with the driver:

    import os
    from pyspark.sql import SparkSession

    # The --packages argument must be in place before the SparkSession (and the
    # JVM behind it) is created, so Spark can fetch the PostgreSQL JDBC driver.
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'

    spark = SparkSession \
        .builder \
        .getOrCreate()

    df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "yourtablename") \
        .option("user", "postgres") \
        .option("password", "***") \
        .load()

    df.show()
    
  • 2020-12-01 05:19

    Just initialize pyspark with --jars <path/to/your/jdbc.jar>

    E.g.: pyspark --jars /path/Downloads/postgresql-42.2.16.jar

    Then create a DataFrame as suggested in the other answers.

    E.g.:

    df2 = spark.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/db").option("dbtable", "yourTableHere").option("user", "postgres").option("password", "postgres").option("driver", "org.postgresql.Driver").load()
    
  • 2020-12-01 05:22

    I also hit this error:

    java.sql.SQLException: No suitable driver
     at java.sql.DriverManager.getDriver(Unknown Source)

    Adding .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') to the SparkSession builder fixed it, like this:

    from pyspark.sql import SparkSession
    
    spark = SparkSession \
        .builder \
        .appName('Python Spark Postgresql') \
        .config("spark.jars", "./postgresql-42.2.18.jar") \
        .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') \
        .getOrCreate()
    
    
    df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/abc") \
        .option("dbtable", 'tablename') \
        .option("user", "postgres") \
        .option("password", "1") \
        .load()
    
    df.printSchema()
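
    If you only need part of a table, a subquery can be pushed down to PostgreSQL instead of loading everything; the table and LIMIT below are only placeholders:

    df_small = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/abc") \
        .option("dbtable", "(SELECT * FROM tablename LIMIT 100) AS t") \
        .option("user", "postgres") \
        .option("password", "1") \
        .load()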
    
  • 2020-12-01 05:26

    Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html

    Then replace the database configuration values with your own.

    from pyspark.sql import SparkSession
    
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
        .getOrCreate()
    
    df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/databasename") \
        .option("dbtable", "tablename") \
        .option("user", "username") \
        .option("password", "password") \
        .option("driver", "org.postgresql.Driver") \
        .load()
    
    df.printSchema()
    

    More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
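
    Writing back to PostgreSQL goes through the same jdbc format; here is a minimal sketch, assuming an output table name and append mode:

    df.write \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/databasename") \
        .option("dbtable", "output_tablename") \
        .option("user", "username") \
        .option("password", "password") \
        .option("driver", "org.postgresql.Driver") \
        .mode("append") \
        .save()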
