Using pyspark to connect to PostgreSQL

逝去的感伤 2020-12-01 04:50

I am trying to connect to a database with pyspark and I am using the following code:

sqlctx = SQLContext(sc)
df = sqlctx.load(
    url = "jdbc:postgresql         
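
The call above is cut off in the original post. For context, a hedged reconstruction of the Spark 1.x `SQLContext.load` JDBC pattern it appears to be using (host, database, table, and credentials are placeholders, not values from the post):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlctx = SQLContext(sc)

# Hypothetical reconstruction -- every value below is a placeholder.
df = sqlctx.load(
    source="jdbc",
    url="jdbc:postgresql://localhost:5432/dbname?user=usr&password=pswd",
    dbtable="tablename",
)

Even with the call completed like this, it fails until the PostgreSQL JDBC driver is on the classpath, which is what the answers below address.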


        
10 answers
  • 2020-12-01 05:04

    This exception means the JDBC driver is not on the driver classpath. You can pass JDBC driver jars to spark-submit with the --jars parameter, and also add them to the driver classpath using spark.driver.extraClassPath.
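
    As a hedged sketch, the same two settings can also be supplied from a plain Python script when the session is first built (the jar path is an example, not from this answer):

    from pyspark.sql import SparkSession

    # Example path only -- point this at your PostgreSQL JDBC driver jar.
    jar = "/path/to/postgresql-42.2.12.jar"

    # spark.jars ships the jar to the executors; spark.driver.extraClassPath
    # puts it on the driver's classpath so DriverManager can find the driver.
    # Both must be set before the JVM starts, i.e. before getOrCreate().
    spark = (
        SparkSession.builder
        .config("spark.jars", jar)
        .config("spark.driver.extraClassPath", jar)
        .getOrCreate()
    )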

  • 2020-12-01 05:10

    One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.

    This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:

    /usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit\
     --packages org.postgresql:postgresql:9.4.1211\
     --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar\
     --master local[4] main.py
    

    And in main.py:

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    dataframe = spark.read.format('jdbc').options(
            url = "jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
            database='my_db',
            dbtable='my_table'
        ).load()
    
    dataframe.show()
    
  • 2020-12-01 05:10

    To use pyspark with a Jupyter notebook: first open pyspark with

    pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar  --jars /spark_drivers/postgresql-42.2.12.jar
    

    Then, in the Jupyter notebook:

    import os
    from pyspark.sql import SparkSession

    # Expand "~" explicitly -- the JVM will not expand it in a classpath entry.
    jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")

    spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
    url = 'jdbc:postgresql://127.0.0.1/dbname'
    properties = {'user': 'usr', 'password': 'pswd'}
    df = spark.read.jdbc(url=url, table='tablename', properties=properties)
    
  • 2020-12-01 05:12

    It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it to /opt/spark-2.2.0-bin-hadoop2.7/jars.

    I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),

    and it works fine in the pyspark console and in Jupyter.

  • 2020-12-01 05:12

    You normally need one of the following:

    1. install the PostgreSQL driver on your cluster,
    2. provide the PostgreSQL driver jar from your client with the --jars option, or
    3. provide the Maven coordinates of the PostgreSQL driver with the --packages option (see the sketch below).
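
    A minimal sketch of option 3 from Python, assuming you build the session yourself (the driver version is an example):

    from pyspark.sql import SparkSession

    # spark.jars.packages resolves the driver from Maven at startup; the
    # coordinates/version below are an example, not a requirement.
    spark = (
        SparkSession.builder
        .config("spark.jars.packages", "org.postgresql:postgresql:42.2.12")
        .getOrCreate()
    )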

    If you detail how you are launching pyspark, we may be able to give you more details.

    Some clues/ideas:

    spark-cannot-find-the-postgres-jdbc-driver

    Not able to connect to postgres using jdbc in pyspark shell

  • 2020-12-01 05:14

    The following worked for me with postgres on localhost:

    Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.

    For the pyspark shell, use the SPARK_CLASSPATH environment variable:

    $ export SPARK_CLASSPATH=/path/to/downloaded/jar
    $ pyspark
    

    For submitting a script via spark-submit, use the --driver-class-path flag:

    $ spark-submit --driver-class-path /path/to/downloaded/jar script.py
    

    In the Python script, load the tables as a DataFrame as follows:

    from pyspark.sql import DataFrameReader
    
    url = 'postgresql://localhost:5432/dbname'
    properties = {'user': 'username', 'password': 'password'}
    df = DataFrameReader(sqlContext).jdbc(
        url='jdbc:%s' % url, table='tablename', properties=properties
    )
    

    or alternatively:

    df = sqlContext.read.format('jdbc').\
        options(url='jdbc:%s' % url, dbtable='tablename').\
        load()
    

    Note that when submitting the script via spark-submit, you need to define sqlContext yourself.
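
    A minimal sketch of that setup, using the Spark 1.x contexts assumed above:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # The pyspark shell pre-creates sc and sqlContext; a script run through
    # spark-submit must create them itself.
    sc = SparkContext()
    sqlContext = SQLContext(sc)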
