I am trying to connect to a database with pyspark and I am using the following code:
sqlctx = SQLContext(sc)
df = sqlctx.load(
url = "jdbc:postgresql
This exception means the JDBC driver is not on the driver classpath.
You can pass the JDBC jars to spark-submit with the --jars parameter, and also add them to the driver classpath using spark.driver.extraClassPath.
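As a rough sketch of the advice above, the snippet below assembles a spark-submit command line that passes the same jar to both --jars and --driver-class-path. The helper name build_submit_command and the paths are hypothetical and used only for illustration; only the spark-submit flags themselves are real.

```python
import shlex


def build_submit_command(jar_path, script, master=None):
    """Return a spark-submit argument list that ships the JDBC jar with
    --jars and also puts it on the driver classpath."""
    cmd = ["spark-submit", "--jars", jar_path, "--driver-class-path", jar_path]
    if master is not None:
        cmd += ["--master", master]
    cmd.append(script)
    return cmd


# Hypothetical jar path and script name, for illustration only.
command = build_submit_command("/path/to/postgresql.jar", "main.py", master="local[4]")

# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

The list form can also be handed directly to subprocess.run if you prefer to launch the job from Python.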
One approach, building on the example in the quick start guide, is described in this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.
This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:
spark-submit \
 --packages org.postgresql:postgresql:9.4.1211 \
 --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar \
 --master local[4] main.py
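The jar path passed to --driver-class-path can be derived from the --packages coordinate instead of being typed by hand. The sketch below assumes the naming pattern visible in the path quoted in this answer (group_artifact-version.jar under the jars directory of the Ivy home); the function name ivy_jar_path is hypothetical, and the pattern is inferred from that one example rather than taken from the Spark documentation.

```python
import os


def ivy_jar_path(coordinate, ivy_home="~/.ivy2"):
    """Map a 'group:artifact:version' coordinate to the jar path that
    --packages appears to produce, based on the example in this answer."""
    group, artifact, version = coordinate.split(":")
    filename = "%s_%s-%s.jar" % (group, artifact, version)
    # expanduser resolves "~"; an absolute ivy_home is returned unchanged.
    return os.path.join(os.path.expanduser(ivy_home), "jars", filename)


# Ivy home taken from the path quoted in the answer.
path = ivy_jar_path("org.postgresql:postgresql:9.4.1211",
                    ivy_home="/Users/derekhill/.ivy2")
print(path)
```

It is worth checking that the file actually exists (for example with os.path.exists) before passing the path on, since the layout may differ between Spark versions.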
And in main.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.format('jdbc').options(
url = "jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
To use pyspark with a Jupyter notebook, first open pyspark with:
pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar
Then, in the Jupyter notebook:
import os
from pyspark.sql import SparkSession

# "~" is not expanded by the JVM, so resolve it to an absolute path first
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")
spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
It is necessary to copy postgresql-42.1.4.jar to all nodes. In my case, I copied it to /opt/spark-2.2.0-bin-hadoop2.7/jars.
Also, I set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),
and it works fine in the pyspark console and in Jupyter.
You normally need either:
If you describe how you are launching pyspark, we can give you more details.
Some clues/ideas:
Not able to connect to postgres using jdbc in pyspark shell
The following worked for me with postgres on localhost:
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
For the pyspark shell, use the SPARK_CLASSPATH environment variable:
$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark
For submitting a script via spark-submit, use the --driver-class-path option:
$ spark-submit --driver-class-path /path/to/downloaded/jar script.py
In the Python script, load the tables as a DataFrame as follows:
from pyspark.sql import DataFrameReader
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)
or alternatively:
df = sqlContext.read.format('jdbc').\
    options(url='jdbc:%s' % url, dbtable='tablename').\
    load()
Note that when submitting the script via spark-submit, you need to define the sqlContext.
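A minimal sketch of what defining the sqlContext inside a submitted script might look like is shown below. The helper names jdbc_options and load_table are hypothetical, and the host, database and credentials are the placeholders from this answer. Naming the driver class explicitly (org.postgresql.Driver) is an addition that is not in the original answer; it is a common way to help Spark find the driver, but it still requires the jar to be on the classpath as described above.

```python
def jdbc_options(host, port, dbname, table, user, password):
    """Build the option dict for sqlContext.read.format('jdbc')."""
    return {
        "url": "jdbc:postgresql://%s:%d/%s" % (host, port, dbname),
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumption: name the PostgreSQL driver class explicitly.
        "driver": "org.postgresql.Driver",
    }


def load_table(options):
    """Create the SQLContext (not predefined under spark-submit) and load the table."""
    # Imported here so jdbc_options can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    return sqlContext.read.format("jdbc").options(**options).load()


# Placeholder connection details, matching the example values above.
opts = jdbc_options("localhost", 5432, "dbname", "tablename", "username", "password")
print(opts["url"])
```

In a script run with spark-submit, calling load_table(opts) would then return the DataFrame, provided the driver jar was supplied with --driver-class-path.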