Working with jdbc jar in pyspark

Question

I need to read from a PostgreSQL database in pyspark. I know this has been asked before, such as here, here and in many other places. However, the solutions there either use a jar in the local running directory or copy it to all workers manually.

I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:

pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar

Inside pyspark I did:

df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name").load()
df.count()

However, although --jars and --driver-class-path worked fine for jars I created myself, they failed for the JDBC driver, and I got this exception from the workers:

 java.lang.IllegalStateException: Did not find registered driver with class org.postgresql.Driver

If I copy the jar manually to all workers and add --conf spark.executor.extraClassPath and --conf spark.driver.extraClassPath, it does work (with the same jar). The documentation, by the way, suggests using SPARK_CLASSPATH, which is deprecated and in effect sets these two options (but has the side effect of preventing me from adding OTHER jars with the --jars option, which I need to do).
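For reference, the workaround described above can be sketched as a small Python helper that assembles the pyspark command line. The helper name build_pyspark_command is hypothetical and only for illustration; the flags and configuration keys are the ones named in the question. It assumes the jar has already been copied to the same path on every node.

```python
import shlex


def build_pyspark_command(jar_path, master="yarn"):
    """Return the argv list for the manual-copy workaround.

    Assumes jar_path already exists at the same location on the
    driver and on every worker node (hypothetical helper).
    """
    return [
        "pyspark",
        "--master", master,
        # Put the pre-copied jar on the driver's classpath.
        "--conf", "spark.driver.extraClassPath=" + jar_path,
        # Put the pre-copied jar on each executor's classpath.
        "--conf", "spark.executor.extraClassPath=" + jar_path,
    ]


if __name__ == "__main__":
    argv = build_pyspark_command("/tmp/jars/postgresql-9.4.1208.jar")
    # Print a shell-safe command line that can be pasted into a terminal.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Building the command as a list rather than a single string keeps each argument separate, so paths containing spaces are quoted correctly when printed.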

So my question is: what is special about the JDBC driver that makes it not work, and how can I add it without having to manually copy it to all workers?

Update:

I did some more looking and found this in the documentation: "The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs."

The problem is that I can't find compute_classpath.sh, nor do I understand what the primordial class loader means.

I did find this which basically explains that this needs to be done locally. I also found this which basically says there is a fix but it is not yet available in version 1.6.1.

Answer 1:

I found a solution that works (I don't know if it is the best one, so feel free to continue commenting). Apparently, if I add the option driver="org.postgresql.Driver", the read works properly. My full line (inside pyspark) is:

df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name",driver="org.postgresql.Driver").load()
df.count()
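The same call can be sketched with the options built in one place, so the driver class is never forgotten. The helper name postgres_jdbc_options is hypothetical, and port 5432 is an assumption (the PostgreSQL default) standing in for the "port" placeholder above. URL-encoding the credentials is my addition, on the assumption that the driver decodes query parameters; the original line pastes them in as-is.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def postgres_jdbc_options(host, port, database, table, user, password):
    """Build keyword options for sqlContext.read.format("jdbc") (hypothetical helper)."""
    # Encode the credentials so characters such as '&' or '@' in a password
    # do not break the query string.
    credentials = urlencode([("user", user), ("password", password)])
    url = "jdbc:postgresql://{0}:{1}/{2}?{3}".format(host, port, database, credentials)
    return {
        "url": url,
        "dbtable": table,
        # Naming the driver class explicitly is what made the read work above.
        "driver": "org.postgresql.Driver",
    }


# Placeholder values taken from the question; 5432 is an assumed port.
opts = postgres_jdbc_options("ip_address", 5432, "db_name", "table_name", "myuser", "mypasswd")

# Inside a pyspark shell, where sqlContext already exists:
# df = sqlContext.read.format("jdbc").options(**opts).load()
# df.count()
```

The jar still has to be passed with --jars and --driver-class-path as in the question; this only covers the options given to the reader.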

Another thing: if you are already using a fat jar of your own (I am, in my full application), then all you need to do is add the JDBC driver to your pom file like this:

    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>9.4.1208</version>
    </dependency>

and then you don't have to add the driver as a separate jar, just use the jar with dependencies.

Answer 2:

What version of the documentation are you looking at? It seems compute-classpath.sh was removed a while back. It is still present in Spark 1.3.1, but the same search against 1.4.0 finds nothing:

$ unzip -l spark-1.3.1.zip | egrep '\.sh' | egrep classpa
 6592  2015-04-11 00:04   spark-1.3.1/bin/compute-classpath.sh

$ unzip -l spark-1.4.0.zip | egrep '\.sh' | egrep classpa

produces nothing.

I think you should be using load-spark-env.sh to set your classpath:

$ /opt/spark-1.6.0-bin-hadoop2.6/bin/load-spark-env.sh

and you'll need to set SPARK_CLASSPATH in your $SPARK_HOME/conf/spark-env.sh file (which you'll copy over from the template file $SPARK_HOME/conf/spark-env.sh.template).

Answer 3:

I think that this is another manifestation of the issue described and fixed here: https://github.com/apache/spark/pull/12000. I authored that fix 3 weeks ago and there has been no movement on it. Maybe if others also express the fact that they have been affected by it, it may help?

Source: https://stackoverflow.com/questions/36326066/working-with-jdbc-jar-in-pyspark

Tags

postgresql

jdbc

apache-spark

pyspark

pyspark-sql