Question
I need to read from a PostgreSQL database in PySpark. I know this has been asked before, such as here, here and in many other places; however, the solutions there either use a jar in the local running directory or copy it to all workers manually.
I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:
pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar
Inside pyspark I did:
df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name").load()
df.count()
However, while --jars and --driver-class-path worked fine for jars I created, they failed for the JDBC driver, and I got an exception from the workers:
java.lang.IllegalStateException: Did not find registered driver with class org.postgresql.Driver
If I copy the jar manually to all workers and add --conf spark.executor.extraClassPath and --conf spark.driver.extraClassPath, it does work (with the same jar). Incidentally, the documentation suggests using SPARK_CLASSPATH, which is deprecated and actually just adds these two switches (but it has the side effect of preventing me from adding OTHER jars with the --jars option, which I need to do).
So my question is: what is special about the JDBC driver that makes it not work, and how can I add it without having to manually copy it to all workers?
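For reference, the manual workaround described above could be scripted roughly as follows. This is only a sketch: build_pyspark_command is a hypothetical helper name, and it assumes the driver jar has already been copied to the same path on every node. It only assembles the command line; it does not distribute the jar.

```python
import shlex


def build_pyspark_command(jar_path, master="yarn"):
    """Return the argv list for launching pyspark with the jar on both class paths.

    Assumption: jar_path exists at the same location on the driver and on
    every worker node (the manual-copy workaround).
    """
    return [
        "pyspark",
        "--master", master,
        # Put the JDBC driver on the driver JVM's class path.
        "--conf", "spark.driver.extraClassPath=" + jar_path,
        # Put the JDBC driver on each executor JVM's class path.
        "--conf", "spark.executor.extraClassPath=" + jar_path,
    ]


if __name__ == "__main__":
    cmd = build_pyspark_command("/tmp/jars/postgresql-9.4.1208.jar")
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list could also be handed to subprocess, but printing it keeps the sketch free of side effects.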
Update:
I did some more looking and found this in the documentation: "The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs."
The problem is that I can't find compute_classpath.sh, nor do I understand what the primordial class loader is.
I did find this, which basically explains that this needs to be done locally. I also found this, which basically says there is a fix but that it is not yet available in version 1.6.1.
Answer 1:
I found a solution that works (I don't know if it is the best one, so feel free to keep commenting). Apparently, if I add the option driver="org.postgresql.Driver", this works properly; i.e., my full line (inside pyspark) is:
df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name",driver="org.postgresql.Driver").load()
df.count()
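The same call can be wrapped so that the driver option is never forgotten. The sketch below is illustrative only: build_jdbc_options and load_table are hypothetical helper names, the credentials stay in the URL exactly as in the line above, and special characters in the user name or password are not URL-encoded.

```python
def build_jdbc_options(host, port, database, table, user, password,
                       driver="org.postgresql.Driver"):
    """Build the keyword options for the JDBC data source.

    The driver class is always included, since naming it explicitly is
    what made the read work in the answer above.
    """
    url = "jdbc:postgresql://{}:{}/{}?user={}&password={}".format(
        host, port, database, user, password)
    return {"url": url, "dbtable": table, "driver": driver}


def load_table(sql_context, options):
    """Load a table through JDBC; sql_context is an existing SQLContext."""
    return sql_context.read.format("jdbc").options(**options).load()
```

Inside pyspark this would be used as df = load_table(sqlContext, build_jdbc_options("ip_address", "port", "db_name", "table_name", "myuser", "mypasswd")).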
Another thing: if you are already using a fat jar of your own (I am, in my full application), then all you need to do is add the JDBC driver to your pom file like so:
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>9.4.1208</version>
</dependency>
and then you don't have to add the driver as a separate jar; just use the jar with dependencies.
Answer 2:
What version of the documentation are you looking at?
It seems that compute-classpath.sh was deprecated a while back. Spark 1.3.1 still ships it:
$ unzip -l spark-1.3.1.zip | egrep '\.sh' | egrep classpa
6592 2015-04-11 00:04 spark-1.3.1/bin/compute-classpath.sh
$ unzip -l spark-1.4.0.zip | egrep '\.sh' | egrep classpa
produces nothing, so the script no longer ships with Spark 1.4.0.
I think you should be using load-spark-env.sh to set your classpath:
$/opt/spark-1.6.0-bin-hadoop2.6/bin/load-spark-env.sh
and you'll need to set SPARK_CLASSPATH in your $SPARK_HOME/conf/spark-env.sh file (which you'll copy over from the template file $SPARK_HOME/conf/spark-env.sh.template).
Answer 3:
I think that this is another manifestation of the issue described and fixed here: https://github.com/apache/spark/pull/12000. I authored that fix 3 weeks ago and there has been no movement on it. Maybe it would help if others who have been affected by it also said so?
Source: https://stackoverflow.com/questions/36326066/working-with-jdbc-jar-in-pyspark