Question
I'm working on a project to bulk-load data from a CSV file into HBase using Spark Streaming. The code I'm using is as follows (adapted from here):
def bulk_load(rdd):
    # HBase OutputFormat configuration (contents removed for brevity)
    conf = {}  # removed for brevity
    # Converters from the Spark examples jar that turn the Python
    # key/value pairs into HBase writables
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    # Split each batch into lines, then map every CSV line to key/value pairs
    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)
    load_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
Everything up to and including the two flatMaps works as expected. However, when trying to execute saveAsNewAPIHadoopDataset, I get the following runtime error:
java.lang.ClassNotFoundException: org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
I have set PYTHONPATH to point to the jar containing this class (as well as my other converter class), but this does not seem to have improved the situation. Any advice would be greatly appreciated. Thanks in advance.
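For reference, csv_to_key_value is expected to produce pairs in the shape the two converters above understand: a row-key string as the key, and a list of strings [rowkey, column family, qualifier, value] as the value. A minimal sketch of such a function (hypothetical; the real parsing logic is omitted here):
def csv_to_key_value(line):
    # Hypothetical schema: the CSV columns are rowkey,value1,value2
    rowkey, v1, v2 = line.split(",")
    # StringListToPutConverter expects each value as a list of strings:
    # [rowkey, column family, qualifier, value] -- one pair per cell
    return [(rowkey, [rowkey, "cf", "v1", v1]),
            (rowkey, [rowkey, "cf", "v2", v2])]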
Answer 1:
Took some digging, but here's the solution:
The jars did not need to be added to PYTHONPATH as I thought, but rather to the Spark config. I added the following properties to the config (Custom spark-defaults under Ambari); an example of the resulting entries follows the jar list below:
spark.driver.extraClassPath and spark.executor.extraClassPath
To each of these I added the following jars:
/usr/hdp/2.3.2.0-2950/spark/lib/spark-examples-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-common-1.1.2.2.3.2.0-2950.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-client-1.1.2.2.3.2.0-2950.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-protocol-1.1.2.2.3.2.0-2950.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/guava-12.0.1.jar
/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-server-1.1.2.2.3.2.0-2950.jar
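For illustration, the resulting entries in spark-defaults.conf look roughly like this; each property takes a single colon-separated classpath, and the list here is abbreviated to the first two jars:
spark.driver.extraClassPath   /usr/hdp/2.3.2.0-2950/spark/lib/spark-examples-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar:/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-common-1.1.2.2.3.2.0-2950.jar:...
spark.executor.extraClassPath /usr/hdp/2.3.2.0-2950/spark/lib/spark-examples-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar:/usr/hdp/2.3.2.0-2950/hbase/lib/hbase-common-1.1.2.2.3.2.0-2950.jar:...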
Adding these jars allowed Spark to see all the necessary classes on both the driver and the executors.
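Note that after changing these properties under Ambari, the Spark service has to be restarted for the new classpath to take effect. For a one-off job, passing the same jars at submit time, via --driver-class-path and --conf spark.executor.extraClassPath=... on the spark-submit command line, should work as well.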
Source: https://stackoverflow.com/questions/34898054/spark-streaming-with-python-class-not-found-exception