I have an embarrassingly parallel task for which I use Spark to distribute the computations. These computations are in Python, and I use PySpark to read and preprocess the data.
I found this comment by one of the makers of hbase-spark, which seems to suggest there is a way to use PySpark to query HBase using Spark SQL.

And indeed, the pattern described here can be applied to query HBase with Spark SQL using PySpark, as the following example shows:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = 'org.apache.hadoop.hbase.spark'

df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# ''.join(...split()) strips all whitespace, so the catalog can be written
# as a readable multi-line JSON string (safe here: no value contains spaces).
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())

# Writing (alternatively: .option('catalog', catalog))
df.write\
    .options(catalog=catalog)\
    .format(data_source_format)\
    .save()

# Reading
df = sqlc.read\
    .options(catalog=catalog)\
    .format(data_source_format)\
    .load()
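Incidentally, instead of stripping whitespace out of a hand-written multi-line string, the catalog can also be built as a plain Python dict and serialized with json.dumps from the standard library. A small alternative sketch:

import json

# Same catalog as above, expressed as a dict and serialized to JSON.
catalog = json.dumps({
    "table": {"namespace": "default", "name": "testtable"},
    "rowkey": "key",
    "columns": {
        "col0": {"cf": "rowkey", "col": "key", "type": "string"},
        "col1": {"cf": "cf", "col": "col1", "type": "string"}
    }
})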
I've tried hbase-spark-1.2.0-cdh5.7.0.jar (as distributed by Cloudera) for this, but ran into trouble (org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select when writing, java.util.NoSuchElementException: None.get when reading). As it turns out, the current version of CDH does not include the hbase-spark changes that allow Spark SQL-HBase integration.
What does work for me is the shc Spark package, found here. The only change I had to make to the above script is the data source format:
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
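Note that shc can also create the target HBase table on the first write: per my reading of the shc README, you pass a newtable option with the desired number of regions. I only used pre-existing tables, so take the option name and value below as an assumption lifted from the README rather than something I ran:

# Writing with shc; the 'newtable' option (number of regions) asks shc to
# create the HBase table if it does not exist yet (per the shc README).
df.write\
    .options(catalog=catalog, newtable='5')\
    .format(data_source_format)\
    .save()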
Here's how I submit the above script on my CDH cluster, following the example from the shc README:
spark-submit \
    --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 \
    --repositories http://repo.hortonworks.com/content/groups/public/ \
    --files /opt/cloudera/parcels/CDH/lib/hbase/conf/hbase-site.xml \
    example.py
Most of the work on shc seems to have already been merged into the hbase-spark module of HBase, for release in version 2.0. With that, Spark SQL querying of HBase is possible using the above-mentioned pattern (see https://hbase.apache.org/book.html#_sparksql_dataframes for details). My example above shows what it looks like for PySpark users.
Finally, a caveat: my example data above contains only strings. Python data conversion is not supported by shc, so I ran into problems with integers and floats either not showing up in HBase at all or showing up as strange values.
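If you hit this, one possible workaround (a minimal sketch, my own assumption rather than anything shc documents) is to cast every non-string column to a string before writing, so only string values cross the connector boundary:

from pyspark.sql.functions import col

# Cast all non-string columns to strings before writing to HBase, since
# shc does not convert Python numeric types (workaround sketch).
for name, dtype in df.dtypes:
    if dtype != 'string':
        df = df.withColumn(name, col(name).cast('string'))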