How to connect HBase and Spark using Python?

挽巷 2020-12-01 00:31

I have an embarrassingly parallel task for which I use Spark to distribute the computations. These computations are in Python, and I use PySpark to read and preprocess the d

1 Answer
  • 2020-12-01 00:55

    I found this comment by one of the makers of hbase-spark, which seems to suggest there is a way to use PySpark to query HBase using Spark SQL.

    And indeed, the pattern described here can be applied to query HBase with Spark SQL using PySpark, as the following example shows:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    
    sc = SparkContext()
    sqlc = SQLContext(sc)
    
    data_source_format = 'org.apache.hadoop.hbase.spark'
    
    df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])
    
    # ''.join(s.split()) strips all whitespace, so the JSON string can be written across multiple lines here.
    catalog = ''.join("""{
        "table":{"namespace":"default", "name":"testtable"},
        "rowkey":"key",
        "columns":{
            "col0":{"cf":"rowkey", "col":"key", "type":"string"},
            "col1":{"cf":"cf", "col":"col1", "type":"string"}
        }
    }""".split())
    
    
    # Writing (alternatively: .option('catalog', catalog))
    df.write\
    .options(catalog=catalog)\
    .format(data_source_format)\
    .save()
    
    # Reading
    df = sqlc.read\
    .options(catalog=catalog)\
    .format(data_source_format)\
    .load()
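As a side note, the whitespace-stripping trick used for the catalog can be avoided by building the catalog as a Python dict and serializing it with the standard `json` module. This is only a sketch of an alternative way to produce the same string; the name `catalog_dict` is mine and not part of hbase-spark or shc.

```python
import json

# Same catalog as in the script, expressed as a plain Python dict.
# (catalog_dict is an illustrative name, not an API of hbase-spark or shc.)
catalog_dict = {
    "table": {"namespace": "default", "name": "testtable"},
    "rowkey": "key",
    "columns": {
        "col0": {"cf": "rowkey", "col": "key", "type": "string"},
        "col1": {"cf": "cf", "col": "col1", "type": "string"},
    },
}

# Compact separators produce a single-line JSON string without whitespace,
# matching what ''.join(...split()) yields for the multi-line literal.
catalog = json.dumps(catalog_dict, separators=(",", ":"))
print(catalog)
```

The resulting `catalog` string can be passed to `.options(catalog=catalog)` exactly as before. Building it from a dict also makes it easier to generate the column mappings programmatically.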
    

    I've tried hbase-spark-1.2.0-cdh5.7.0.jar (as distributed by Cloudera) for this, but ran into trouble (org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select when writing, java.util.NoSuchElementException: None.get when reading). As it turns out, the present version of CDH does not include the changes to hbase-spark that allow Spark SQL-HBase integration.

    What does work for me is the shc Spark package, found here. The only change I had to make to the script above was the data source format:

    data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
    

    Here's how I submit the above script on my CDH cluster, following the example from the shc README:

    spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /opt/cloudera/parcels/CDH/lib/hbase/conf/hbase-site.xml example.py
    

    Most of the work on shc seems to already be merged into the hbase-spark module of HBase, for release in version 2.0. With that, Spark SQL querying of HBase is possible using the above-mentioned pattern (see: https://hbase.apache.org/book.html#_sparksql_dataframes for details). My example above shows what it looks like for PySpark users.

    Finally, a caveat: my example data above contains only strings. Python data conversion is not supported by shc, so integers and floats either did not show up in HBase or showed up with weird values.
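One possible workaround, assuming it is acceptable to store every value as a string (as the example data does), is to convert the values in plain Python before creating the DataFrame. The helper `to_string_row` below is hypothetical and written only for illustration; it is not part of shc or PySpark.

```python
def to_string_row(row):
    """Return a copy of `row` with every non-None value converted to str.

    Hypothetical helper for illustration; not part of shc or PySpark.
    """
    return tuple(None if value is None else str(value) for value in row)


# Mixed Python types, as they might come out of a computation.
rows = [('a', 1.0), ('b', 2)]

# Convert everything to strings so the catalog can declare "type":"string".
string_rows = [to_string_row(r) for r in rows]
print(string_rows)
```

`string_rows` can then be handed to `sc.parallelize(...).toDF(...)` as in the script above. The trade-off is that values come back as strings when read, so they need to be cast again on the reading side.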
