How to process a range of hbase rows using spark?

無奈伤痛 2021-02-04 14:04

I am trying to use HBase as a data source for Spark. So the first step turns out to be creating an RDD from an HBase table. Since Spark works with Hadoop input formats, I could fi

3 Answers
  •  梦毁少年i
    2021-02-04 14:09

    Here is an example of using an HBase Scan with Spark:

    import java.io.{ByteArrayOutputStream, DataOutputStream}
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Base64
    
    // Serialize the Scan into the Base64 string that TableInputFormat expects
    // in the job configuration. Scan.write relies on the Writable interface,
    // which Scan implements in HBase 0.94 and earlier (see the note below).
    def convertScanToString(scan: Scan): String = {
      val out = new ByteArrayOutputStream
      val dos = new DataOutputStream(out)
      scan.write(dos)
      Base64.encodeBytes(out.toByteArray)
    }
    
    val conf = HBaseConfiguration.create()
    val scan = new Scan()
    scan.setCaching(500)        // fetch 500 rows per RPC round trip
    scan.setCacheBlocks(false)  // don't fill the block cache during a full scan
    conf.set(TableInputFormat.INPUT_TABLE, "table_name")
    conf.set(TableInputFormat.SCAN, convertScanToString(scan))
    
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    rdd.count
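
    Since the question is about a range of rows, you can bound the Scan with start and stop keys before serializing it. A minimal sketch, using hypothetical row keys (the stop row is exclusive):

    import org.apache.hadoop.hbase.util.Bytes
    
    // Restrict the scan to [row-0100, row-0200); these keys are placeholders.
    // setStartRow/setStopRow match the HBase 0.94-era API used above.
    scan.setStartRow(Bytes.toBytes("row-0100"))
    scan.setStopRow(Bytes.toBytes("row-0200"))
    conf.set(TableInputFormat.SCAN, convertScanToString(scan))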
    

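    To actually process the rows rather than just count them, map over the (ImmutableBytesWritable, Result) pairs. A sketch assuming a hypothetical column family "cf" and qualifier "col":

    import org.apache.hadoop.hbase.util.Bytes
    
    // Extract plain strings on the executors before collecting anything:
    // Result is not java-serializable, so pull out only what you need.
    // "cf" and "col" are placeholder names for this example.
    val values = rdd.map { case (key, result) =>
      val rowKey = Bytes.toString(key.get)
      val value  = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
      (rowKey, value)
    }
    values.take(5).foreach(println)
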
    You need to add the relevant HBase libraries to the Spark classpath and make sure they are compatible with your Spark version. Tip: you can run hbase classpath to find them.
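
    Note that Scan stopped implementing Writable in HBase 0.96, so the scan.write call above only compiles against HBase 0.94 and earlier. On newer versions you can let HBase serialize the Scan for you; a sketch using TableMapReduceUtil.convertScanToString:

    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
    
    // HBase 0.96+: the Scan is serialized via protobuf under the hood.
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))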
