How to process a range of hbase rows using spark?

無奈伤痛 2021-02-04 14:04

I am trying to use HBase as a data source for Spark. The first step is to create an RDD from an HBase table. Since Spark works with Hadoop input formats, I could find a way to read all the rows of a table with TableInputFormat, but how do I create an RDD that covers only a range of rows (i.e. a range scan)?

3 Answers
  • 2021-02-04 14:09

    Here is an example of using Scan in Spark:

    import java.io.{DataOutputStream, ByteArrayOutputStream}
    import java.lang.String
    import org.apache.hadoop.hbase.client.Scan
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Base64
    
    // Serialize the Scan so it can be passed to TableInputFormat through the Configuration.
    // This relies on Scan implementing Writable (HBase 0.94-era API).
    def convertScanToString(scan: Scan): String = {
      val out: ByteArrayOutputStream = new ByteArrayOutputStream
      val dos: DataOutputStream = new DataOutputStream(out)
      scan.write(dos)
      Base64.encodeBytes(out.toByteArray)
    }
    
    val conf = HBaseConfiguration.create()
    val scan = new Scan()
    scan.setCaching(500)
    scan.setCacheBlocks(false)
    conf.set(TableInputFormat.INPUT_TABLE, "table_name")
    conf.set(TableInputFormat.SCAN, convertScanToString(scan))
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    rdd.count
    

    You need to add the relevant HBase libraries to the Spark classpath and make sure they are compatible with your Spark version. Tip: you can run hbase classpath to find them.
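
    Note that on HBase 0.96 and later, Scan no longer implements Writable, so scan.write(dos) will not compile. Assuming one of those versions, a minimal sketch of the same RDD creation using TableMapReduceUtil.convertScanToString (the method the Java answer below relies on) would look like this:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
    
    val conf = HBaseConfiguration.create()
    val scan = new Scan()
    scan.setCaching(500)
    scan.setCacheBlocks(false)
    
    conf.set(TableInputFormat.INPUT_TABLE, "table_name")
    // TableMapReduceUtil performs the Scan serialization that convertScanToString did above
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
    
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    rdd.count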

  • 2021-02-04 14:28

    You can set the configuration properties below:

     val conf = HBaseConfiguration.create() // also set any other HBase params you need
     conf.set(TableInputFormat.SCAN_ROW_START, "row2")
     conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey")
    

    This will load an RDD containing only the records in that row-key range.
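
    A minimal sketch of how these properties plug into newAPIHadoopRDD, assuming sc is an existing SparkContext (e.g. in spark-shell) and that table_name, row2 and stoprowkey are placeholders:

     import org.apache.hadoop.hbase.HBaseConfiguration
     import org.apache.hadoop.hbase.client.Result
     import org.apache.hadoop.hbase.io.ImmutableBytesWritable
     import org.apache.hadoop.hbase.mapreduce.TableInputFormat
     
     val conf = HBaseConfiguration.create()                 // plus any other HBase params you need
     conf.set(TableInputFormat.INPUT_TABLE, "table_name")   // placeholder table name
     conf.set(TableInputFormat.SCAN_ROW_START, "row2")      // start row key (inclusive)
     conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey") // stop row key (exclusive)
     
     val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
       classOf[ImmutableBytesWritable], classOf[Result])
     rdd.count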

  • 2021-02-04 14:29

    Here is a Java example with TableMapReduceUtil.convertScanToString(Scan scan):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    
    import java.io.IOException;
    
    public class HbaseScan {
    
        public static void main(String ... args) throws IOException, InterruptedException {
    
            // Spark conf
            SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("My App");
            JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    
            // Hbase conf
            Configuration conf = HBaseConfiguration.create();
            conf.set(TableInputFormat.INPUT_TABLE, "big_table_name");
    
            // Create scan
            Scan scan = new Scan();
            scan.setCaching(500);
            scan.setCacheBlocks(false);
            // Row range: start key is inclusive, stop key is exclusive.
            // (In HBase 2.x these setters are deprecated in favour of withStartRow/withStopRow.)
            scan.setStartRow(Bytes.toBytes("a"));
            scan.setStopRow(Bytes.toBytes("d"));
    
    
            // Submit scan into hbase conf
            conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
    
            // Get RDD
            JavaPairRDD<ImmutableBytesWritable, Result> source = jsc
                    .newAPIHadoopRDD(conf, TableInputFormat.class,
                            ImmutableBytesWritable.class, Result.class);
    
            // Process RDD
            System.out.println(source.count());
        }
    }
    