How to process a range of HBase rows using Spark?

無奈伤痛 2021-02-04 14:04

I am trying to use HBase as a data source for Spark. So the first step turns out to be creating an RDD from an HBase table. Since Spark works with Hadoop input formats, I could fi

3 answers
  •  面向向阳花
    2021-02-04 14:29

    Here is a Java example with TableMapReduceUtil.convertScanToString(Scan scan):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    
    import java.io.IOException;
    
    public class HbaseScan {
    
        public static void main(String ... args) throws IOException, InterruptedException {
    
            // Spark conf
            SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("My App");
            JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    
            // Hbase conf
            Configuration conf = HBaseConfiguration.create();
            conf.set(TableInputFormat.INPUT_TABLE, "big_table_name");
    
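            // Create a scan restricted to the row range ["a", "d"):
            // the start row is inclusive, the stop row is exclusive
            Scan scan = new Scan();
            scan.setCaching(500);        // rows fetched per RPC
            scan.setCacheBlocks(false);  // do not fill the block cache during a bulk scan
            // On HBase 2.x these setters are deprecated in favour of withStartRow/withStopRow
            scan.setStartRow(Bytes.toBytes("a"));
            scan.setStopRow(Bytes.toBytes("d"));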
    
    
            // Submit scan into hbase conf
            conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
    
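            // Get an RDD of (row key, row result) pairs
            JavaPairRDD<ImmutableBytesWritable, Result> source = jsc
                    .newAPIHadoopRDD(conf, TableInputFormat.class,
                            ImmutableBytesWritable.class, Result.class);
    
            // Process RDD: count the rows that fall inside the scanned range
            System.out.println(source.count());
            jsc.stop();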
        }
    }
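
To make it clearer which rows a scan from "a" to "d" actually returns, here is a small plain-Java sketch of the range check. It does not use HBase at all. It only mimics the rule that row keys are compared as unsigned bytes in lexicographic order, with the start row included and the stop row excluded. The class name, the helper methods, and the sample row keys are all made up for illustration, and `Arrays.compareUnsigned` needs Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only: no HBase classes are used here.
final class ScanRangeSketch {

    // Hypothetical helper: true when startRow <= rowKey < stopRow,
    // comparing keys as unsigned bytes in lexicographic order.
    static boolean inRange(byte[] rowKey, byte[] startRow, byte[] stopRow) {
        return Arrays.compareUnsigned(rowKey, startRow) >= 0
                && Arrays.compareUnsigned(rowKey, stopRow) < 0;
    }

    // Hypothetical helper: keeps the made-up row keys that fall in [start, stop).
    static List<String> select(List<String> rowKeys, String start, String stop) {
        byte[] startRow = start.getBytes(StandardCharsets.UTF_8);
        byte[] stopRow = stop.getBytes(StandardCharsets.UTF_8);
        List<String> selected = new ArrayList<>();
        for (String key : rowKeys) {
            if (inRange(key.getBytes(StandardCharsets.UTF_8), startRow, stopRow)) {
                selected.add(key);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Sample keys are invented for this example.
        List<String> rowKeys = Arrays.asList("Zed", "a", "apple", "c999", "d", "dog");
        System.out.println(select(rowKeys, "a", "d")); // prints [a, apple, c999]
    }
}
```

Note that "d" itself and anything after it (like "dog") are left out, and uppercase keys such as "Zed" sort before "a", so they are not in the range either. If you need every key starting with "d" as well, the stop row has to be moved further, for example to "e".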
    
