What is the fastest way to bulk load data into HBase programmatically?

Asked 2020-12-07 23:36

I have a plain text file with possibly millions of lines that needs custom parsing, and I want to load it into an HBase table as fast as possible (using Hadoop or the HBase Java client).

2 Answers
  • 2020-12-07 23:53

    One interesting thing is that during insertion of 1,000,000 rows, 25 Mappers (tasks) are spawned but they run serially (one after another); is this normal?

    The mapreduce.tasktracker.map.tasks.maximum parameter, which defaults to 2, determines the maximum number of map tasks that can run in parallel on a node. Unless you change it, you should see at most 2 map tasks running simultaneously on each node, which is why your 25 mappers appear to run one after another; the snippet below shows where this is set.
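    If you have spare CPU on your nodes, raising that limit lets more mappers run at once. It is a per-node daemon setting, so it goes into mapred-site.xml on each TaskTracker (not into the job configuration) and needs a TaskTracker restart to take effect; this assumes classic MapReduce with TaskTrackers, and the value of 4 below is only an example:

    <property>
      <name>mapreduce.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>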

  • 2020-12-08 00:02

    I've gone through a process that is probably very similar to yours, trying to find an efficient way to load data from an MR job into HBase. What I found to work is using HFileOutputFormat as the OutputFormatClass of the MR job.

    Below is the basis of the code I use to generate the job, along with the Mapper map function that writes out the data. This was fast. We don't use it anymore, so I don't have numbers on hand, but it was around 2.5 million records loaded in under a minute.

    Here is the (stripped-down) function I wrote to generate the job for my MapReduce process to put the data into HBase:

    private Job createCubeJob(...) {
        // Build and configure the job
        Job job = new Job(conf);
        job.setJobName(jobName);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setMapperClass(HiveToHBaseMapper.class); // custom Mapper
        job.setJarByClass(CubeBuilderDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(HFileOutputFormat.class); // write HFiles instead of live Puts
    
        TextInputFormat.setInputPaths(job, hiveOutputDir);
        HFileOutputFormat.setOutputPath(job, cubeOutputPath);
    
        Configuration hConf = HBaseConfiguration.create(conf);
        hConf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
        hConf.set("hbase.zookeeper.property.clientPort", hbaseZookeeperClientPort);
    
        HTable hTable = new HTable(hConf, tableName);
    
        // Sets up the sorting reducer, TotalOrderPartitioner and reduce-task count
        // from the target table's region boundaries
        HFileOutputFormat.configureIncrementalLoad(job, hTable);
        return job;
    }
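
    Note that HFileOutputFormat only writes HFiles into cubeOutputPath; after the job completes they still have to be handed to HBase. A minimal sketch of that step, reusing the hConf, tableName, and cubeOutputPath from above (LoadIncrementalHFiles, from org.apache.hadoop.hbase.mapreduce, is the programmatic equivalent of the completebulkload tool; error handling omitted):

    HTable hTable = new HTable(hConf, tableName);
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hConf);
    // Moves the generated HFiles into the regions of the target table
    loader.doBulkLoad(cubeOutputPath, hTable);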
    

    This is my map function from the HiveToHBaseMapper class (slightly edited).

    public void map(WritableComparable key, Writable val, Context context)
            throws IOException, InterruptedException {
        try {
            Configuration config = context.getConfiguration();
            String[] strs = val.toString().split(Constants.HIVE_RECORD_COLUMN_SEPARATOR);
            String family = config.get(Constants.CUBEBUILDER_CONFIGURATION_FAMILY);
            String column = strs[COLUMN_INDEX];
            double value = Double.parseDouble(strs[VALUE_INDEX]);
            String sKey = generateKey(strs, config);
            byte[] bKey = Bytes.toBytes(sKey);
            Put put = new Put(bKey);
            // Non-positive values are stored as Double.MIN_VALUE as a sentinel
            put.add(Bytes.toBytes(family), Bytes.toBytes(column), (value <= 0)
                            ? Bytes.toBytes(Double.MIN_VALUE)
                            : Bytes.toBytes(value));
    
            ImmutableBytesWritable ibKey = new ImmutableBytesWritable(bKey);
            context.write(ibKey, put);
    
            context.getCounter(CubeBuilderContextCounters.CompletedMapExecutions).increment(1);
        }
        catch (Exception e) {
            // Bad records are counted and skipped rather than failing the task
            context.getCounter(CubeBuilderContextCounters.FailedMapExecutions).increment(1);
        }
    
    }
    

    I'm pretty sure this isn't going to be a copy-and-paste solution for you. Obviously the data I was working with here didn't need any custom processing (that was done in an MR job before this one). The main thing I want to get across is the HFileOutputFormat; the rest is just an example of how I used it. :)
    I hope it gets you onto a solid path to a good solution. :)
