HBase-Spark Connector: connection to HBase established for every scan?

Submitted by 大憨熊 on 2019-12-02 11:41:41

Question


I am using Cloudera's HBase-Spark connector to run intensive HBase or BigTable scans. It works, but judging from Spark's detailed logs, the code appears to re-establish a connection to HBase on every call that processes the results of a Scan(), which I do via JavaHBaseContext.foreachPartition().

Am I right to think that this code re-establishes a connection to HBase every time? If so, how can I rewrite it to make sure I reuse the already established connection?

Here's the full sample code that produces this behavior:

import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

import java.util.Iterator;

public class Main
{   
    public static void main(String[] args) throws Exception
    {

        SparkConf sc = new SparkConf().setAppName(Main.class.toString()).setMaster("local");        
        Configuration hBaseConf = HBaseConfiguration.create();
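        // NOTE: this driver-side connection is never shipped to executors; below it
        // is only used to look up the table name.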
        Connection hBaseConn = ConnectionFactory.createConnection(hBaseConf);

        JavaSparkContext jSPContext = new JavaSparkContext(sc);
        JavaHBaseContext hBaseContext = new JavaHBaseContext(jSPContext, hBaseConf);

        int numTries = 5;
        byte[] rowKey = "ffec939d-bb21-4525-b1ff-f3143faae2".getBytes();
        for(int i = 0; i < numTries; i++)
        {
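            // Build a key-only scan limited to 5 rows that share the rowKey prefix.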
            Scan s = new Scan(rowKey);
            FilterList fList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
            fList.addFilter(new KeyOnlyFilter());
            fList.addFilter(new FirstKeyOnlyFilter());
            fList.addFilter(new PageFilter(5));
            fList.addFilter(new PrefixFilter(rowKey));
            s.setFilter(fList);
            s.setCaching(5);            

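            // hbaseRDD() wraps a TableInputFormat scan; each job's split calculation
            // (RegionSizeCalculator in the logs below) opens and closes its own
            // short-lived connection on the driver.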
            JavaRDD<Tuple2<ImmutableBytesWritable, Result>> scanRDD = hBaseContext
                    .hbaseRDD(hBaseConn.getTable(TableName.valueOf("FFUnits")).getName(), s);   

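            // foreachPartition supplies each partition with the results iterator plus a
            // Connection (t._2()) that the connector opens for the partition and closes afterwards.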
            hBaseContext.foreachPartition(scanRDD,  new VoidFunction<Tuple2<Iterator<Tuple2<ImmutableBytesWritable,Result>>, Connection>>(){
                private static final long serialVersionUID = 1L;
                public void call(Tuple2<Iterator<Tuple2<ImmutableBytesWritable,Result>>, Connection> t) throws Exception{
                    while (t._1().hasNext())
                        System.out.println("\tCurrent row: " + new String(t._1().next()._1.get()));
                }});
        }
    }
}

And here's the output from the Spark logs. This block repeats five times, once per iteration of the loop:

18/03/26 15:51:56 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16261d615db0c5f
18/03/26 15:51:56 INFO zookeeper.ZooKeeper: Session: 0x16261d615db0c5f closed
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: EventThread shut down
18/03/26 15:51:56 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 2044 bytes result sent to driver
18/03/26 15:51:56 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 300 ms on localhost (1/1)
18/03/26 15:51:56 INFO scheduler.DAGScheduler: ResultStage 3 (foreachPartition at HBaseContext.scala:98) finished in 0.301 s
18/03/26 15:51:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
18/03/26 15:51:56 INFO scheduler.DAGScheduler: Job 3 finished: foreachPartition at HBaseContext.scala:98, took 0.311925 s
18/03/26 15:51:56 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 266.5 KB, free 1391.1 KB)
18/03/26 15:51:56 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 20.7 KB, free 1411.8 KB)
18/03/26 15:51:56 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:57171 (size: 20.7 KB, free: 457.8 MB)
18/03/26 15:51:56 INFO spark.SparkContext: Created broadcast 9 from NewHadoopRDD at NewHBaseRDD.scala:25
18/03/26 15:51:56 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0xc412556 connecting to ZooKeeper ensemble=hbase-3:2181
18/03/26 15:51:56 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hbase-3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@6f930e0
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Opening socket connection to server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181. Will not attempt to authenticate using SASL (unknown error)
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Socket connection established to 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, initiating session
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Session establishment complete on server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, sessionid = 0x16261d615db0c60, negotiated timeout = 90000
18/03/26 15:51:56 INFO util.RegionSizeCalculator: Calculating region sizes for table "FFUnits".
18/03/26 15:51:57 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
18/03/26 15:51:57 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16261d615db0c60
18/03/26 15:51:57 INFO zookeeper.ZooKeeper: Session: 0x16261d615db0c60 closed
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: EventThread shut down
18/03/26 15:51:57 INFO spark.SparkContext: Starting job: foreachPartition at HBaseContext.scala:98
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Got job 4 (foreachPartition at HBaseContext.scala:98) with 1 output partitions
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (foreachPartition at HBaseContext.scala:98)
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Missing parents: List()
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[9] at map at HBaseContext.scala:427), which has no missing parents
18/03/26 15:51:57 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 2.9 KB, free 1414.7 KB)
18/03/26 15:51:57 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 1719.0 B, free 1416.4 KB)
18/03/26 15:51:57 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:57171 (size: 1719.0 B, free: 457.8 MB)
18/03/26 15:51:57 INFO spark.SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:1006
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[9] at map at HBaseContext.scala:427)
18/03/26 15:51:57 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
18/03/26 15:51:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,ANY, 2611 bytes)
18/03/26 15:51:57 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
18/03/26 15:51:57 INFO spark.NewHBaseRDD: Input split: HBase table split(table name: FFUnits, scan: GiJmZmVjOTM5ZC1iYjIxLTQ1MjUtYjFmZi1mMzE0M2ZhYWUyKqECCilvcmcuYXBhY2hlLmhhZG9v
cC5oYmFzZS5maWx0ZXIuRmlsdGVyTGlzdBLzAQgBEjIKLG9yZy5hcGFjaGUuaGFkb29wLmhiYXNl
LmZpbHRlci5LZXlPbmx5RmlsdGVyEgIIABI1CjFvcmcuYXBhY2hlLmhhZG9vcC5oYmFzZS5maWx0
ZXIuRmlyc3RLZXlPbmx5RmlsdGVyEgASLwopb3JnLmFwYWNoZS5oYWRvb3AuaGJhc2UuZmlsdGVy
LlBhZ2VGaWx0ZXISAggFElMKK29yZy5hcGFjaGUuaGFkb29wLmhiYXNlLmZpbHRlci5QcmVmaXhG
aWx0ZXISJAoiZmZlYzkzOWQtYmIyMS00NTI1LWIxZmYtZjMxNDNmYWFlMjgBQAGIAQU=, start row: ffec939d-bb21-4525-b1ff-f3143faae2, end row: , region location: 144.240.189.35.bc.googleusercontent.com, encoded region name: 2bce3b6bf780755d19fc4b610b17cf11)
18/03/26 15:51:57 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x46ac4a0 connecting to ZooKeeper ensemble=hbase-3:2181
18/03/26 15:51:57 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hbase-3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@5a8a2d2
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Opening socket connection to server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181. Will not attempt to authenticate using SASL (unknown error)
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Socket connection established to 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, initiating session
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Session establishment complete on server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, sessionid = 0x16261d615db0c61, negotiated timeout = 90000
18/03/26 15:51:57 INFO mapreduce.TableInputFormatBase: Input split length: 4 M bytes.
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*0049424a-5cea-46cb-a6b0-7c50d6465588
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*0082054a-b86a-4263-9753-025c1b0607be
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*00e21835-5dc6-4d82-8b8c-a4dcae4f14cd
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*01129620-a599-4fb7-9e2f-3492df1d06a3
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*035b3450-e523-4df6-a24f-11ebb29050f7

My hbase-site.xml file looks like this:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hbase-3</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>timeout</name>
    <value>5000</value>
  </property>
</configuration>

I am using the following versions:

Spark 1.6.2
HBase 1.3.1
Spark-HBase 1.2.0-cdh5.14.0

Thanks for any help and advice!


Answer 1:


This is a common problem. The cost of creating a connection can dwarf the actual work you're doing.

In Cloud Bigtable, you can set google.bigtable.use.cached.data.channel.pool to true in your configuration settings, which would significantly improve performance, since Cloud Bigtable ultimately uses a single HTTP/2 endpoint for all of your Cloud Bigtable instances.
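For example, a minimal sketch of setting that flag programmatically (it can equally be placed in hbase-site.xml); the surrounding setup mirrors the question's code:

Configuration conf = HBaseConfiguration.create();
// Share one cached channel pool across connections to Cloud Bigtable.
conf.set("google.bigtable.use.cached.data.channel.pool", "true");
Connection conn = ConnectionFactory.createConnection(conf);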

I don't know of a similar construct in HBase, but one way to do this would be to create an implementation of Connection that maintains a single cached Connection under the covers. You would then set hbase.client.connection.impl to your new class.
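Here is a minimal, untested sketch of that idea for HBase 1.x. It assumes the (Configuration, ExecutorService, User) constructor that ConnectionFactory uses reflectively to instantiate the class named by hbase.client.connection.impl; the class name CachedConnection and all of its details are illustrative, not a drop-in implementation:

import java.io.IOException;
import java.util.concurrent.ExecutorService;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.security.User;

public class CachedConnection implements Connection
{
    // One real connection per JVM; created on first use and deliberately never
    // closed by callers, so each executor keeps reusing it across tasks.
    private static volatile Connection delegate;

    public CachedConnection(Configuration conf, ExecutorService pool, User user) throws IOException
    {
        synchronized (CachedConnection.class)
        {
            if (delegate == null)
            {
                // Strip the impl override so this call builds the default
                // connection instead of recursing back into this class.
                Configuration inner = HBaseConfiguration.create(conf);
                inner.unset("hbase.client.connection.impl");
                delegate = ConnectionFactory.createConnection(inner, pool, user);
            }
        }
    }

    public Configuration getConfiguration() { return delegate.getConfiguration(); }
    public Table getTable(TableName tableName) throws IOException { return delegate.getTable(tableName); }
    public Table getTable(TableName tableName, ExecutorService pool) throws IOException { return delegate.getTable(tableName, pool); }
    public BufferedMutator getBufferedMutator(TableName tableName) throws IOException { return delegate.getBufferedMutator(tableName); }
    public BufferedMutator getBufferedMutator(BufferedMutatorParams params) throws IOException { return delegate.getBufferedMutator(params); }
    public RegionLocator getRegionLocator(TableName tableName) throws IOException { return delegate.getRegionLocator(tableName); }
    public Admin getAdmin() throws IOException { return delegate.getAdmin(); }
    public void close() { /* no-op: keep the shared connection alive for the JVM's lifetime */ }
    public boolean isClosed() { return delegate.isClosed(); }
    public void abort(String why, Throwable e) { delegate.abort(why, e); }
    public boolean isAborted() { return delegate.isAborted(); }
}

You would then register it before creating the JavaHBaseContext, e.g. hBaseConf.set("hbase.client.connection.impl", CachedConnection.class.getName()), so that both the driver-side split calculation and the per-partition connections on the executors go through the cached instance.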



Source: https://stackoverflow.com/questions/49494483/hbase-spark-connector-connection-to-hbase-established-for-every-scan
