I am using Cloudera's HBase-Spark connector to do intensive HBase or BigTable scans. It works OK, but looking at Spark's detailed logs, it looks like the code tries to re-establish a connection to HBase with every call to process the results of a Scan(), which I do via JavaHBaseContext.foreachPartition().
Am I right to think that this code re-establishes a connection to HBase every time? If so, how can I re-write it to make sure I reuse the already established connection?
Here's the full sample code that produces this behavior:
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;
import java.util.Iterator;
public class Main
{
    public static void main(String args[]) throws Exception
    {
        SparkConf sc = new SparkConf().setAppName(Main.class.toString()).setMaster("local");
        Configuration hBaseConf = HBaseConfiguration.create();
        Connection hBaseConn = ConnectionFactory.createConnection(hBaseConf);
        JavaSparkContext jSPContext = new JavaSparkContext(sc);
        JavaHBaseContext hBaseContext = new JavaHBaseContext(jSPContext, hBaseConf);
        int numTries = 5;
        byte rowKey[] = "ffec939d-bb21-4525-b1ff-f3143faae2".getBytes();
        for(int i = 0; i < numTries; i++)
        {
            Scan s = new Scan(rowKey);
            FilterList fList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
            fList.addFilter(new KeyOnlyFilter());
            fList.addFilter(new FirstKeyOnlyFilter());
            fList.addFilter(new PageFilter(5));
            fList.addFilter(new PrefixFilter(rowKey));
            s.setFilter(fList);
            s.setCaching(5);
            JavaRDD<Tuple2<ImmutableBytesWritable, Result>> scanRDD = hBaseContext
                    .hbaseRDD(hBaseConn.getTable(TableName.valueOf("FFUnits")).getName(), s);
            hBaseContext.foreachPartition(scanRDD, new VoidFunction<Tuple2<Iterator<Tuple2<ImmutableBytesWritable,Result>>, Connection>>(){
                private static final long serialVersionUID = 1L;
                public void call(Tuple2<Iterator<Tuple2<ImmutableBytesWritable,Result>>, Connection> t) throws Exception{
                    while (t._1().hasNext())
                        System.out.println("\tCurrent row: " + new String(t._1().next()._1.get()));
                }});
        }
    }
}
And here's the output from the Spark logs. This output repeats 5 times, once for each iteration of the loop:
18/03/26 15:51:56 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16261d615db0c5f
18/03/26 15:51:56 INFO zookeeper.ZooKeeper: Session: 0x16261d615db0c5f closed
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: EventThread shut down
18/03/26 15:51:56 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 2044 bytes result sent to driver
18/03/26 15:51:56 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 300 ms on localhost (1/1)
18/03/26 15:51:56 INFO scheduler.DAGScheduler: ResultStage 3 (foreachPartition at HBaseContext.scala:98) finished in 0.301 s
18/03/26 15:51:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
18/03/26 15:51:56 INFO scheduler.DAGScheduler: Job 3 finished: foreachPartition at HBaseContext.scala:98, took 0.311925 s
18/03/26 15:51:56 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 266.5 KB, free 1391.1 KB)
18/03/26 15:51:56 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 20.7 KB, free 1411.8 KB)
18/03/26 15:51:56 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:57171 (size: 20.7 KB, free: 457.8 MB)
18/03/26 15:51:56 INFO spark.SparkContext: Created broadcast 9 from NewHadoopRDD at NewHBaseRDD.scala:25
18/03/26 15:51:56 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0xc412556 connecting to ZooKeeper ensemble=hbase-3:2181
18/03/26 15:51:56 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hbase-3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@6f930e0
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Opening socket connection to server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181. Will not attempt to authenticate using SASL (unknown error)
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Socket connection established to 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, initiating session
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Session establishment complete on server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, sessionid = 0x16261d615db0c60, negotiated timeout = 90000
18/03/26 15:51:56 INFO util.RegionSizeCalculator: Calculating region sizes for table "FFUnits".
18/03/26 15:51:57 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
18/03/26 15:51:57 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16261d615db0c60
18/03/26 15:51:57 INFO zookeeper.ZooKeeper: Session: 0x16261d615db0c60 closed
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: EventThread shut down
18/03/26 15:51:57 INFO spark.SparkContext: Starting job: foreachPartition at HBaseContext.scala:98
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Got job 4 (foreachPartition at HBaseContext.scala:98) with 1 output partitions
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (foreachPartition at HBaseContext.scala:98)
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Missing parents: List()
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[9] at map at HBaseContext.scala:427), which has no missing parents
18/03/26 15:51:57 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 2.9 KB, free 1414.7 KB)
18/03/26 15:51:57 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 1719.0 B, free 1416.4 KB)
18/03/26 15:51:57 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:57171 (size: 1719.0 B, free: 457.8 MB)
18/03/26 15:51:57 INFO spark.SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:1006
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[9] at map at HBaseContext.scala:427)
18/03/26 15:51:57 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
18/03/26 15:51:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,ANY, 2611 bytes)
18/03/26 15:51:57 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
18/03/26 15:51:57 INFO spark.NewHBaseRDD: Input split: HBase table split(table name: FFUnits, scan: GiJmZmVjOTM5ZC1iYjIxLTQ1MjUtYjFmZi1mMzE0M2ZhYWUyKqECCilvcmcuYXBhY2hlLmhhZG9v
cC5oYmFzZS5maWx0ZXIuRmlsdGVyTGlzdBLzAQgBEjIKLG9yZy5hcGFjaGUuaGFkb29wLmhiYXNl
LmZpbHRlci5LZXlPbmx5RmlsdGVyEgIIABI1CjFvcmcuYXBhY2hlLmhhZG9vcC5oYmFzZS5maWx0
ZXIuRmlyc3RLZXlPbmx5RmlsdGVyEgASLwopb3JnLmFwYWNoZS5oYWRvb3AuaGJhc2UuZmlsdGVy
LlBhZ2VGaWx0ZXISAggFElMKK29yZy5hcGFjaGUuaGFkb29wLmhiYXNlLmZpbHRlci5QcmVmaXhG
aWx0ZXISJAoiZmZlYzkzOWQtYmIyMS00NTI1LWIxZmYtZjMxNDNmYWFlMjgBQAGIAQU=, start row: ffec939d-bb21-4525-b1ff-f3143faae2, end row: , region location: 144.240.189.35.bc.googleusercontent.com, encoded region name: 2bce3b6bf780755d19fc4b610b17cf11)
18/03/26 15:51:57 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x46ac4a0 connecting to ZooKeeper ensemble=hbase-3:2181
18/03/26 15:51:57 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hbase-3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@5a8a2d2
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Opening socket connection to server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181. Will not attempt to authenticate using SASL (unknown error)
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Socket connection established to 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, initiating session
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Session establishment complete on server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, sessionid = 0x16261d615db0c61, negotiated timeout = 90000
18/03/26 15:51:57 INFO mapreduce.TableInputFormatBase: Input split length: 4 M bytes.
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*0049424a-5cea-46cb-a6b0-7c50d6465588
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*0082054a-b86a-4263-9753-025c1b0607be
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*00e21835-5dc6-4d82-8b8c-a4dcae4f14cd
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*01129620-a599-4fb7-9e2f-3492df1d06a3
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*035b3450-e523-4df6-a24f-11ebb29050f7
My hbase-site.xml file looks like this:
<configuration>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hbase-3</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
    <property>
        <name>timeout</name>
        <value>5000</value>
    </property>
</configuration>
I am using the following versions:
Spark v 1.6.2
HBase 1.3.1
Spark-HBase 1.2.0-cdh5.14.0
Thanks for any help and advice!
This is a common problem. The cost of creating a connection can dwarf the actual work you're doing.
In Cloud Bigtable, you can set google.bigtable.use.cached.data.channel.pool to true in your configuration settings. That would significantly improve performance. Cloud Bigtable ultimately uses a single HTTP/2 endpoint for all of your Cloud Bigtable instances.
I don't know of a similar construct in HBase, but one way to do this would be to create an implementation of Connection that creates a single cached Connection under the covers. You would have to set hbase.client.connection.impl to your new class.
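To illustrate the caching idea only, here is a minimal, self-contained sketch of a lazily created, shared instance. It is not an HBase Connection implementation: CachedResource is a hypothetical name, and the code deliberately uses no HBase classes. In a real wrapper, the supplier would presumably call ConnectionFactory.createConnection(conf), and the class registered under hbase.client.connection.impl would delegate every Connection method to the cached instance and likely treat close() as a no-op. Both of those parts are assumptions and are not shown here.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of HBase): creates the underlying
 * resource at most once and hands the same instance to every caller.
 */
final class CachedResource<T> {
    private final Supplier<T> factory; // how to build the expensive resource
    private volatile T instance;       // the single cached instance, once built

    CachedResource(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the cached instance, creating it on first use (thread-safe). */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    // Only the first caller pays the creation cost.
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Note that a cache like this lives in one JVM, so on a cluster each Spark executor would end up with its own cached instance rather than one shared across the whole job.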
Source: https://stackoverflow.com/questions/49494483/hbase-spark-connector-connection-to-hbase-established-for-every-scan