I am using Cloudera's HBase-Spark connector to do intensive HBase or BigTable scans. It works OK, but looking at Spark's detailed logs, it looks like the code tries to re-esta
This is a common problem. The cost of creating a connection can dwarf the actual work you're doing.
In Cloud Bigtable, you can set google.bigtable.use.cached.data.channel.pool to true in your configuration settings. That would significantly improve performance. Cloud Bigtable ultimately uses a single HTTP/2 endpoint for all of your Cloud Bigtable instances.
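As a rough illustration of the setting above, here is a minimal sketch in Java. It assumes the Hadoop and HBase client libraries are on the classpath; the class name CachedChannelPoolConfig is made up for this example. The same property could presumably go into hbase-site.xml instead.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Hypothetical helper class, named only for this example.
public class CachedChannelPoolConfig {
    public static Configuration create() {
        // Loads hbase-default.xml and hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        // Property name taken from the answer above.
        conf.setBoolean("google.bigtable.use.cached.data.channel.pool", true);
        return conf;
    }
}
```

For the setting to have any effect, this would need to be the same Configuration object that the Spark job hands to the connector.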
I don't know of a similar construct in HBase, but one way to do this would be to create an implementation of Connection that delegates to a single cached Connection under the covers. You would have to set hbase.client.connection.impl to your new class.
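To make the caching idea concrete, here is a self-contained sketch of the delegation pattern in Java. It deliberately does not use the real HBase API: Conn is a hypothetical stand-in for HBase's Connection interface, and ExpensiveConn simulates a costly physical connection. All names are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedConnectionSketch {
    /** Hypothetical stand-in for HBase's Connection interface. */
    interface Conn extends AutoCloseable {
        String scan(String table);
        @Override
        void close();
    }

    /** Simulates an expensive physical connection and counts how many were opened. */
    static final class ExpensiveConn implements Conn {
        static final AtomicInteger OPENED = new AtomicInteger();

        ExpensiveConn() {
            OPENED.incrementAndGet();
        }

        @Override
        public String scan(String table) {
            return "rows from " + table;
        }

        @Override
        public void close() {
            // A real connection would release sockets and threads here.
        }
    }

    /** Every instance delegates to one lazily created, shared connection. */
    static final class CachedConn implements Conn {
        private static Conn shared;

        private static synchronized Conn shared() {
            if (shared == null) {
                shared = new ExpensiveConn();
            }
            return shared;
        }

        @Override
        public String scan(String table) {
            return shared().scan(table);
        }

        @Override
        public void close() {
            // Intentionally a no-op so the shared connection survives for the next caller.
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            try (Conn c = new CachedConn()) {
                System.out.println(c.scan("t" + i));
            }
        }
        System.out.println("physical connections opened: " + ExpensiveConn.OPENED.get());
    }
}
```

A real version would implement the HBase Connection interface itself and forward every method to the cached instance. It would also need whatever constructor ConnectionFactory expects when it instantiates the class named in hbase.client.connection.impl, so check that against your HBase version. Because close() is a no-op here, the shared connection is never released; a production version would likely want reference counting or a shutdown hook.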