HBase-Spark Connector: connection to HBase established for every scan?

情话喂你 2021-01-26 21:08

I am using Cloudera's HBase-Spark connector to do intensive HBase or BigTable scans. It works OK, but looking at Spark's detailed logs, it looks like the code tries to re-esta

1 answer
  •  不知归路
    2021-01-26 21:11

    This is a common problem. The cost of creating a connection can dwarf the actual work you're doing.

    In Cloud Bigtable, you can set google.bigtable.use.cached.data.channel.pool to true in your configuration settings. That should significantly improve performance, because Cloud Bigtable ultimately uses a single HTTP/2 endpoint for all of your Cloud Bigtable instances.
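As a rough illustration of the setting above, here is how it might look as a Hadoop-style property. This assumes your job picks up its client configuration from an hbase-site.xml on the classpath; if you build the Configuration in code instead, the same key and value apply.

```xml
<!-- Sketch only: assumes the client configuration is loaded from hbase-site.xml -->
<property>
  <name>google.bigtable.use.cached.data.channel.pool</name>
  <value>true</value>
</property>
```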

    I don't know of a similar construct in HBase, but one way to do this would be to create an implementation of Connection that wraps a single cached Connection under the covers. You would have to set hbase.client.connection.impl to your new class.
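To make the cached-connection idea concrete, here is a minimal sketch of the pattern using only the Java standard library. Everything in it is hypothetical: FakeConnection stands in for the real (expensive) HBase connection, and CachedConnection stands in for the class you would register under hbase.client.connection.impl. A real version would implement org.apache.hadoop.hbase.client.Connection and delegate every method, so treat this as an outline of the approach, not a drop-in class.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CachedConnectionSketch {

    /** Hypothetical stand-in for an expensive connection; NOT an HBase type. */
    static final class FakeConnection implements AutoCloseable {
        static final AtomicInteger CREATED = new AtomicInteger();
        private volatile boolean closed = false;

        FakeConnection() {
            CREATED.incrementAndGet(); // count how many real connections get built
        }

        String scan(String table) {
            if (closed) {
                throw new IllegalStateException("connection is closed");
            }
            return "rows from " + table;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Lazily creates one shared instance per JVM and keeps returning it. */
    static final class SharedHolder<T> {
        private final Supplier<T> factory;
        private T instance;

        SharedHolder(Supplier<T> factory) {
            this.factory = factory;
        }

        synchronized T get() {
            if (instance == null) {
                instance = factory.get();
            }
            return instance;
        }
    }

    static final SharedHolder<FakeConnection> SHARED =
            new SharedHolder<>(FakeConnection::new);

    /** Cheap handle given to callers; all handles share one underlying connection. */
    static final class CachedConnection implements AutoCloseable {
        private final FakeConnection delegate = SHARED.get();

        String scan(String table) {
            return delegate.scan(table);
        }

        @Override
        public void close() {
            // Deliberately a no-op: the shared connection outlives this handle.
        }
    }

    public static void main(String[] args) {
        // Simulate three scans, each opening and closing "a connection".
        for (int i = 0; i < 3; i++) {
            try (CachedConnection c = new CachedConnection()) {
                System.out.println(c.scan("t" + i));
            }
        }
        System.out.println("underlying connections created: "
                + FakeConnection.CREATED.get()); // prints 1
    }
}
```

In a real wrapper the delegate would presumably come from ConnectionFactory.createConnection, built from a configuration that does not point hbase.client.connection.impl back at the wrapper itself, otherwise creation would recurse. Making close() a no-op also means the shared connection lives until the JVM exits, which may or may not suit your executors.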
