So suppose there is a table in the database like the one below:

Key  | DateTimeAge
AAA1 | XXX XXX XXX
AAA2 | XXX XXX XXX
AAA3 | XXX XXX XXX
AAA4 | XXX XXX XXX
The difficult part is actually setting up the HBase connector, either Hortonworks' or Huawei's.
But since I think you are asking about the query itself, I quickly built a toy example using Hive (i.e. creating the HBase table in the HBase shell and then adding a matching CREATE EXTERNAL TABLE in Hive).
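For completeness, that setup could look roughly like the sketch below. The table and column names match the toy example; the column family name `cf` and the exact mapping are assumptions, not something from the original setup:

```sql
-- In the HBase shell (assumed column family name: cf):
--   create 'hbase_table_1', 'cf'
--   put 'hbase_table_1', 'AAA1', 'cf:column_1', 'abcd'

-- In Hive: map the existing HBase table as an external table
CREATE EXTERNAL TABLE hbase_table_1 (key string, column_1 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:column_1")
TBLPROPERTIES ("hbase.table.name" = "hbase_table_1");
```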
Then I create a SQL context using the Hive context:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
The full toy table has 3 rows:
df = sqlContext.sql("select * from hbase_table_1")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1| abcd|
|AAA2| efgh|
|BBB1| jklm|
+----+--------+
and to access a subset of the HBase rowkeys:
df = sqlContext.sql("select * from hbase_table_1 where key >= 'AAA' and key < 'BBB'")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1| abcd|
|AAA2| efgh|
+----+--------+
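This works because HBase stores rows sorted lexicographically by rowkey, so `key >= 'AAA' and key < 'BBB'` selects exactly the keys in that interval. A plain-Python sketch of the same filtering logic, using the toy rows above:

```python
# Toy rows from the example above, keyed by HBase rowkey
rows = {"AAA1": "abcd", "AAA2": "efgh", "BBB1": "jklm"}

# Lexicographic range filter with the same semantics as the SQL predicate
subset = {k: v for k, v in rows.items() if "AAA" <= k < "BBB"}
print(subset)
```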
For performance you should definitely go for one of the HBase connectors, but once you have one set up (at least Hortonworks') the query should be the same.
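One small caveat: `'BBB'` happens to work as the upper bound for this data, but the tight exclusive upper bound for a prefix scan on `'AAA'` is `'AAB'`, obtained by incrementing the last character of the prefix. A hypothetical helper for that, assuming plain ASCII rowkeys whose last character is not the maximum code point:

```python
def prefix_stop_key(prefix: str) -> str:
    """Exclusive upper bound for a lexicographic prefix scan.

    Increments the last character of the prefix; assumes ASCII
    rowkeys whose last character is not the maximum code point.
    """
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

print(prefix_stop_key("AAA"))  # AAB
```

Using that bound, the query would become `where key >= 'AAA' and key < 'AAB'`, which also matches keys like `'AAZ9'` that `< 'BBB'` would let through.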