I am trying to learn Spark + Scala. I want to read from HBase, but without MapReduce. I created a simple HBase table, "test", and did 3 puts in it. I want to read it via s
Your problem is that table is not serializable (or rather, its member conf is not), and you're trying to serialize it by using it inside a map. The way you're trying to read HBase isn't quite correct: it looks like you're issuing some specific Gets and then trying to run them in parallel. Even if you did get this working, it wouldn't scale, since you would be performing random reads. What you want to do instead is perform a table scan using Spark. Here is a code snippet that should help you do it:
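import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, tableName) // tableName: name of the table to scan
// sc is the SparkContext; each record is a (row key, Result) pair
sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])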
This will give you an RDD of (ImmutableBytesWritable, Result) pairs, where each Result holds the NavigableMaps that constitute a row. Below is how you can change the NavigableMap to a normal Scala map of Strings:
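import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.util.Bytes

...
  .map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))    // (row key bytes, nested Scala maps)
  .map(kv => (Bytes.toString(kv._1), rowToStrMap(kv._2))) // decode row key and cells to Strings

// Convert the nested java.util.NavigableMaps (family -> qualifier -> timestamp -> value)
// into immutable Scala maps.
def navMapToMap(navMap: HBaseRow): CFTimeseriesRow =
  navMap.asScala.toMap.map(cf =>
    (cf._1, cf._2.asScala.toMap.map(col =>
      (col._1, col._2.asScala.toMap.map(elem => (elem._1.toLong, elem._2))))))

// Decode family names, qualifiers, and cell values from bytes to Strings.
def rowToStrMap(navMap: CFTimeseriesRow): CFTimeseriesRowStr =
  navMap.map(cf =>
    (Bytes.toString(cf._1), cf._2.map(col =>
      (Bytes.toString(col._1), col._2.map(elem => (elem._1, Bytes.toString(elem._2)))))))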
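The snippet above uses the type names HBaseRow, CFTimeseriesRow and CFTimeseriesRowStr without showing their definitions. As a rough sketch, they can be inferred from the shape that Result.getMap returns (family -> qualifier -> timestamp -> value). The version below is an illustration only: it converts straight to Strings in one pass, the object name RowConversion and the helper str are made up for this example, and it decodes bytes as UTF-8 with the JDK instead of calling Bytes.toString so that it runs without HBase on the classpath. On Scala 2.13+ you would likely import scala.jdk.CollectionConverters._ instead of JavaConverters.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.NavigableMap
import scala.collection.JavaConverters._

object RowConversion {
  // Shape of Result.getMap: family -> qualifier -> timestamp -> value
  type HBaseRow =
    NavigableMap[Array[Byte], NavigableMap[Array[Byte], NavigableMap[java.lang.Long, Array[Byte]]]]

  // Same nesting with every byte array decoded to a String
  type CFTimeseriesRowStr = Map[String, Map[String, Map[Long, String]]]

  // Assumption: stand-in for Bytes.toString, decoding as UTF-8
  private def str(bytes: Array[Byte]): String = new String(bytes, UTF_8)

  def rowToStrMap(row: HBaseRow): CFTimeseriesRowStr =
    row.asScala.toMap.map { case (family, columns) =>
      str(family) -> columns.asScala.toMap.map { case (qualifier, versions) =>
        str(qualifier) -> versions.asScala.toMap.map { case (ts, value) =>
          (ts.longValue, str(value))
        }
      }
    }
}
```

With a real scan this would be applied as something like `.map(kv => (Bytes.toString(kv._1.get()), RowConversion.rowToStrMap(kv._2.getMap)))`.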
Final point: if you really do want to try to perform random reads in parallel, I believe you might be able to put the HBase table initialization inside the map.
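One way this last suggestion might look is sketched below. It is untested against a live cluster and assumes the HBase 1.0+ client API (ConnectionFactory, Connection.getTable); the function name fetchExisting and its return type are invented for illustration. It uses mapPartitions rather than map so that one connection is opened per partition instead of one per row, and so that nothing non-serializable is captured by the closure.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Hypothetical helper: for each row key, report whether the row exists.
def fetchExisting(rowKeys: RDD[String], tableName: String): RDD[(String, Boolean)] =
  rowKeys.mapPartitions { keys =>
    // Everything HBase-related is created on the executor, so only the
    // (serializable) tableName String travels with the closure.
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf(tableName))
    try {
      // Materialize the results before the connection is closed,
      // because the iterator is otherwise evaluated lazily.
      keys.map { key =>
        val result = table.get(new Get(Bytes.toBytes(key)))
        (key, !result.isEmpty)
      }.toList.iterator
    } finally {
      table.close()
      connection.close()
    }
  }
```

This still issues one random read per key, so the scalability caveat above applies.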
What happens when you do this?
@transient val conf = new HBaseConfiguration
UPDATE: Apparently there are other parts of the submitted HBase task that are also not serializable. Each of these will need to be addressed.
Consider whether the entity will have the same meaning/semantics on both sides of the wire. Connections certainly will not, and the HBaseConfiguration should not be serialized. Primitives, and simple objects built atop primitives that contain no context-sensitive data, are fine to include in the serialization.
For context-sensitive entities, including the HBaseConfiguration and any connection-oriented data structures, you should mark them @transient and then instantiate them in the readObject() method with values relevant to the client environment.
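As a minimal sketch of that pattern, the example below uses plain Java serialization and a made-up, deliberately non-serializable class LocalConfig standing in for HBaseConfiguration, so it runs without HBase. RowReader, configOrigin and SerializationDemo are likewise hypothetical names. The @transient field is skipped when the object is written, and readObject rebuilds it on the receiving side.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a context-sensitive object such as
// HBaseConfiguration: it is deliberately NOT Serializable.
final class LocalConfig(val createdBy: String)

class RowReader(val tableName: String) extends Serializable {
  // Skipped by Java serialization because it is marked @transient.
  @transient private var conf: LocalConfig = new LocalConfig("constructor")

  def configOrigin: String = conf.createdBy

  // Invoked by Java serialization when the object is deserialized.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()               // restores tableName
    conf = new LocalConfig("readObject") // rebuilt for the local environment
  }
}

object SerializationDemo {
  // Serialize to a byte array and read the object back.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Calling `SerializationDemo.roundTrip(new RowReader("test"))` yields a copy whose tableName is preserved and whose conf was re-created locally. In Scala, a `@transient lazy val` is a commonly used alternative: the field is not serialized and is recomputed on first access after deserialization, whereas a plain `@transient val` is simply null on the receiving side.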