Integrate key-value database with Spark

Submitted by 馋奶兔 on 2019-12-10 12:44:48

Question


I'm having trouble understanding how Spark interacts with storage.

I would like to build a Spark cluster that fetches data from a RocksDB database (or any other key-value store). However, at the moment, the best I can do is fetch the whole dataset from the database into memory on each of the cluster nodes (into a map, for example) and build an RDD from that object.

What do I have to do to fetch only the necessary data (like Spark does with HDFS)? I've read about Hadoop's InputFormat and RecordReader, but I'm not completely grasping what I should implement.

I know this is kind of a broad question, but I would really appreciate some help to get me started. Thank you in advance.


Answer 1:


Here is one possible solution. I assume you have a client library for the key-value store (RocksDB in your case) that you want to access.
KeyValuePair below stands for a bean class representing one key-value pair from your key-value store.

Classes

import java.util.Iterator;

import org.apache.spark.api.java.function.FlatMapFunction;

/* Lazy iterator to read from the key-value store */
class KeyValueIterator implements Iterator<KeyValuePair> {
    public KeyValueIterator() {
        // TODO: initialize your custom reader using the store's Java client library
    }

    @Override
    public boolean hasNext() {
        // TODO: return true while the underlying reader has more entries
        throw new UnsupportedOperationException("not implemented");
    }

    @Override
    public KeyValuePair next() {
        // TODO: read and return the next key-value pair
        throw new UnsupportedOperationException("not implemented");
    }
}

/* Runs on an executor: ignores the dummy input element and hands back the lazy iterator */
class KeyValueReader implements FlatMapFunction<KeyValuePair, KeyValuePair> {
    @Override
    public Iterator<KeyValuePair> call(KeyValuePair keyValuePair) throws Exception {
        // ignore the empty 'keyValuePair' object
        return new KeyValueIterator();
    }
}
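
For illustration, here is a minimal sketch of what such an iterator could look like against RocksDB's Java client (org.rocksdb). The database path, the read-only open, and the KeyValuePair(byte[], byte[]) constructor are assumptions; adapt them to your own setup, and note that closing the database and iterator is omitted for brevity.

import java.util.Iterator;

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

/* Sketch only: streams all entries of a RocksDB instance reachable from the executor */
class RocksDbKeyValueIterator implements Iterator<KeyValuePair> {
    private final RocksIterator it;

    public RocksDbKeyValueIterator(String dbPath) throws RocksDBException {
        RocksDB.loadLibrary();
        RocksDB db = RocksDB.openReadOnly(dbPath); // assumes the DB files exist on this node
        it = db.newIterator();
        it.seekToFirst();
    }

    @Override
    public boolean hasNext() {
        return it.isValid();
    }

    @Override
    public KeyValuePair next() {
        // assumes KeyValuePair has a (byte[] key, byte[] value) constructor
        KeyValuePair pair = new KeyValuePair(it.key(), it.value());
        it.next();
        return pair;
    }
}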

Create KeyValue RDD

import java.util.ArrayList;

import org.apache.spark.api.java.JavaRDD;

/* List with a single dummy KeyValuePair instance, only used to seed the RDD */
ArrayList<KeyValuePair> keyValuePairs = new ArrayList<>();
keyValuePairs.add(new KeyValuePair());
JavaRDD<KeyValuePair> keyValuePairRDD = javaSparkContext.parallelize(keyValuePairs);
/* Read one key-value pair at a time, lazily */
keyValuePairRDD = keyValuePairRDD.flatMap(new KeyValueReader());
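
Note that flatMap is lazy: nothing is read from the store until an action runs on the RDD. A minimal usage example:

// This action triggers the actual scan of the key-value store.
long total = keyValuePairRDD.count();
System.out.println("pairs read: " + total);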

Note:

The solution above creates an RDD with two partitions by default (one of them will be empty). Increase the number of partitions before applying any transformation on keyValuePairRDD to distribute the processing across executors. There are different ways to increase partitions:

keyValuePairRDD = keyValuePairRDD.repartition(partitionCount);
// OR, since partitionBy is only defined on pair RDDs, map to a JavaPairRDD first:
// pairRDD.partitionBy(...)
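
A sketch of the partitionBy route, assuming KeyValuePair exposes a getKey() accessor returning a String (both the accessor and the partition count of 16 are illustrative):

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, KeyValuePair> pairRDD = keyValuePairRDD
        .mapToPair(kv -> new Tuple2<>(kv.getKey(), kv)); // assumes a getKey() accessor
pairRDD = pairRDD.partitionBy(new HashPartitioner(16)); // partition count chosen for illustration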


Source: https://stackoverflow.com/questions/41064850/integrate-key-value-database-with-spark
