Question
From what little I understand of Cassandra, data locality is mostly transparent to the client application that accesses a node, as it should be.
However, what if I explicitly wanted to access only the data of a column family that is local to the node I'm connected to? Is such a thing possible? I haven't found a way of getting this from a client API out of the box. It seems I could get some of this information through the system tables, but I can't quite figure out how.
The idea is to perform mapreduce, but without using Hadoop. A local client would connect to its local cassandra node, perform aggregation on the local data and then pass it back upstream.
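To make the intended flow concrete, here is a minimal sketch of the two halves in plain Java, with no Cassandra or Hadoop calls. The class name `LocalAggregation`, the `{key, value}` row shape, and the choice of summing as the aggregation are all assumptions made for illustration; how the local rows are actually obtained is the open question.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalAggregation {
    // "Map" side: would run next to one node and sum values per key
    // over the rows that node holds. Each row is a hypothetical {key, value} pair.
    static Map<String, Long> aggregateLocal(List<String[]> localRows) {
        Map<String, Long> partial = new HashMap<>();
        for (String[] row : localRows) {
            partial.merge(row[0], Long.parseLong(row[1]), Long::sum);
        }
        return partial;
    }

    // "Reduce" side: would run upstream and combine the per-node partial sums.
    static Map<String, Long> mergeUpstream(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, value) -> total.merge(key, value, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for the local data of two nodes.
        List<String[]> nodeA = List.of(new String[]{"clicks", "3"}, new String[]{"views", "10"});
        List<String[]> nodeB = List.of(new String[]{"clicks", "4"});
        Map<String, Long> total =
                mergeUpstream(List.of(aggregateLocal(nodeA), aggregateLocal(nodeB)));
        System.out.println(total);
    }
}
```

With a replication factor above 1 the same row lives on several nodes, so each local client would need to restrict itself to rows its node is primarily responsible for, otherwise the upstream merge counts them more than once.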
Is such a thing possible at all? It looks possible, since I've seen evidence of Hadoop being able to use Cassandra, but the examples seem to be geared towards Hadoop rather than a generic client. The local client (the bit talking to Cassandra) would be in Java. I'm currently using Hector, but I'm unsure whether it provides any data locality information.
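One way to reason about "local" is through the token ring: a node with token T is primarily responsible for keys whose token falls in the range (previous node's token, T], wrapping around at the end of the ring. The sketch below checks that condition in plain Java. It assumes one token per node and a RandomPartitioner-style token (the absolute value of the key's MD5 digest); the class and method names are hypothetical, and the node tokens would have to be supplied from elsewhere, for example from the output of `nodetool ring`.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PrimaryRangeCheck {
    // Token for a row key, ASSUMING a RandomPartitioner-style scheme:
    // the absolute value of the key's MD5 digest read as a BigInteger.
    static BigInteger tokenFor(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True if token lies in (previousToken, nodeToken], allowing for the
    // range that wraps around the end of the ring.
    static boolean inPrimaryRange(BigInteger token, BigInteger previousToken, BigInteger nodeToken) {
        if (previousToken.compareTo(nodeToken) < 0) {
            return token.compareTo(previousToken) > 0 && token.compareTo(nodeToken) <= 0;
        }
        // Wrapping range (also covers a single-node ring, which owns everything).
        return token.compareTo(previousToken) > 0 || token.compareTo(nodeToken) <= 0;
    }

    public static void main(String[] args) {
        // Made-up tokens for illustration only.
        BigInteger previous = new BigInteger("100");
        BigInteger mine = new BigInteger("2").pow(126);
        BigInteger token = tokenFor("some-row-key");
        System.out.println("token=" + token + " local=" + inPrimaryRange(token, previous, mine));
    }
}
```

This only covers primary ownership; replicas of other nodes' ranges stored on the same node would not pass the check, which is usually what you want when aggregating so that no row is counted twice.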
Answer 1:
A recent article on the Netflix Techblog introduces Aegisthus, a project which reads the SSTables stored on disk across the cluster and merges them into a single, consistent view of the data (in MapReduce). I would imagine that the mechanics would then trivially exist for generating a view of the data on a single node.
Unfortunately, I don't think they've open-sourced this tool yet, so you won't be able to use it. For now, the most it offers is a hint that yes, it is possible to read SSTables natively using non-Cassandra code.
You may be able to hack something together using the Cassandra source that reads SSTables and have that feed the local client you're hoping to build. A great starting point would be looking at the source of org.apache.cassandra.tools.SSTableExport, which is used in the sstable2json tool.
Source: https://stackoverflow.com/questions/9262745/how-to-access-the-local-data-of-a-cassandra-node