Can someone explain how MapReduce works with Cassandra 0.6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "
The main advantage of using a direct InputFormat from Cassandra is that it streams the data efficiently. Each input split covers a range of tokens and is read off the disk at full bandwidth: no seeking, no complex querying. As far as I know it is not locality-aware, though: it does not make each tasktracker prefer input splits served by a Cassandra process on the same node.
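To make the "each input split covers a range of tokens" idea a bit more concrete, here is a rough Python sketch of how a token ring could be cut into contiguous ranges and then into smaller splits. This is only an illustration, not Cassandra's actual code: the function names `ring_ranges` and `subdivide` are made up, and the default ring size of 2**127 assumes a RandomPartitioner-style token space.

```python
def ring_ranges(node_tokens):
    """Return one (start, end] token range per node on the ring.

    Hypothetical helper: each node is assumed to own the tokens after its
    predecessor's token, up to and including its own token.
    """
    tokens = sorted(node_tokens)
    # tokens[i - 1] with i == 0 wraps to the last token, closing the ring.
    return [(tokens[i - 1], tokens[i]) for i in range(len(tokens))]


def subdivide(start, end, parts, ring_size=2**127):
    """Cut the range (start, end] into `parts` contiguous sub-ranges.

    ring_size is an assumption (RandomPartitioner-style token space).
    """
    # Modular arithmetic handles ranges that wrap past the top of the ring.
    width = (end - start) % ring_size
    if width == 0:
        # start == end is treated here as "the whole ring" (single node).
        width = ring_size
    bounds = [(start + width * k // parts) % ring_size for k in range(parts + 1)]
    # Consecutive boundaries form the sub-ranges, one per input split.
    return list(zip(bounds, bounds[1:]))
```

Each sub-range would then correspond to one input split that a map task scans sequentially, for example `subdivide(start, end, 4)` for every `(start, end)` returned by `ring_ranges`.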
You can try using Pig with the STREAM method as a hack until more direct Hadoop streaming support is in place.