How do I use Cassandra's MapReduce support, with or without Pig?

抹茶落季 2021-02-13 05:31

Can someone explain how MapReduce works with Cassandra 0.6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "

3 Answers
  • 2021-02-13 06:09

    The Cassandra InputFormat does know about locality: its input splits override getLocations() to preserve data locality, so Hadoop can schedule each task on a node that holds the data it reads.
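
To make the locality point concrete, here is a minimal sketch in Python of what a location-aware split gives a scheduler. It is a toy model, not Cassandra's or Hadoop's actual code: `TokenRangeSplit`, `pick_split`, and the host names are hypothetical, and only the idea of a split reporting its hosts (as Hadoop's `InputSplit.getLocations()` does) is taken from the answer above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TokenRangeSplit:
    """Toy stand-in for an input split (hypothetical, not a Cassandra class)."""
    start_token: int
    end_token: int
    replica_hosts: Tuple[str, ...]  # hosts assumed to store this token range

    def get_locations(self) -> List[str]:
        # Plays the role of InputSplit.getLocations() in Hadoop:
        # it tells the scheduler where the split's data lives.
        return list(self.replica_hosts)


def pick_split(worker_host: str,
               pending: List[TokenRangeSplit]) -> Optional[TokenRangeSplit]:
    """Prefer a split stored on worker_host; otherwise fall back to any split."""
    for split in pending:
        if worker_host in split.get_locations():
            return split
    return pending[0] if pending else None


splits = [
    TokenRangeSplit(0, 100, ("node-a", "node-b")),
    TokenRangeSplit(100, 200, ("node-b", "node-c")),
]
local = pick_split("node-c", splits)     # node-c holds the second range
fallback = pick_split("node-z", splits)  # no local data, takes the first split
```

Without the location list, a scheduler can only do the fallback branch, which is the situation the next answer describes.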

  • 2021-02-13 06:11

    The big win of using a direct InputFormat from Cassandra is that it streams the data efficiently. Each input split covers a range of tokens and is read off the disk at full bandwidth: no seeking, no complex querying. I don't think it knows about locality, though, in the sense of having each tasktracker prefer input splits served by a Cassandra process on the same node.
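
As a rough illustration of "each input split covers a range of tokens", the sketch below cuts a token ring into contiguous ranges. The function name `split_token_ring` and the ring sizes are assumptions made for the example; how Cassandra actually computes its splits is not shown here.

```python
from typing import List, Tuple


def split_token_ring(ring_size: int, num_splits: int) -> List[Tuple[int, int]]:
    """Cut [0, ring_size) into num_splits contiguous half-open ranges."""
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    step, remainder = divmod(ring_size, num_splits)
    ranges = []
    start = 0
    for i in range(num_splits):
        # Spread the remainder over the first ranges so no tokens are dropped.
        end = start + step + (1 if i < remainder else 0)
        ranges.append((start, end))
        start = end
    return ranges


small = split_token_ring(10, 3)
large = split_token_ring(2 ** 127, 4)  # illustrative ring size only
```

Each range can then be handed to one map task, which reads its tokens sequentially rather than issuing point queries.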

    You can try using Pig with its STREAM operator as a hack until more direct Hadoop streaming support is in place.

  • 2021-02-13 06:22

    From what I've heard (and from here), a developer writes a MapReduce program that uses Cassandra as the data source as follows. You write a regular MapReduce program (the example you linked to is for the pure-Java version), and the jars that are now available provide a custom InputFormat that lets the input source be Cassandra instead of the default, which is files in HDFS.

    If you're using Pycassa, I'd say you're out of luck until either (1) the maintainer of that project adds support for MapReduce or (2) you throw together some Python functions that write a Java MapReduce program and run it. The latter is definitely a bit of a hack, but it would get you up and running.
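
For the "run it" half of option (2), a minimal sketch might wrap the `hadoop jar` command line from Python. The jar name, class name, and job arguments below are placeholders, and the sketch assumes a `hadoop` executable on the PATH and a jar that has already been built; it does not generate the Java program.

```python
import subprocess
from typing import List


def build_hadoop_command(jar_path: str, main_class: str,
                         job_args: List[str]) -> List[str]:
    """Assemble the argument list for `hadoop jar <jar> <class> [args...]`."""
    return ["hadoop", "jar", jar_path, main_class, *job_args]


def run_job(jar_path: str, main_class: str, job_args: List[str]) -> int:
    """Launch the job and return its exit code (requires hadoop on the PATH)."""
    completed = subprocess.run(
        build_hadoop_command(jar_path, main_class, job_args), check=False
    )
    return completed.returncode


# Placeholder names; nothing is executed until run_job() is called.
cmd = build_hadoop_command("wordcount.jar", "WordCount", ["text0"])
```

Keeping command construction separate from execution makes the wrapper easy to check without a Hadoop installation.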
