Does anyone find Cascading for Hadoop MapReduce useful?

天涯浪人 2021-01-30 00:28

I've been trying Cascading, but I cannot see any advantage over the classic MapReduce approach for writing jobs.

MapReduce jobs give me more freedom, and Cascading seems to be putting more obstacles in the way.

8 Answers
  • 2021-01-30 01:08

    I've been using Cascading for a couple of years now. I find it to be extremely helpful. Ultimately, it's about productivity gains. I can be much more efficient in creating and maintaining M/R jobs as compared to plain Java code. Here are a few reasons why:

    • A lot of the boilerplate code used to start a job is already written for you.
    • Composability. Generally code is easier to read and easier to reuse when it is written as components (operations) which are stitched together to perform some more complex processing.
    • I find unit testing to be easier. There are examples in the cascading package demonstrating how to write simple unit tests to directly test the output of flows.
    • The Tap (source and sink) paradigm makes it easy to change the input and output of a job, so you can, for example, start with output to STDOUT for development and debugging, then switch to HDFS SequenceFiles for batch jobs, and then switch to an HBase tap for pseudo-real-time updates.
    • Another great advantage of writing Cascading jobs is that you're really writing more of a factory that creates jobs. This can be a huge advantage when you need to build something dynamically (e.g., the results of one job control what subsequent jobs you create and run). In another case, I needed to create a job for each combination of 6 binary variables: that's 64 very similar jobs, which would be a hassle with plain Hadoop MapReduce classes. A small sketch of this factory style (and of swapping taps) follows this list.
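
    To make the tap and "job factory" points concrete, here is a minimal sketch of that style, assuming the Cascading 2.x Hadoop API (package names differ slightly between versions); the class name, field names, and paths are invented for the example:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class LogFlowFactory {

    // The assembly is built once; callers decide where the data comes from and goes to.
    public static Flow buildFlow(String name, Tap source, Tap sink, Properties props) {
        Pipe pipe = new Pipe(name);
        // split each tab-separated log line into named fields (format assumed for the example)
        pipe = new Each(pipe, new Fields("line"),
                new RegexSplitter(new Fields("ip", "url", "status"), "\t"));
        return new HadoopFlowConnector(props).connect(name, source, sink, pipe);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // development: plain text in, plain text out
        Tap source = new Hfs(new TextLine(new Fields("offset", "line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1]);
        // for production you would swap the sink for a SequenceFile scheme or an HBase tap
        // without touching the assembly, or call buildFlow() in a loop to stamp out many
        // similar jobs -- the "factory" idea described above
        buildFlow("parse-logs", source, sink, props).complete();
    }
}
```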

    While there are a lot of pre-built components that you can compose together, if a particular section of your processing logic seems like it would be easier to just write in straight Java, you can always create a Cascading function to wrap it. This lets you keep the benefits of Cascading while writing very custom operations as plain Java (implementing a Cascading interface).
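
    As a hypothetical example of such a wrapper: a custom operation is just a class extending BaseOperation and implementing Function, and the logic inside operate() is ordinary Java. This is a sketch against the Cascading 2.x API; the class name, field names, and parsing logic are made up for illustration:

```java
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

public class ExtractUserAgent extends BaseOperation implements Function {

    public ExtractUserAgent() {
        // one incoming argument, one declared output field
        super(1, new Fields("userAgent"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        // plain Java: pull the raw line out of the tuple and run whatever logic you like
        String line = functionCall.getArguments().getString(0);
        String userAgent = parseUserAgent(line); // ordinary helper method, nothing Cascading-specific

        functionCall.getOutputCollector().add(new Tuple(userAgent));
    }

    private String parseUserAgent(String line) {
        // placeholder for real parsing logic: grab the last quoted field of a log line
        int end = line.lastIndexOf('"');
        int start = line.lastIndexOf('"', end - 1);
        return (start >= 0 && end > start) ? line.substring(start + 1, end) : "unknown";
    }
}
```

    It would then drop into an assembly like any built-in operation, for example with new Each(pipe, new Fields("line"), new ExtractUserAgent()).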

  • 2021-01-30 01:10

    I think Cascading's advantages begin to show in cases where you have a pile of simple functions that should all be kept separate in source code, but which can all be collected into a composition in your mapper or reducer. Putting them together makes your basic map-reduce code hard to read; separating them makes the program really slow. Cascading's optimizer can put them together even though you write them separately. Pig and, to some extent, Hive can do this as well, but for large programs I think Cascading has a maintainability advantage.
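
    For illustration, this is roughly what such a composition looks like; a sketch using Cascading 2.x built-in operations, with the field names and log format assumed. The small steps stay separate in source code, and the planner packs the consecutive Each pipes into a single map phase:

```java
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class PageViewAssembly {

    // Each step is a small, separately testable operation; Cascading's planner
    // collapses the consecutive Each pipes into the same mapper, so keeping them
    // separate in source code does not add extra MapReduce jobs.
    public static Pipe build() {
        Pipe pipe = new Pipe("pageviews");
        // 1. split the raw log line into named fields
        pipe = new Each(pipe, new Fields("line"),
                new RegexSplitter(new Fields("ip", "url", "status"), "\t"));
        // 2. keep only successful requests
        pipe = new Each(pipe, new Fields("status"), new RegexFilter("200"));
        // 3. count views per url -- the only step that forces a reduce
        pipe = new GroupBy(pipe, new Fields("url"));
        pipe = new Every(pipe, new Count(new Fields("views")));
        return pipe;
    }
}
```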

    In a few months Plume may be an expressivity competitor, but if you have real programs to write and run in a production setting, then Cascading is probably your best bet.
