Why does Hadoop need to introduce these new classes? They just seem to complicate the interface
Because in a big data world, structured objects need to be serialized to a byte stream for moving over the network or persisting to disk on the cluster, and then deserialized back again as needed. When you have vast amounts of data to store and move at Facebook-like scale, your data needs to be efficient, taking as little space to store and as little time to move as possible. String and Integer are simply too "fat." Text and IntWritable, respectively, provide a much lighter abstraction on top of byte arrays representing the same kind of information.
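To make the "fat" claim concrete, here is a small sketch (my own illustration, not from the answer above) that serializes the value 42 both ways; the exact byte count on the Java side depends on the JVM, but the gap is always large:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.io.IntWritable;

public class SizeDemo {
    public static void main(String[] args) throws Exception {
        // Hadoop Writable: IntWritable.write() emits just the raw 4-byte int.
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        new IntWritable(42).write(new DataOutputStream(writableBytes));

        // Java serialization: ObjectOutputStream wraps those same 4 bytes in a
        // stream header, a class descriptor and field metadata.
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(Integer.valueOf(42));
        }

        System.out.println("IntWritable:              " + writableBytes.size() + " bytes"); // 4
        System.out.println("Integer via Serializable: " + javaBytes.size() + " bytes");     // dozens
    }
}
```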
They exist so that objects can be handled the Hadoop way. For example, Hadoop uses Text instead of Java's String. The Text class in Hadoop is similar to a Java String, but Text implements interfaces like Comparable, Writable and WritableComparable.
These interfaces are all necessary for MapReduce: the Comparable interface is used when the framework sorts the keys before they reach the reducer, and Writable is what lets the result be written to local disk. Hadoop does not use Java's Serializable because Java serialization is too big and too heavy for Hadoop; Writable serializes Hadoop objects in a very lightweight way.
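As an illustration (not from the answer above; the class name is made up), a minimal custom key type that plays by these rules could look like this:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: a user id that MapReduce can both serialize and sort.
public class UserIdWritable implements WritableComparable<UserIdWritable> {
    private long id;

    public UserIdWritable() { }                  // no-arg constructor required for deserialization
    public void set(long id) { this.id = id; }   // allows reuse of a single instance
    public long get() { return id; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);                       // Writable: serialize the fields to the stream
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();                      // Writable: read the fields back in the same order
    }

    @Override
    public int compareTo(UserIdWritable other) {
        return Long.compare(id, other.id);       // Comparable: used when the framework sorts keys
    }
}
```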
From the Apache documentation page, the Writable interface is described as:

A serializable object which implements a simple, efficient, serialization protocol, based on DataInput and DataOutput.
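Concretely, the interface the documentation is describing boils down to two methods (shown here as a sketch of its shape, without the javadoc):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Shape of org.apache.hadoop.io.Writable
public interface Writable {
    void write(DataOutput out) throws IOException;     // serialize this object's fields
    void readFields(DataInput in) throws IOException;  // deserialize the fields from the stream
}
```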
With this API there are no hidden complications: the serialization process with these classes is crisp and compact.
For Hadoop to be effective, the serialization/deserialization process has to be optimized, because a huge number of remote calls happen between the nodes in the cluster. The serialization format should therefore be fast, compact, extensible and interoperable. For this reason, the Hadoop framework provides its own I/O classes to replace Java's primitive data types and String, e.g. IntWritable for int, LongWritable for long, Text for String, and so on.
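As a sketch of how those types show up in practice (my own example, not from the answer above), the familiar word-count mapper declares every key and value with these I/O classes instead of long, String and int:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: every key/value crossing the framework boundary is a
// Writable type (LongWritable, Text, IntWritable), never a bare long, String or int.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();     // created once, refilled for every token

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());      // set() reuses the same Text instance
            context.write(word, ONE);
        }
    }
}
```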
You can find more details about this topic in Hadoop: The Definitive Guide, 4th Edition.
Some more good info:
They’ve got two features that are relevant:

They have the “Writable” interface: they know how to write to a DataOutput stream and read from a DataInput stream, explicitly.

They have their contents updated via the set() operation. This lets you reuse the same value repeatedly without creating new instances. It’s a lot more efficient if the same mapper or reducer is called repeatedly: you just create your instances of the writables in the constructor and reuse them.

In comparison, Java’s Serializable framework “magically” serializes objects, but it does so in a way that is a bit brittle, and it is generally impossible to read in values generated by older versions of a class. The Java object stream is designed to send a graph of objects: it has to remember every object reference pushed out already, and do the same on the way back. The writables are designed to be self-contained.
This is from: http://hortonworks.com/community/forums/topic/why-hadoop-uses-default-longwritable-or-intwritable/
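Both points, the explicit write/readFields protocol and instance reuse via set(), show up in a short self-contained sketch (my own illustration; ReuseDemo is a made-up name):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

// Writables are self-contained: the reader must already know the type it is
// reading, and a single instance can be refilled over and over with set()
// and readFields() instead of allocating a new object per record.
public class ReuseDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);

        IntWritable value = new IntWritable();
        for (int i = 0; i < 1_000_000; i++) {
            value.set(i);             // reuse one instance for every record written
            value.write(out);         // 4 bytes per record, no per-object metadata
        }

        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        IntWritable readBack = new IntWritable();
        for (int i = 0; i < 1_000_000; i++) {
            readBack.readFields(in);  // refill the same instance on the way back in
        }
        System.out.println("last value read: " + readBack.get());  // 999999
    }
}
```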