Why does Hadoop need to introduce these new classes? They just seem to complicate the interface
Because in a big data world, structured objects need to be serialized to a byte stream for moving over the network or persisting to disk on the cluster, and then deserialized back again as needed. When you have vast amounts of data to store and move at Facebook-like scale, your data needs to be efficient, taking as little space to store and as little time to move as possible. String and Integer are simply too "fat." Text and IntWritable, respectively, provide a much lighter abstraction on top of byte arrays representing the same kind of information.
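To make the "fat" claim concrete, here is a small sketch (my own illustration, not from the answer above) that serializes the value 42 both ways; the exact byte count on the Java side depends on the JVM, but the gap is always large:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.io.IntWritable;

public class SizeDemo {
    public static void main(String[] args) throws Exception {
        // Hadoop Writable: IntWritable.write() emits just the raw 4-byte int.
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        new IntWritable(42).write(new DataOutputStream(writableBytes));

        // Java serialization: ObjectOutputStream wraps those same 4 bytes in a
        // stream header, a class descriptor and field metadata.
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(Integer.valueOf(42));
        }

        System.out.println("IntWritable:              " + writableBytes.size() + " bytes"); // 4
        System.out.println("Integer via Serializable: " + javaBytes.size() + " bytes");     // dozens
    }
}
```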
They exist so that objects can be handled the Hadoop way. For example, Hadoop uses Text instead of Java's String. The Text class in Hadoop is similar to a Java String, but Text implements interfaces like Comparable, Writable and WritableComparable.
These interfaces are all necessary for MapReduce: the Comparable interface is used when the framework sorts the keys before they reach the reducer, and Writable is what lets the result be written to local disk. Hadoop does not use Java's Serializable because Java serialization is too big and too heavy for Hadoop; Writable serializes Hadoop objects in a very lightweight way.
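As an illustration (not from the answer above; the class name is made up), a minimal custom key type that plays by these rules could look like this:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: a user id that MapReduce can both serialize and sort.
public class UserIdWritable implements WritableComparable<UserIdWritable> {
    private long id;

    public UserIdWritable() { }                  // no-arg constructor required for deserialization
    public void set(long id) { this.id = id; }   // allows reuse of a single instance
    public long get() { return id; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);                       // Writable: serialize the fields to the stream
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();                      // Writable: read the fields back in the same order
    }

    @Override
    public int compareTo(UserIdWritable other) {
        return Long.compare(id, other.id);       // Comparable: used when the framework sorts keys
    }
}
```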
From the Apache documentation page, the Writable interface is described as:

A serializable object which implements a simple, efficient, serialization protocol, based on DataInput and DataOutput.
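Concretely, the interface the documentation is describing boils down to two methods (shown here as a sketch of its shape, without the javadoc):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Shape of org.apache.hadoop.io.Writable
public interface Writable {
    void write(DataOutput out) throws IOException;     // serialize this object's fields
    void readFields(DataInput in) throws IOException;  // deserialize the fields from the stream
}
```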
With this API there are no hidden complications: the serialization process with these classes is crisp and compact.
For Hadoop to be effective, the serialization/deserialization process has to be optimized, because a huge number of remote calls happen between the nodes in the cluster. The serialization format should therefore be fast, compact, extensible and interoperable. For this reason, the Hadoop framework provides its own I/O classes to replace Java's primitive data types and String, e.g. IntWritable for int, LongWritable for long, Text for String, and so on.
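As a sketch of how those types show up in practice (my own example, not from the answer above), the familiar word-count mapper declares every key and value with these I/O classes instead of long, String and int:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: every key/value crossing the framework boundary is a
// Writable type (LongWritable, Text, IntWritable), never a bare long, String or int.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();     // created once, refilled for every token

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());      // set() reuses the same Text instance
            context.write(word, ONE);
        }
    }
}
```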
You can find more details about this topic in Hadoop: The Definitive Guide, 4th Edition.
Some more good info:
They’ve got two features that are relevant:

They have the “Writable” interface: they know how to write to a DataOutput stream and read from a DataInput stream, explicitly.

They have their contents updated via the set() operation. This lets you reuse the same value repeatedly without creating new instances. It’s a lot more efficient if the same mapper or reducer is called repeatedly: you just create your instances of the writables in the constructor and reuse them.

In comparison, Java’s Serializable framework “magically” serializes objects, but it does so in a way that is a bit brittle, and it is generally impossible to read in values generated by older versions of a class. The Java object stream is designed to send a graph of objects: it has to remember every object reference pushed out already, and do the same on the way back. The writables are designed to be self-contained.
This is from: http://hortonworks.com/community/forums/topic/why-hadoop-uses-default-longwritable-or-intwritable/
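Both points, the explicit write/readFields protocol and instance reuse via set(), show up in a short self-contained sketch (my own illustration; ReuseDemo is a made-up name):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

// Writables are self-contained: the reader must already know the type it is
// reading, and a single instance can be refilled over and over with set()
// and readFields() instead of allocating a new object per record.
public class ReuseDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);

        IntWritable value = new IntWritable();
        for (int i = 0; i < 1_000_000; i++) {
            value.set(i);             // reuse one instance for every record written
            value.write(out);         // 4 bytes per record, no per-object metadata
        }

        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        IntWritable readBack = new IntWritable();
        for (int i = 0; i < 1_000_000; i++) {
            readBack.readFields(in);  // refill the same instance on the way back in
        }
        System.out.println("last value read: " + readBack.get());  // 999999
    }
}
```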