Question
I am using a custom writable class as VALUEOUT in the map phase of my MR job. The class has two fields: an org.apache.hadoop.io.Text and an org.apache.hadoop.io.MapWritable. In my reduce function I iterate through the values for each key and perform two operations: 1. filter, 2. aggregate. In the filter, I have some rules to check whether certain values in the MapWritable (with key as Text and value as IntWritable or DoubleWritable) satisfy certain conditions, and if they do I simply add the object to an ArrayList. At the end of the filter operation, I have a filtered list of my custom writable objects. In the aggregate phase, when I access the objects, it turns out that the last object that was successfully filtered in has overwritten all the other objects in the ArrayList. After going through some similar issues with lists on SO where the last object overwrites all the others, I confirmed that I do not have static fields, nor am I reusing the same custom writable by setting different values (which were quoted as the possible reasons for such an issue). For each key in the reducer I have made sure that the CustomWritable, the Text key and the MapWritable are new objects.
In addition, I also performed a simple test by eliminating the filter and aggregate operations in my reduce and just iterating through the values and adding them to an ArrayList in a for loop. In the loop, every time I added a CustomWritable to the list, I logged the contents of the entire list, both before and after adding the element. Both logs showed that the previous set of elements had been overwritten. I am confused about how this could even happen. As soon as the loop for (CustomWritable result : values) accessed the next element in the iterable of values, the list contents were modified. I am unable to figure out the reason for this behaviour. If anyone can shed some light on this, it would be really helpful. Thanks.
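A simplified sketch of that test loop (the reducer signature and the logging calls here are assumptions for illustration; only the CustomWritable class name comes from the setup described above):

protected void reduce(Text key, Iterable<CustomWritable> values, Context context)
        throws IOException, InterruptedException {
    List<CustomWritable> collected = new ArrayList<>();
    for (CustomWritable result : values) {
        LOG.info("before add: " + collected);  // hypothetical logging call
        collected.add(result);
        LOG.info("after add: " + collected);   // both logs show the earlier entries replaced by the latest one
    }
}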
Answer 1:
The"values" iterator in the reducer reuses the value as you iterate. It's a technique used for performance and smaller memory footprint. Behind the scenes, Hadoop deserializes the next record into the same Java object. If you need to "remember" an object, you'll need to clone it.
You can take advantage of the Writable interface and use the raw bytes to populate a new object:
IntWritable first = WritableUtils.clone(values.next(), context.getConfiguration());
IntWritable second = WritableUtils.clone(values.next(), context.getConfiguration());
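Applied to the setup in the question, a minimal sketch might look like this (the reducer class, the output types, and the passesFilter helper are assumptions; only CustomWritable and the filter/aggregate structure come from the question):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Reducer;

public class FilterAggregateReducer extends Reducer<Text, CustomWritable, Text, CustomWritable> {

    @Override
    protected void reduce(Text key, Iterable<CustomWritable> values, Context context)
            throws IOException, InterruptedException {
        List<CustomWritable> filtered = new ArrayList<>();
        for (CustomWritable value : values) {
            if (passesFilter(value)) {  // hypothetical stand-in for the question's filter rules
                // Clone before storing: WritableUtils.clone serializes and deserializes the
                // Writable, so the stored copy is independent of the single object that
                // Hadoop reuses on every iteration.
                filtered.add(WritableUtils.clone(value, context.getConfiguration()));
            }
        }
        // ... aggregate over "filtered" and write results via context.write(...) ...
    }

    private boolean passesFilter(CustomWritable value) {
        return true;  // placeholder; the real rules check the MapWritable contents
    }
}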
Source: https://stackoverflow.com/questions/45871745/issue-iterating-over-custom-writable-component-in-reducer