Why does the HBase KeyValueSortReducer need to sort all KeyValues?

Question

I have been learning Phoenix CSV bulk load recently, and I found that org.apache.phoenix.mapreduce.CsvToKeyValueReducer causes an OOM (Java heap out of memory) when a row has many columns (in my case, 44 columns per row, with an average row size of 4 KB).

What's more, this class is similar to the HBase bulk load reducer class, KeyValueSortReducer, which means the same OOM may happen when using KeyValueSortReducer in my case.

So my question about KeyValueSortReducer is: why does it need to sort all the KeyValues in a TreeSet first and only then write them to the context? If I remove the TreeSet sorting code and write all the KeyValues directly to the context, will the result be different or wrong?

I am looking forward to your reply. Best wishes!

Here is the source code of KeyValueSortReducer:

public class KeyValueSortReducer extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {
  protected void reduce(ImmutableBytesWritable row, java.lang.Iterable<KeyValue> kvs,
      org.apache.hadoop.mapreduce.Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue>.Context context)
  throws java.io.IOException, InterruptedException {
    TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    for (KeyValue kv: kvs) {
      try {
        map.add(kv.clone());
      } catch (CloneNotSupportedException e) {
        throw new java.io.IOException(e);
      }
    }
    context.setStatus("Read " + map.getClass());
    int index = 0;
    for (KeyValue kv: map) {
      context.write(row, kv);
      if (++index % 100 == 0) context.setStatus("Wrote " + index);
    }
  }
}

Answer 1:

Please have a look at this case study. An HFile requires the KeyValue pairs within the same row to be written in sorted order.

Answer 2:

1. The main question: why does the HBase KeyValueSortReducer need to sort all KeyValues?

Thanks to RamPrasad G's reply, we can look into the case study: http://www.deerwalk.com/blog/bulk-importing-data/

This case study explains more about HBase bulk import and the reducer class KeyValueSortReducer. The reduce method of KeyValueSortReducer sorts all KeyValues because the HFile needs them to be written in sorted order. The relevant section is:

A frequently occurring problem while reducing is lexical ordering. It happens when keyvalue list to be outputted from reducer is not sorted. One example is when qualifier names for a single row are not written in lexically increasing order. Another being when multiple rows are written in same reduce method and row id’s are not written in lexically increasing order. It happens because reducer output is never sorted. All sorting occurs on keyvalue outputted by mapper and before it enters reduce method. So, it tries to add keyvalue’s outputted from reduce method in incremental fashion assuming that it is presorted. So, before keyvalue’s are written into context, they must be added into sorting list like TreeSet or HashSet with KeyValue.COMPARATOR as comparator and then writing them in order specified by sorted list.

So, when a row has a very large number of columns, the sort uses a lot of memory. As the Javadoc in the source code of KeyValueSortReducer mentions:

/**
 * Emits sorted KeyValues.
 * Reads in all KeyValues from passed Iterator, sorts them, then emits
 * KeyValues in sorted order.  If lots of columns per row, it will use lots of
 * memory sorting.
 * @see HFileOutputFormat
 */
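To illustrate the ordering requirement described above, here is a minimal JDK-only sketch. It does not use the HBase API: `SortedAppendWriter` is a hypothetical stand-in for a writer that, like an HFile, expects keys in sorted order, and plain strings stand in for KeyValues. The shuffle sorts by row key only, so the values for one row can reach the reducer in any order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class SortedAppendDemo {

    /** Hypothetical stand-in for a writer that, like an HFile, expects keys in sorted order. */
    static final class SortedAppendWriter {
        private final Comparator<String> comparator;
        private final List<String> written = new ArrayList<>();
        private String last;

        SortedAppendWriter(Comparator<String> comparator) {
            this.comparator = comparator;
        }

        void append(String key) {
            // Reject a key that sorts before the previously appended key.
            if (last != null && comparator.compare(key, last) < 0) {
                throw new IllegalStateException("Out of order: " + key + " after " + last);
            }
            written.add(key);
            last = key;
        }
    }

    /** Writes keys in arrival order, as a reducer without the TreeSet would. */
    static List<String> writeDirect(List<String> keys) {
        SortedAppendWriter writer = new SortedAppendWriter(Comparator.naturalOrder());
        for (String key : keys) {
            writer.append(key);
        }
        return writer.written;
    }

    /** Buffers every key in a TreeSet first, then writes in sorted order. */
    static List<String> writeSorted(List<String> keys) {
        Comparator<String> comparator = Comparator.naturalOrder();
        // The TreeSet holds every key of the row in memory until the loop below finishes.
        TreeSet<String> buffer = new TreeSet<>(comparator);
        buffer.addAll(keys);
        SortedAppendWriter writer = new SortedAppendWriter(comparator);
        for (String key : buffer) {
            writer.append(key);
        }
        return writer.written;
    }

    public static void main(String[] args) {
        List<String> qualifiers = Arrays.asList("q3", "q1", "q2");
        System.out.println(writeSorted(qualifiers));
        try {
            writeDirect(qualifiers);
        } catch (IllegalStateException e) {
            System.out.println("direct write failed: " + e.getMessage());
        }
    }
}
```

In this sketch the direct write is rejected as soon as "q1" arrives after "q3", while the buffered version succeeds at the cost of holding the whole row in memory. That is the trade-off the Javadoc above warns about.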

2. The related question: why does the Phoenix CSV BulkLoad reducer cause OOM?

The Phoenix CSV BulkLoad reducer runs out of memory because of the issue tracked as PHOENIX-2649. The Comparator inside CsvTableRowKeyPair compared two CsvTableRowKeyPair instances incorrectly, so all rows were passed to a single reducer in a single reduce call, which caused the OOM quickly in my case.

Fortunately, the Phoenix team has fixed this issue in version 4.7. If your Phoenix version is below 4.7, please be aware of it and upgrade, or apply the patch to your version.
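The failure mode described above, a comparator that treats distinct row keys as equal, can be sketched with the JDK alone. This is not the actual Phoenix code: the always-equal comparator is a hypothetical stand-in for the faulty comparison, and `groupIntoReduceCalls` is an illustrative helper that mimics how consecutive keys comparing equal end up in the same reduce call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingComparatorDemo {

    /**
     * Sorts the keys, then puts consecutive keys that compare equal into the
     * same group. Each group models the input of one reduce call.
     */
    static List<List<String>> groupIntoReduceCalls(List<String> rowKeys, Comparator<String> cmp) {
        List<String> sorted = new ArrayList<>(rowKeys);
        sorted.sort(cmp);
        List<List<String>> calls = new ArrayList<>();
        String prev = null;
        for (String key : sorted) {
            // Start a new group whenever the key differs from the previous one.
            if (prev == null || cmp.compare(prev, key) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(key);
            prev = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> rowKeys = Arrays.asList("row3", "row1", "row2");

        // A correct comparator: one reduce call per distinct row key.
        Comparator<String> correct = Comparator.naturalOrder();
        // Hypothetical broken comparator: every pair of keys compares as equal.
        Comparator<String> broken = (a, b) -> 0;

        System.out.println("correct: " + groupIntoReduceCalls(rowKeys, correct).size() + " reduce calls");
        System.out.println("broken:  " + groupIntoReduceCalls(rowKeys, broken).size() + " reduce calls");
    }
}
```

With the broken comparator every row lands in one group. A sort-then-write reducer would then have to buffer the KeyValues of all rows in that single call, which is consistent with the OOM described above even though each individual row is only about 4 KB.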

I hope this answer helps!

Source: https://stackoverflow.com/questions/37047145/why-hbase-keyvaluesortreducer-need-to-sort-all-keyvalue

Tags

Hadoop

hbase

phoenix

bulk-load