How to sort comma-separated keys in Reducer output?

抹茶落季 2021-01-14 16:45

I am running an RFM analysis program using MapReduce. The OutputKeyClass is Text.class and I am emitting comma-separated R (Recency), F (Frequency), M (Monetary) as the key.

1 answer
  •  隐瞒了意图╮
    2021-01-14 17:12

    After a lot of searching I found some useful material, which I have compiled below:

    1. Start with a custom data type. Since I had three comma-separated values that needed to be sorted in descending order, I created a TextQuadlet.java data type for Hadoop. It is a quadlet because the first part of the key is the natural key and the other three parts are R, F and M:

      import java.io.*;
      import org.apache.hadoop.io.*;
      public class TextQuadlet implements WritableComparable<TextQuadlet> {
      private String customer_id;
      private long R;
      private long F;
      private double M;
      public TextQuadlet() {
      }
      public TextQuadlet(String customer_id, long R, long F, double M) {
          set(customer_id, R, F, M);
      }
      public void set(String customer_id2, long R2, long F2, double M2) {
          this.customer_id = customer_id2;
          this.R = R2;
          this.F = F2;
          this.M=M2;
      }
      public String getCustomer_id() {
          return customer_id;
      }
      public long getR() {
          return R;
      }
      public long getF() {
          return F;
      }
      public double getM() {
          return M;
      }
      @Override
      public void write(DataOutput out) throws IOException {
          out.writeUTF(this.customer_id);
          out.writeLong(this.R);
          out.writeLong(this.F);
          out.writeDouble(this.M);
      }
      @Override
      public void readFields(DataInput in) throws IOException {
          this.customer_id = in.readUTF();
          this.R = in.readLong();
          this.F = in.readLong();
          this.M = in.readDouble();
      }
      // hashCode must stay consistent with equals. Only customer_id is
      // hashed, so a partitioner that relies on hashCode() keeps every
      // composite key of one customer in the same partition.
      @Override
      public int hashCode() {
          return customer_id.hashCode();
      }
      @Override
      public boolean equals(Object o) {
          if (o instanceof TextQuadlet) {
              TextQuadlet tp = (TextQuadlet) o;
              return customer_id.equals(tp.customer_id) && R == (tp.R) && F==(tp.F) && M==(tp.M);
          }
          return false;
      }
      @Override
      public String toString() {
          return customer_id + "," + R + "," + F + "," + M;
      }
      // compareTo defines the sort order: a negative result means this key
      // sorts before tp, a positive result means it sorts after tp, and
      // 0 means the two keys tie.
      // Returning 1 when this key holds the smaller value therefore places
      // larger values first, i.e. descending order.
      @Override
      public int compareTo(TextQuadlet tp) {
      // My natural key is customer_id, and I deliberately leave it out of
      // the comparison (so compareTo is not consistent with equals).
      
      // R, F and M are compared in that order, each descending.
          if (this.R != tp.R) {
              if(this.R < tp.R) {
                  return 1;
              }
              else{
                  return -1;
              }
          }
          if (this.F != tp.F) {
              if(this.F < tp.F) {
                  return 1;
              }
              else{
                  return -1;
              }
          }
          if (this.M != tp.M){
              if(this.M < tp.M) {
                  return 1;
              }
              else{
                  return -1;
              }
          }
          return 0;
      }
      public static int compare(TextQuadlet tp1, TextQuadlet tp2) {
          int cmp = tp1.compareTo(tp2);
          return cmp;
      }
      public static int compare(Text customer_id1, Text customer_id2) {
          int cmp = customer_id1.compareTo(customer_id2);
          return cmp;
      }
      }
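One detail in the class above is easy to get wrong: readFields has to read the fields in exactly the order write emitted them. The sketch below checks that round trip with plain java.io streams. QuadletRoundTrip is a hypothetical stand-in I made up for illustration (same fields, no Hadoop dependency), not part of the job itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for TextQuadlet: same fields and field order,
// but no Hadoop dependency.
class QuadletRoundTrip {
    final String customerId;
    final long r;
    final long f;
    final double m;

    QuadletRoundTrip(String customerId, long r, long f, double m) {
        this.customerId = customerId;
        this.r = r;
        this.f = f;
        this.m = m;
    }

    // Mirrors write(DataOutput): fields go out in a fixed order.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(customerId);
            out.writeLong(r);
            out.writeLong(f);
            out.writeDouble(m);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Mirrors readFields(DataInput): fields come back in the same order.
    static QuadletRoundTrip fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String customerId = in.readUTF();
            long r = in.readLong();
            long f = in.readLong();
            double m = in.readDouble();
            return new QuadletRoundTrip(customerId, r, f, m);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        QuadletRoundTrip original = new QuadletRoundTrip("C001", 5, 12, 349.5);
        QuadletRoundTrip copy = fromBytes(original.toBytes());
        // prints C001,5,12,349.5
        System.out.println(copy.customerId + "," + copy.r + "," + copy.f + "," + copy.m);
    }
}
```

If the two methods ever disagree on the order (say, reading F before R), the values come back swapped without any exception, which is why a small round-trip check like this is worth keeping around.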
      
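    2. Next you'll need a custom partitioner so that all the values with the same natural key (customer_id) end up at the same reducer:

      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Partitioner;
      
      public class FirstPartitioner_RFM extends Partitioner<TextQuadlet, Text> {
      @Override
      public int getPartition(TextQuadlet key, Text value, int numPartitions) {
          // Partition on the natural key only, so that all composite keys
          // of one customer reach the same reducer. Masking the sign bit
          // keeps the result from ever being negative.
          return (key.getCustomer_id().hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
      }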
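A note on the arithmetic in the partitioner: getPartition is expected to return a value in [0, numPartitions), but Java's % keeps the sign of the left operand, so a negative hash code gives a negative remainder. Hadoop's own HashPartitioner avoids this by masking with Integer.MAX_VALUE first. Here is a minimal plain-Java sketch of that idea; PartitionSketch and partitionFor are hypothetical names used only for illustration.

```java
// Hypothetical helper, not part of Hadoop: picks a partition from the
// natural key alone.
class PartitionSketch {
    static int partitionFor(String customerId, int numPartitions) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // remainder is always in [0, numPartitions).
        return (customerId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] ids = {"C001", "C002", "polygenelubricants"};
        for (String id : ids) {
            // "plain" is what an unmasked hashCode() % n would return.
            System.out.println(id + " hash=" + id.hashCode()
                    + " plain=" + (id.hashCode() % 4)
                    + " masked=" + partitionFor(id, 4));
        }
    }
}
```

Because only the customer ID goes into the hash, two composite keys of the same customer always map to the same partition, whatever their R, F and M values are.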
      
    3. Thirdly, you'll need a custom group comparator so that all the values are grouped together by their natural key, which is customer_id, and not by the composite key, which is customer_id,R,F,M:

      import org.apache.hadoop.io.WritableComparable;
      import org.apache.hadoop.io.WritableComparator;
      
      public class GroupComparator_RFM_N extends WritableComparator {
      protected GroupComparator_RFM_N() {
          super(TextQuadlet.class, true);
      }
      @SuppressWarnings("rawtypes")
      @Override
      public int compare(WritableComparable w1, WritableComparable w2) {
          TextQuadlet ip1 = (TextQuadlet) w1;
          TextQuadlet ip2 = (TextQuadlet) w2;
          // Here we tell hadoop to group the keys by their natural key.
          return ip1.getCustomer_id().compareTo(ip2.getCustomer_id());
      }
      }
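One caveat, as far as I understand the reduce side: groups are formed by comparing each key with the next one in sorted order, so two keys with the same customer_id only share a reduce call if the sort leaves them next to each other. With one composite key per customer (my case) this does not matter, but if a customer can appear with several R,F,M values, a sort that ignores customer_id may split that customer into several groups. The plain-Java sketch below simulates this; GroupingSketch, Key and the two comparators are hypothetical names, and only R is kept to stay short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation of "sort, then group adjacent keys".
class GroupingSketch {
    static final class Key {
        final String customerId;
        final long r;

        Key(String customerId, long r) {
            this.customerId = customerId;
            this.r = r;
        }
    }

    // Sort on R only, descending (customerId is ignored).
    static final Comparator<Key> BY_R_DESC = (a, b) -> Long.compare(b.r, a.r);

    // Sort on customerId first, then on R descending.
    static final Comparator<Key> BY_ID_THEN_R_DESC =
            Comparator.comparing((Key k) -> k.customerId).thenComparing(BY_R_DESC);

    static List<Key> sortedCopy(List<Key> keys, Comparator<Key> order) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(order);
        return copy;
    }

    // Counts runs of adjacent keys that share a customerId, which is how
    // this sketch models one reduce call per group.
    static int countGroups(List<Key> sorted) {
        int groups = 0;
        String previous = null;
        for (Key k : sorted) {
            if (previous == null || !previous.equals(k.customerId)) {
                groups++;
            }
            previous = k.customerId;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(new Key("A", 5), new Key("B", 4), new Key("A", 3));
        // Sorted on R only: A5, B4, A3, so customer A is split. Prints 3.
        System.out.println(countGroups(sortedCopy(keys, BY_R_DESC)));
        // Sorted on customerId first: A5, A3, B4. Prints 2.
        System.out.println(countGroups(sortedCopy(keys, BY_ID_THEN_R_DESC)));
    }
}
```

If you need one reduce call per customer with the values ordered by R, F, M inside it, the usual secondary-sort arrangement is to compare the natural key first in the sort comparator and only then the remaining fields.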
      
    4. Fourthly, you'll need a key comparator that again sorts the keys on R, F, M in descending order, using the same logic as compareTo in TextQuadlet.java. I structured the comparisons slightly differently in this function, but the underlying logic is the same:

      import org.apache.hadoop.io.WritableComparable;
      import org.apache.hadoop.io.WritableComparator;
      
      public class KeyComparator_RFM extends WritableComparator {
      protected KeyComparator_RFM() {
          super(TextQuadlet.class, true);
      }
      @SuppressWarnings("rawtypes")
      @Override
      public int compare(WritableComparable w1, WritableComparable w2) {
          TextQuadlet ip1 = (TextQuadlet) w1;
          TextQuadlet ip2 = (TextQuadlet) w2;
          // The return value follows the usual comparator contract:
          // negative means ip1 sorts before ip2, positive means ip1 sorts
          // after ip2, and 0 means they tie.
          // Returning 1 when ip1 holds the smaller value places larger
          // values first, which gives the descending order on R, then F,
          // then M.
          if(ip1.getR() == ip2.getR()){
              if(ip1.getF() == ip2.getF()){
                  if(ip1.getM() == ip2.getM()){
                      return 0;
                  }
                  else{
                      if(ip1.getM() < ip2.getM())
                          return 1;
                      else
                          return -1;
                  }
              }
              else{
                  if(ip1.getF() < ip2.getF())
                      return 1;
                  else
                      return -1;
              }
          }
          else{
              if(ip1.getR() < ip2.getR())
                  return 1;
              else
                  return -1;
          }
      }
      }
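The nested if/else chain can also be written more compactly. The sketch below is one way to do it in plain Java, using Long.compare and Double.compare with the arguments swapped to get descending order; RfmOrderSketch and compareDescending are hypothetical names, not part of the job above. One assumption to be aware of: Double.compare treats NaN and -0.0 differently from the < and == operators, so the two versions agree only for ordinary values of M.

```java
// Hypothetical helper showing the same R, F, M descending order in a
// flatter form.
class RfmOrderSketch {
    static int compareDescending(long r1, long f1, double m1,
                                 long r2, long f2, double m2) {
        // Swapping the arguments (second key first) reverses the natural
        // ascending order of Long.compare and Double.compare.
        int cmp = Long.compare(r2, r1);
        if (cmp != 0) {
            return cmp;
        }
        cmp = Long.compare(f2, f1);
        if (cmp != 0) {
            return cmp;
        }
        return Double.compare(m2, m1);
    }

    public static void main(String[] args) {
        // Larger R sorts first, so the result is negative: prints true.
        System.out.println(compareDescending(5, 1, 1.0, 3, 9, 9.0) < 0);
        // R ties and the first key has the smaller F: prints true.
        System.out.println(compareDescending(5, 1, 1.0, 5, 2, 0.0) > 0);
        // All three fields tie: prints true.
        System.out.println(compareDescending(5, 2, 7.5, 5, 2, 7.5) == 0);
    }
}
```

Keeping this logic in one helper and calling it from both compareTo and the key comparator would avoid maintaining two copies of the same ordering.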
      
    5. And finally, in your driver class, you'll have to register the custom classes. Here I have used TextQuadlet, Text as the key-value pair, but you can choose any other classes depending on your needs:

      job.setPartitionerClass(FirstPartitioner_RFM.class);
      job.setSortComparatorClass(KeyComparator_RFM.class);
      job.setGroupingComparatorClass(GroupComparator_RFM_N.class);
      job.setMapOutputKeyClass(TextQuadlet.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(TextQuadlet.class);
      job.setOutputValueClass(Text.class);
      

    Do correct me if I have gone wrong somewhere in the code or in the explanation. I have based this answer purely on my own understanding of what I read on the internet, and it works for me perfectly.
