How can I add row numbers for rows in PIG or HIVE?

半阙折子戏 2020-12-17 02:18

I have a problem when adding row numbers using Apache Pig. The problem is that I have a STR_ID column and I want to add a ROW_NUM column for the data in STR_ID, which is the

8 answers
  • 2020-12-17 03:18

    For folks wondering about Pig, I found the best way (currently) is to write your own UDF. I wanted to add row numbers for tuples in a bag. This is the code for that:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.DataType;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;
    
    public class RowCounter extends EvalFunc<DataBag> {
        private final BagFactory mBagFactory = BagFactory.getInstance();
    
        // Appends a 1-based row number to every tuple of the bag passed as the first argument.
        @Override
        public DataBag exec(Tuple input) throws IOException {
            // Return null for missing input instead of failing with a NullPointerException.
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            DataBag bag = (DataBag) input.get(0);
            DataBag output = mBagFactory.newDefaultBag();
            int count = 1;
            for (Tuple t : bag) {
                t.append(count);  // mutates the input tuple in place
                output.add(t);
                count++;
            }
            return output;
        }
    
        // Declares the UDF's output type as a bag.
        @Override
        public Schema outputSchema(Schema input) {
            try {
                Schema bagSchema = new Schema();
                bagSchema.add(new Schema.FieldSchema(null, DataType.BAG));
    
                return new Schema(new Schema.FieldSchema(
                        getSchemaName(this.getClass().getName().toLowerCase(), input),
                        bagSchema, DataType.BAG));
            } catch (Exception e) {
                return null;
            }
        }
    }
    

    This code is for reference only and may not be error-proof.
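
To see what the loop in the UDF does without a Pig runtime, here is a minimal plain-Java sketch. It is not part of the answer's code: `RowNumberSketch` and `addRowNumbers` are hypothetical names, a `List<List<Object>>` stands in for a Pig bag, and the sample values are made up. Unlike the UDF, it copies each tuple rather than appending to the input in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UDF's loop: the outer list plays the role
// of a Pig bag, and each inner list plays the role of a tuple.
final class RowNumberSketch {

    // Returns a new "bag" in which every "tuple" has a 1-based row number appended.
    static List<List<Object>> addRowNumbers(List<List<Object>> bag) {
        List<List<Object>> output = new ArrayList<>();
        int count = 1;
        for (List<Object> tuple : bag) {
            // Copy the tuple so the caller's data is left untouched.
            List<Object> numbered = new ArrayList<>(tuple);
            numbered.add(count);
            output.add(numbered);
            count++;
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = new ArrayList<>();
        bag.add(Arrays.<Object>asList("A"));
        bag.add(Arrays.<Object>asList("B"));
        bag.add(Arrays.<Object>asList("C"));
        System.out.println(addRowNumbers(bag)); // prints [[A, 1], [B, 2], [C, 3]]
    }
}
```

The numbers follow the iteration order of the input, so in Pig the bag would need to be ordered beforehand if a particular numbering is expected.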

  • 2020-12-17 03:20

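    Hive solution (window functions require Hive 0.11 or later; replace your_table with your table name). To number the rows in random order -

    select *
      ,row_number() over (order by rand()) as row_num
      from your_table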
    

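    Or, if you want the rows numbered in ascending STR_ID order (row_number() gives every row a distinct number, whereas rank() repeats a number for duplicate STR_ID values) -

    select *
      ,row_number() over (order by STR_ID) as row_num
      from your_table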
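
When numbering by STR_ID, it matters whether a ranking function or a row-numbering function is used, because the two differ as soon as STR_ID values repeat. The following plain-Java sketch illustrates the standard SQL semantics of `rank()` and `row_number()` on an already sorted column. The class name `RankVsRowNumber` and the sample IDs are hypothetical, and this is not how Hive implements these functions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of rank() versus row_number() semantics
// over a list of IDs that is already sorted in ascending order.
final class RankVsRowNumber {

    // row_number(): consecutive numbers 1..n, even when values are equal.
    static int[] rowNumbers(List<String> sortedIds) {
        int[] result = new int[sortedIds.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = i + 1;
        }
        return result;
    }

    // rank(): equal values share a rank, and the next distinct value
    // takes its 1-based position, leaving a gap after the ties.
    static int[] ranks(List<String> sortedIds) {
        int[] result = new int[sortedIds.size()];
        for (int i = 0; i < result.length; i++) {
            boolean tie = i > 0 && sortedIds.get(i).equals(sortedIds.get(i - 1));
            result[i] = tie ? result[i - 1] : i + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "a", "b");
        System.out.println(Arrays.toString(ranks(ids)));      // prints [1, 1, 3]
        System.out.println(Arrays.toString(rowNumbers(ids))); // prints [1, 2, 3]
    }
}
```

If STR_ID is unique, both functions yield the same numbering.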
    