Removing duplicates using PigLatin

一向 2020-12-29 13:01

I'm using PigLatin to filter some records.

User1  8 NYC 
User1  9 NYC 
User1  7 LA 
User2  4 NYC
User2  3 DC 

The script should remove the

2 Answers
  • 2020-12-29 13:18

    For your particular example, DISTINCT will not work well, because your output contains all of the input columns ($0, $1, $2) and rows that differ only in $1 still count as different. You can apply DISTINCT only to a projection of the columns ($0, $2) or ($0), which means losing $1.
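
    As a rough sketch of what DISTINCT on a projection looks like (the file name users.tsv, the tab delimiter, and the field names user, num, and city are assumptions made for illustration, not taken from the question):

    -- hypothetical tab-separated input; adjust the path and schema to your data
    inpt = LOAD 'users.tsv' AS (user:chararray, num:int, city:chararray);
    -- project only the columns that should define uniqueness ($0 and $2)
    user_city = FOREACH inpt GENERATE user, city;
    -- drop duplicate (user, city) pairs; the num column is no longer available
    uniq_user_city = DISTINCT user_city;
    DUMP uniq_user_city;

    On the sample data this should leave four pairs (User1/NYC, User1/LA, User2/NYC, User2/DC), so it still does not give one record per user.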

    To select one record per user (any record), you could use GROUP BY and a nested FOREACH with LIMIT. For example:

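    inpt = load '......' ......;
    -- collect all records that share the same user ($0) into one bag
    user_grp = GROUP inpt BY $0;
    filtered = FOREACH user_grp {
          -- keep a single, arbitrary record from each user's bag
          top_rec = LIMIT inpt 1;
          GENERATE FLATTEN(top_rec);
    };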
    

    This approach gives you records that are unique on a subset of fields, and the LIMIT value lets you control how many records are output per user.
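
    If it matters which record survives, one possible variation is to sort each user's bag before applying LIMIT. This is only a sketch: the file name users.tsv, the schema, and the choice of keeping the record with the largest second column are assumptions for illustration.

    -- hypothetical tab-separated input; the second column is typed as int so it sorts numerically
    inpt = LOAD 'users.tsv' AS (user:chararray, num:int, city:chararray);
    user_grp = GROUP inpt BY user;
    filtered = FOREACH user_grp {
          -- order this user's records so the preferred one comes first
          sorted = ORDER inpt BY num DESC;
          -- keep only that first record
          top_rec = LIMIT sorted 1;
          GENERATE FLATTEN(top_rec);
    };
    DUMP filtered;

    With the sample data from the question this would keep (User1, 9, NYC) and (User2, 4, NYC).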

  • 2020-12-29 13:39

    Pig provides the DISTINCT operator to select unique records. If you want distinct values of particular fields, use DISTINCT inside a nested FOREACH block.
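
    A minimal sketch of DISTINCT inside a nested FOREACH, assuming a hypothetical tab-separated file users.tsv and illustrative field names (neither comes from the question):

    -- hypothetical input and schema
    inpt = LOAD 'users.tsv' AS (user:chararray, num:int, city:chararray);
    by_user = GROUP inpt BY user;
    cities_per_user = FOREACH by_user {
          -- project the one field to de-duplicate within this user's bag
          cities = inpt.city;
          uniq_cities = DISTINCT cities;
          -- emit one row per distinct city for this user
          GENERATE group AS user, FLATTEN(uniq_cities) AS city;
    };
    DUMP cities_per_user;

    Note that this de-duplicates the city values within each user; it does not by itself reduce the output to a single record per user.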
