I'm using Pig Latin to filter some records.
User1 8 NYC
User1 9 NYC
User1 7 LA
User2 4 NYC
User2 3 DC
The script should remove the
For your particular example, DISTINCT will not work well because your output contains all of the input columns ($0, $1, $2). You can apply DISTINCT only on a projection that keeps columns ($0, $2) or ($0) and loses $1.
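A minimal sketch of DISTINCT on a projection, assuming the input is space-delimited; the path, delimiter, and field names (user, score, city) are hypothetical and not from the original post:

```pig
-- Hypothetical path, delimiter, and schema; adjust to your data.
inpt = LOAD 'input.txt' USING PigStorage(' ')
       AS (user:chararray, score:int, city:chararray);

-- Drop the score column ($1) so rows that differ only in score collapse.
user_city = FOREACH inpt GENERATE user, city;
uniq_user_city = DISTINCT user_city;

DUMP uniq_user_city;
```

With the sample rows from the question, only the two User1/NYC rows collapse into one, so four (user, city) pairs remain and the score column is gone.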
In order to select one record per user (any record), you could use a GROUP BY and a nested FOREACH with LIMIT. Ex:
inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
    -- keep one arbitrary record from each user's bag
    top_rec = LIMIT inpt 1;
    GENERATE FLATTEN(top_rec);
};
This approach helps you get records that are unique on a subset of fields and also limits the number of output records per user, which you can control.
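If you want to control which record survives rather than keep an arbitrary one, one option is to sort each user's bag before applying LIMIT. The sketch below assumes the second column is a numeric score and keeps the highest one per user; the path, delimiter, and field names are hypothetical:

```pig
-- Hypothetical path, delimiter, and schema; adjust to your data.
inpt = LOAD 'input.txt' USING PigStorage(' ')
       AS (user:chararray, score:int, city:chararray);

user_grp = GROUP inpt BY user;

top_per_user = FOREACH user_grp {
    -- Sort each user's records so the highest score comes first.
    sorted  = ORDER inpt BY score DESC;
    -- Raise this number to keep more records per user.
    top_rec = LIMIT sorted 1;
    GENERATE FLATTEN(top_rec);
};
```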
Pig provides the DISTINCT operator to select unique records. If you want to apply DISTINCT to specific fields, use DISTINCT inside a nested FOREACH block.
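As a rough illustration of DISTINCT inside a nested FOREACH, the sketch below counts the distinct cities per user. The path, delimiter, and field names are assumptions made for the example:

```pig
-- Hypothetical path, delimiter, and schema; adjust to your data.
inpt = LOAD 'input.txt' USING PigStorage(' ')
       AS (user:chararray, score:int, city:chararray);

user_grp = GROUP inpt BY user;

cities_per_user = FOREACH user_grp {
    -- Project one field from the grouped bag, then de-duplicate it.
    uniq_cities = DISTINCT inpt.city;
    GENERATE group AS user, COUNT(uniq_cities) AS n_cities;
};
```

Here DISTINCT runs on each user's bag separately, so it de-duplicates only the projected field within that group.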