I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:
You cannot us
unique_A = FOREACH (GROUP A BY (a1, a2, a3)) {
limit_a = LIMIT A 1;
GENERATE FLATTEN(limit_a) AS (a1,a2,a3,a4);
};
I was looking to do the same: "I would like to perform a DISTINCT operation on a subset of the columns". The way I did it was:
A = LOAD 'data' AS(a1,a2,a3,a4);
interested_fields = FOREACH A GENERATE a1,a2,a3;
distinct_fields= DISTINCT interested_fields;
final_answer = FOREACH distinct_fields GENERATE FLATTEN($0);
I know it's not an example of how to perform a nested foreach as suggested in the documentation; but it's a way of doing a distinct over a subset of fields. Hope It helps to anyone who gets here just like I did.
For your specified input/output, the following works. You might update your test vectors to clarify what you need that is different than this.
A_unique = DISTINCT A;
Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN
to expand them out again:
A_unique =
FOREACH (GROUP A BY a4) {
b = A.(a1,a2,a3);
s = DISTINCT b;
GENERATE FLATTEN(s), group AS a4;
};
The accepted answer is one great solution but, in case you want to reorder the fields in the output (something I had to do recently) this might not work. Here's an alternative:
A = LOAD '$input' AS (f1, f2, f3, f4, f5);
GP = GROUP A BY (f1, f2, f3);
OUTPUT = FOREACH GP GENERATE
group.f1, group.f2, f4, f5, group.f3 ;
When you group on certain fields, the selection would have unique values for the group in a each tuple.
Here are 2 possible solutions, are there any other good approaches?
Solution 1 (using LIMIT 1):
A = LOAD 'test_data' AS (a1,a2,a3,a4);
-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4
-- Group by the combined column
grouped_by_a4 = GROUP A2 BY combined;
grouped_and_distinct = FOREACH grouped_by_a4 {
single = LIMIT A2 1;
GENERATE FLATTEN(single);
};
Solution 2 (using DISTINCT):
A = LOAD 'test_data' AS (a1,a2,a3,a4);
-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4
-- Group by the other columns (those I don't want the distinct applied to)
grouped_by_a4 = GROUP A2 BY a4;
-- Perform the distinct on a projection of combined and flatten
grouped_and_distinct = FOREACH grouped_by_a4 {
combined_unique = DISTINCT A2.combined;
GENERATE FLATTEN(combined_unique);
};