How to perform a DISTINCT in Pig Latin on a subset of columns?


I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:

You cannot us

6 Answers

悲哀的现实

2020-12-30 07:09

```
unique_A = FOREACH (GROUP A BY (a1, a2, a3)) {
    limit_a = LIMIT A 1;
    GENERATE FLATTEN(limit_a) AS (a1,a2,a3,a4);
};
```


执念已碎

2020-12-30 07:10
I was looking to do the same: "I would like to perform a DISTINCT operation on a subset of the columns". The way I did it was:
```
A = LOAD 'data' AS(a1,a2,a3,a4);
interested_fields = FOREACH A GENERATE a1,a2,a3;
distinct_fields= DISTINCT interested_fields;
-- distinct_fields is already flat, so project the fields directly
final_answer = FOREACH distinct_fields GENERATE a1, a2, a3;
```
I know it's not an example of a nested foreach as suggested in the documentation, but it is a way of doing a distinct over a subset of fields. Note that the other columns (here a4) are dropped from the result. Hope it helps anyone who gets here just like I did.
花落未央

2020-12-30 07:22
For your specified input/output, the following works. You might update your test vectors to clarify what you need that is different than this.
```
A_unique = DISTINCT A;
```
挽巷

2020-12-30 07:31
Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:
```
A_unique =
    FOREACH (GROUP A BY a4) {
        b = A.(a1,a2,a3);
        s = DISTINCT b;
        GENERATE FLATTEN(s), group AS a4;
    };
```
不知归路

2020-12-30 07:31
The accepted answer is a great solution, but it might not work if you want to reorder the fields in the output (something I had to do recently). Here's an alternative:
```
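A = LOAD '$input' AS (f1, f2, f3, f4, f5);
GP = GROUP A BY (f1, f2, f3);
-- OUTPUT is a reserved word in Pig, so use a different alias.
-- f4 and f5 are not group keys; they come back as bags via A.f4 and A.f5.
result = FOREACH GP GENERATE
    group.f1, group.f2, A.f4, A.f5, group.f3;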
```
When you group on certain fields, each output tuple holds one unique combination of the group-key values.

面向向阳花

2020-12-30 07:31

Here are two possible solutions. Are there any other good approaches?

Solution 1 (using LIMIT 1):

```
A = LOAD 'test_data' AS (a1,a2,a3,a4);

-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 AS a4;

-- Group by the combined column
grouped_by_combined = GROUP A2 BY combined;

-- Keep one (arbitrary) row per distinct combined value
grouped_and_distinct = FOREACH grouped_by_combined {
        single = LIMIT A2 1;
        GENERATE FLATTEN(single);
};
```

Solution 2 (using DISTINCT):

```
A = LOAD 'test_data' AS (a1,a2,a3,a4);

-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 AS a4;

-- Group by the other columns (those I don't want the distinct applied to)
grouped_by_a4 = GROUP A2 BY a4;

-- Perform the distinct on a projection of combined and flatten 
grouped_and_distinct = FOREACH grouped_by_a4 {
        combined_unique = DISTINCT A2.combined;
        GENERATE FLATTEN(combined_unique);
};
```
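
A possible variation on the LIMIT 1 approach, in case it matters which row survives for each key: LIMIT in a nested block makes no promise about which row it returns, so sorting the bag first with a nested ORDER makes the choice explicit. This is only a sketch. It assumes the same four-column `test_data` relation as above, and the aliases `ordered` and `first_row` and the choice to sort on `a4` are illustrative, not part of the original solutions.

```pig
A = LOAD 'test_data' AS (a1, a2, a3, a4);

-- One output row per distinct (a1, a2, a3) combination
unique_A = FOREACH (GROUP A BY (a1, a2, a3)) {
    -- Sort the rows of this group so the kept row is predictable
    -- (assumption: the row with the smallest a4 is the one wanted)
    ordered = ORDER A BY a4;
    first_row = LIMIT ordered 1;
    GENERATE FLATTEN(first_row) AS (a1, a2, a3, a4);
};
```

Without a declared type, `a4` is compared as a bytearray, so declaring it in the LOAD schema (for example `a4:int`) is probably wise if a numeric ordering is intended.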
