How to perform a DISTINCT in Pig Latin on a subset of columns?

广开言路 2020-12-30 07:06

I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:

You cannot us

6 Answers
  • 2020-12-30 07:09
    unique_A = FOREACH (GROUP A BY (a1, a2, a3)) {
        limit_a = LIMIT A 1;
        GENERATE FLATTEN(limit_a) AS (a1,a2,a3,a4);
    };
    
  • 2020-12-30 07:10

    I was looking to do the same: "I would like to perform a DISTINCT operation on a subset of the columns". The way I did it was:

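    A = LOAD 'data' AS (a1, a2, a3, a4);
    -- Keep only the columns the DISTINCT should apply to (a4 is dropped here)
    interested_fields = FOREACH A GENERATE a1, a2, a3;
    distinct_fields = DISTINCT interested_fields;
    -- distinct_fields already holds one row per (a1, a2, a3); project all three fields
    final_answer = FOREACH distinct_fields GENERATE a1, a2, a3;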
    

    I know this is not the nested foreach suggested in the documentation, but it is a way of doing a DISTINCT over a subset of fields. Note that the remaining column (a4) is not carried through. I hope it helps anyone who ends up here like I did.

  • 2020-12-30 07:22

    For the input and output you specified, the following works. If you need something different, consider updating your sample data to show what that is.

    A_unique = DISTINCT A;
    
  • 2020-12-30 07:31

    Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:

    A_unique =
        FOREACH (GROUP A BY a4) {
            b = A.(a1,a2,a3);
            s = DISTINCT b;
            GENERATE FLATTEN(s), group AS a4;
        };
    
  • 2020-12-30 07:31

    The accepted answer is a great solution, but it might not work if you want to reorder the fields in the output (something I had to do recently). Here's an alternative:

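    A = LOAD '$input' AS (f1, f2, f3, f4, f5);
    GP = GROUP A BY (f1, f2, f3);
    -- f4 and f5 live inside the bag A, so keep one row per group and flatten it.
    -- (The alias is 'result' because OUTPUT is a reserved keyword in Pig.)
    result = FOREACH GP {
        one = LIMIT A 1;
        GENERATE group.f1, group.f2, FLATTEN(one.(f4, f5)), group.f3;
    };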
    

    When you group on certain fields, each output tuple carries a unique value of the group key.
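
    To see why the grouped relation has one tuple per key, it can help to inspect its schema. This is only a sketch: it reuses the `$input` parameter and the field names f1..f5 from the snippet above, which are placeholders rather than a real dataset.

    ```pig
    -- Sketch only: '$input' and f1..f5 are carried over from the answer above.
    A  = LOAD '$input' AS (f1, f2, f3, f4, f5);
    GP = GROUP A BY (f1, f2, f3);

    -- DESCRIBE prints the schema of GP: a 'group' tuple holding (f1, f2, f3)
    -- and a bag named A holding every original row that shares that key.
    DESCRIBE GP;
    ```

    The non-key columns (f4, f5) are only reachable through the bag A, which is why they have to be projected or flattened out of it.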

  • 2020-12-30 07:31

    Here are two possible solutions. Are there any other good approaches?

    Solution 1 (using LIMIT 1):

    A = LOAD 'test_data' AS (a1,a2,a3,a4);
    
    -- Combine the columns that I want to perform the distinct across into a tuple
    A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 AS a4;
    
    -- Group by the combined column
    grouped_by_combined = GROUP A2 BY combined;
    
    grouped_and_distinct = FOREACH grouped_by_combined {
            single = LIMIT A2 1;
            GENERATE FLATTEN(single);
    };
    

    Solution 2 (using DISTINCT):

    A = LOAD 'test_data' AS (a1,a2,a3,a4);
    
    -- Combine the columns that I want to perform the distinct across into a tuple
    A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 AS a4;
    
    -- Group by the other columns (those I don't want the distinct applied to)
    grouped_by_a4 = GROUP A2 BY a4;
    
    -- Perform the distinct on a projection of combined and flatten 
    grouped_and_distinct = FOREACH grouped_by_a4 {
            combined_unique = DISTINCT A2.combined;
            GENERATE FLATTEN(combined_unique);
    };
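
    The two solutions are not equivalent, and a small sample makes the difference visible. The sketch below assumes a hypothetical four-row 'test_data' file (shown in the comments); the alias names are made up for illustration, and the nested DISTINCT is written in two steps to stay close to the form used in the Pig documentation.

    ```pig
    -- Hypothetical tab-separated contents of 'test_data' (an assumption for illustration):
    --   1  2  3  4
    --   1  2  3  4
    --   1  2  3  5
    --   6  7  8  9
    A  = LOAD 'test_data' AS (a1, a2, a3, a4);
    A2 = FOREACH A GENERATE TOTUPLE(a1, a2, a3) AS combined, a4 AS a4;

    -- Solution 1: one group per distinct (a1,a2,a3), so one row for (1,2,3)
    -- and one row for (6,7,8). Which a4 survives for (1,2,3) depends on
    -- whichever row LIMIT happens to keep.
    by_combined = GROUP A2 BY combined;
    one_per_key = FOREACH by_combined {
        single = LIMIT A2 1;
        GENERATE FLATTEN(single);
    };

    -- Solution 2: one group per a4 value (4, 5 and 9), deduplicated inside
    -- each group, so (1,2,3) is emitted once for a4 = 4 and again for a4 = 5.
    -- a4 itself is not part of the output.
    by_a4 = GROUP A2 BY a4;
    distinct_per_a4 = FOREACH by_a4 {
        c = A2.combined;
        combined_unique = DISTINCT c;
        GENERATE FLATTEN(combined_unique);
    };

    DUMP one_per_key;
    DUMP distinct_per_a4;
    ```

    In other words, Solution 1 yields one row per distinct (a1,a2,a3), while Solution 2 yields distinct (a1,a2,a3) values per a4, so the same combination can appear more than once.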
    