How to flatten a group into a single tuple in Pig?

暗喜 2020-12-14 04:52

From this:

(1, {(1,2), (1,3), (1,4)} )
(2, {(2,5), (2,6), (2,7)} )

...How could we generate this?

((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))

3 Answers
  • 2020-12-14 05:34

    For your question, I prepared the following file:

    1,2
    1,3
    1,4
    2,5
    2,6
    2,7
    

    First, I used the following script to produce the input r3 that you described in your question:

    r1 = load 'test_file' using PigStorage(',') as (a:int, b:int);
    r2 = group r1 by a;
    r3 = foreach r2 generate group as a, r1 as b;
    describe r3;
    -- r3: {a: int,b: {(a: int,b: int)}}
    -- r3 is like (1, {(1,2), (1,3), (1,4)} )
    

    If we want to generate the following content,

    (1, 2, 3, 4)
    (2, 5, 6, 7)
    

    we can use the following script:

    r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
    dump r4;
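
    To make the effect of that statement easier to follow, here is a minimal plain-Python model of it. It is only a sketch: the helper names (bag_to_tuple, project_b, r4_row) are hypothetical and not Pig APIs, and it assumes the builtin BagToTuple concatenates the fields of every tuple in the bag, which is what the output above suggests.

```python
# Plain-Python model of: r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
# Illustrative only; these helper names are hypothetical, not Pig APIs.

def bag_to_tuple(bag):
    """Concatenate the fields of every tuple in the bag into one flat tuple."""
    return tuple(field for tup in bag for field in tup)

def project_b(bag):
    """Model of the projection b.b: keep only the second field of each tuple."""
    return [(tup[1],) for tup in bag]

def r4_row(row):
    a, bag = row
    # FLATTEN on a tuple splices its fields into the enclosing row.
    return (a,) + bag_to_tuple(project_b(bag))

# r3 as described in the question; a Python list stands in for a bag.
r3 = [
    (1, [(1, 2), (1, 3), (1, 4)]),
    (2, [(2, 5), (2, 6), (2, 7)]),
]
r4 = [r4_row(row) for row in r3]
print(r4)  # [(1, 2, 3, 4), (2, 5, 6, 7)]
```

    Note that a Python list has a fixed order while a Pig bag does not, so the order of 2, 3, 4 in the real output is not guaranteed.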
    

    For the following content,

    ((1,2),(1,3),(1,4))
    ((2,5),(2,6),(2,7))
    

    I could not find a suitable builtin function. You may need to write your own version of BagToTuple. Here is the source code of the builtin BagToTuple: http://www.grepcode.com/file/repo1.maven.org/maven2/org.apache.pig/pig/0.11.1/org/apache/pig/builtin/BagToTuple.java#BagToTuple.getOuputTupleSize%28org.apache.pig.data.DataBag%29

  • 2020-12-14 05:49

    There is no builtin way to convert a bag to a tuple. This is because bags are unordered sets of tuples, so Pig doesn't know what order the tuples should be placed in when the bag is converted into a tuple. This means that you'll have to write a UDF to do this.

    I'm not sure how you are creating the (1, 2, 3, 4) tuple, but this is another good candidate for a UDF, even though you could create that schema with just the BagToTuple UDF.

    NOTE: You probably shouldn't be turning anything into a tuple unless you know exactly how many fields there are.

    myudfs.py

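    #!/usr/bin/python

    # outputSchema is made available by Pig when this file is registered USING jython.
    # The field types are int to match the input schema (I1:int, I2:int).
    # The schema hard-codes exactly three tuples per bag.
    @outputSchema('T:(T1:(a1:int, a2:int), T2:(b1:int, b2:int), T3:(c1:int, c2:int))')
    def BagToTuple(B):
        return tuple(B)

    # Assumption: three tuples per bag, like the schema above.
    @outputSchema('T:(a:int, b1:int, b2:int, b3:int)')
    def generateContent(B):
        # Key from the first tuple, followed by the second field of every tuple.
        foo = [B[0][0]] + [t[1] for t in B]
        return tuple(foo)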
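
    Since bags are unordered and the tuple schema fixes the number of fields, a more defensive variant could sort the bag and check its size before converting. The following is only a sketch in plain Python so it can be run without Pig: bag_to_sorted_tuple and EXPECTED_TUPLES are hypothetical names, and it assumes the bag reaches the UDF as a list of tuples, as the code above implies. Inside Pig it would still need an @outputSchema decorator like the one on BagToTuple.

```python
# Hypothetical defensive variant of the BagToTuple UDF, runnable without Pig.
# Assumption: the bag arrives as a list of (int, int) tuples.

EXPECTED_TUPLES = 3  # must match the number of tuples declared in the output schema

def bag_to_sorted_tuple(bag, expected=EXPECTED_TUPLES):
    """Return the bag's tuples as one tuple, in sorted order.

    Sorting makes the result independent of the bag's arbitrary order;
    the size check fails loudly when the data does not fit the fixed schema.
    """
    ordered = sorted(bag)
    if len(ordered) != expected:
        raise ValueError(
            "expected %d tuples in the bag, got %d" % (expected, len(ordered))
        )
    return tuple(ordered)

# The same bag delivered in a different order yields the same tuple.
result = bag_to_sorted_tuple([(1, 4), (1, 2), (1, 3)])
print(result)  # ((1, 2), (1, 3), (1, 4))
```

    Whether to raise an error or return None for a bag of the wrong size is a design choice; raising makes a schema mismatch visible instead of silently producing misaligned fields.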
    

    myscript.pig

    REGISTER 'myudfs.py' USING jython AS myudfs ; 
    
    -- A is (1, {(1,2), (1,3), (1,4)} ) 
    -- The schema is (I:int, B:{T:(I1:int, I2:int)})
    
    B = FOREACH A GENERATE myudfs.BagToTuple(B) ;
    C = FOREACH A GENERATE myudfs.generateContent(B) ;
    
  • 2020-12-14 05:58

    To obtain:

    ((1,2),(1,3),(1,4))
    ((2,5),(2,6),(2,7))
    

    you can do this:

    r4 = foreach r3 {
        Tmp=foreach $1 generate (a,b);
        generate FLATTEN(BagToTuple(Tmp));
    };
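
    The reason this works can be sketched in plain Python. The nested foreach wraps each (a,b) pair in an outer one-field tuple, so when BagToTuple concatenates the fields it emits whole pairs instead of individual numbers. The helper names below are hypothetical, and the model assumes BagToTuple concatenates the fields of every tuple in the bag.

```python
# Plain-Python model of the nested foreach above.
# Illustrative only; these helper names are hypothetical, not Pig APIs.

def bag_to_tuple(bag):
    """Concatenate the fields of every tuple in the bag into one flat tuple."""
    return tuple(field for tup in bag for field in tup)

def r4_row(row):
    _, bag = row
    # Tmp = foreach $1 generate (a,b);
    # Each element becomes a one-field tuple whose single field is the pair (a, b).
    tmp = [((a, b),) for (a, b) in bag]
    # BagToTuple yields one tuple of pairs; FLATTEN splices those pairs
    # into the output row, so the row's fields are the pairs themselves.
    return bag_to_tuple(tmp)

r3 = [
    (1, [(1, 2), (1, 3), (1, 4)]),
    (2, [(2, 5), (2, 6), (2, 7)]),
]
r4 = [r4_row(row) for row in r3]
print(r4)  # [((1, 2), (1, 3), (1, 4)), ((2, 5), (2, 6), (2, 7))]
```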
    