How do I extract the first tuple from a generated bag (whose size might vary) in PIG?

旧时难觅i 2021-01-14 05:58

I am generating a 'bag' of information whose size (number of tuples inside the bag) might vary. From this, I want to extract the first element on the fly. How do I do

3 Answers
  • 2021-01-14 06:24

    Use the DataFu UDF FirstTupleFromBag (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/FirstTupleFromBag.html).
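
    A minimal sketch of how the UDF might be called, assuming DataFu 1.2.0 is available. The jar path, the input file, and the aliases `rows`, `grp`, and `firsts` are hypothetical and not from the original answer.

    ```pig
    -- Jar path is an assumption; point it at your local DataFu jar.
    REGISTER datafu-1.2.0.jar;
    DEFINE FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();

    -- Hypothetical input: comma-separated (k, v) records.
    rows = LOAD 'input.csv' USING PigStorage(',') AS (k:chararray, v:int);
    grp = GROUP rows BY k;

    -- The second argument is the value returned when the bag is empty.
    firsts = FOREACH grp GENERATE group, FirstTupleFromBag(rows, null) AS first_row;
    DUMP firsts;
    ```

    Because a bag has no guaranteed order, which tuple comes back as "first" is arbitrary unless the bag was sorted beforehand.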

  • 2021-01-14 06:44

    If the ordering of the tuples in the bag matters for deciding which one is "first" (of course it does!), then you could do something like the following, which is explained in more detail at https://community.hortonworks.com/questions/22863/cant-we-filter-the-data-which-we-have-done-in-37-s.html#answer-22995.

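    max_runs = FOREACH grp_data {
        -- sort the inner bag so that "first" is well defined
        inner_sorted = ORDER runs BY runs DESC;
        -- keep only the top tuple of the sorted bag
        first_row = LIMIT inner_sorted 1;
        -- first_row is still a bag, holding at most one tuple
        GENERATE first_row AS most_hits;
    }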
    
  • 2021-01-14 06:48

    According to the docs, a bag is a collection of tuples and

    Bag dereferencing can be done by name (bag.field_name) or position (bag.$0). If a set of fields are dereferenced (bag.(name1, name2) or bag.($0, $1)), the expression represents a bag composed of the specified fields.

    But be careful: b.$0 doesn't give you the first tuple in the bag, because bags aren't ordered! Instead, you'll get a bag containing the first field of each constituent tuple.

    You will need to either convert the bag to an ordered structure, or better, use a UDF. You should also unaccept this answer (so I can delete it) and accept Guarev's instead, who has a link to a UDF.
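
    To illustrate the dereferencing caveat, here is a small sketch. The input file and the aliases `rows`, `grp`, and `projected` are hypothetical.

    ```pig
    -- Hypothetical input: comma-separated (k, v) records.
    rows = LOAD 'input.csv' USING PigStorage(',') AS (k:chararray, v:int);
    grp = GROUP rows BY k;

    -- rows.$0 projects field $0 out of every tuple in the bag.
    -- For a bag {(a,1),(a,2)} the result is the bag {(a),(a)},
    -- not the single tuple (a,1).
    projected = FOREACH grp GENERATE group, rows.$0;
    DUMP projected;
    ```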
