How to delete repeated rows of data in Pig

Posted by 那年仲夏 on 2020-03-05 01:02:39

Question


"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 137843120 3014479 1602383 817582

"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 125431369 2912715 1545018 807558

"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 113876217 2811217 1470387 787174

"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 100911567 2656678 1353655 682890

"Marvel Studios' Avengers: Infinity War Official Trailer" 89930713 2606665 53011 347982

"Marvel Studios' Avengers: Infinity War Official Trailer" 87450245 2584675 52176 341571

"Marvel Studios' Avengers: Infinity War Official Trailer" 84281319 2555414 51008 339708

"Marvel Studios' Avengers: Infinity War Official Trailer" 80360459 2513103 49170 335920

"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 75969469 2251826 1127811 827755

"Marvel Studios' Avengers: Infinity War Official Trailer" 74789251 2444960 46172 330710

"Marvel Studios' Avengers: Infinity War Official Trailer" 66637636 2331359 41154 316185

"Marvel Studios' Avengers: Infinity War Official Trailer" 56367282 2157741 34078 303178

"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 52611730 1891822 884963 702784

"To Our Daughter" 51243149 0 0 0

"To Our Daughter" 48635732 0 0 0

The data above has five columns: the first is "title", and the others are views, likes, dislikes, and comment_count.

How can I filter out the repeated data? I want to remove the rows that share the same "title" and keep only the row with the highest views.


Answer 1:


The data has 15 columns; I selected the 5 columns mentioned in the question. The relation 'select_specified_columns' holds the table shown in the question. I did it using the following commands:

-- Keep only the title ($0) and views ($1) columns.
only_likes = foreach select_specified_columns generate $0,$1;

store only_likes into 'test/only_likes';

-- Group the rows by title.
group_likes = group only_likes by $0;

store group_likes into 'test/group_likes';

-- For each title, keep the highest view count.
max_likes = foreach group_likes generate group , MAX(only_likes.views);

store max_likes into 'test/max_likes';

-- Sort the titles by their highest view count, largest first.
result_likes = order max_likes by $1 DESC;

store result_likes into 'test/result_likes';
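
The commands above keep only the title and its highest view count. If the other columns are needed as well, one option is to join the per-title maximum back onto the full rows. The following is a minimal sketch, not the answerer's method. It assumes the five columns sit at positions $0 to $4 of select_specified_columns in the order shown in the question; the relation names videos, by_title, max_views, joined and result are hypothetical.

```pig
-- Assumption: $0..$4 are title, views, likes, dislikes, comment_count.
-- The explicit casts keep the join keys on both sides the same type.
videos = FOREACH select_specified_columns GENERATE
             (chararray)$0 AS title,
             (long)$1 AS views,
             (long)$2 AS likes,
             (long)$3 AS dislikes,
             (long)$4 AS comment_count;

-- One row per title, carrying the highest view count for that title.
by_title = GROUP videos BY title;
max_views = FOREACH by_title GENERATE group AS title, MAX(videos.views) AS views;

-- Join back on (title, views) to recover the remaining columns of the matching row.
joined = JOIN videos BY (title, views), max_views BY (title, views);

-- Drop the duplicated join columns that came from max_views.
result = FOREACH joined GENERATE
             videos::title, videos::views, videos::likes,
             videos::dislikes, videos::comment_count;
```

If two rows for one title share the same highest view count, this join keeps both of them.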




Answer 2:


If you want to retain all fields of the record that has the maximum likes, you can do something like this:

dataAll = LOAD 'path' USING PigStorage('\t') AS (title:chararray, views:long, likes:long, dislikes:long, comment_count:long);

-- Group the data by title so that all records belonging to a title fall into a bag in the same record.
dataGrouped = GROUP dataAll BY title;

-- Using a nested FOREACH, order the contents of the bag by likes and pick the top record.
dataDeduped = FOREACH dataGrouped {
                 sortedByLikes = ORDER dataAll BY likes DESC;
                 maxLikesRecord = LIMIT sortedByLikes 1;
                 GENERATE FLATTEN(maxLikesRecord);
              }

STORE dataDeduped INTO 'outputPath' USING PigStorage('\t');

A nested FOREACH is very useful in such situations. Read more about it here: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (search for "nested foreach" on that page).
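
The question asks for the row with the highest views, while the script above orders by likes. A sketch of the same nested FOREACH ordered by views could look like the following. It reuses the answer's placeholder paths 'path' and 'outputPath' and its tab-delimited schema; the relation names dedupedByViews, sortedByViews, topByViews and dedupedWithTop are hypothetical. The last statement shows the built-in TOP function as a possible shorter alternative, where the second argument is the zero-based column index (1 is views under the assumed schema).

```pig
dataAll = LOAD 'path' USING PigStorage('\t')
          AS (title:chararray, views:long, likes:long, dislikes:long, comment_count:long);

dataGrouped = GROUP dataAll BY title;

-- Within each title's bag, sort by views (highest first) and keep one record.
dedupedByViews = FOREACH dataGrouped {
    sortedByViews = ORDER dataAll BY views DESC;
    topByViews = LIMIT sortedByViews 1;
    GENERATE FLATTEN(topByViews);
};

-- Possible alternative: TOP(1, 1, dataAll) keeps the 1 tuple with the largest value in column 1.
dedupedWithTop = FOREACH dataGrouped GENERATE FLATTEN(TOP(1, 1, dataAll));

STORE dedupedByViews INTO 'outputPath' USING PigStorage('\t');
```

Either relation holds one full record per title. Only dedupedByViews is stored here; store dedupedWithTop instead if the TOP variant is preferred.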



Source: https://stackoverflow.com/questions/49952446/how-to-delete-the-rows-of-data-which-is-repeating-in-pig
