Question
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 137843120 3014479 1602383 817582
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 125431369 2912715 1545018 807558
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 113876217 2811217 1470387 787174
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 100911567 2656678 1353655 682890
"Marvel Studios' Avengers: Infinity War Official Trailer" 89930713 2606665 53011 347982
"Marvel Studios' Avengers: Infinity War Official Trailer" 87450245 2584675 52176 341571
"Marvel Studios' Avengers: Infinity War Official Trailer" 84281319 2555414 51008 339708
"Marvel Studios' Avengers: Infinity War Official Trailer" 80360459 2513103 49170 335920
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 75969469 2251826 1127811 827755
"Marvel Studios' Avengers: Infinity War Official Trailer" 74789251 2444960 46172 330710
"Marvel Studios' Avengers: Infinity War Official Trailer" 66637636 2331359 41154 316185
"Marvel Studios' Avengers: Infinity War Official Trailer" 56367282 2157741 34078 303178
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 52611730 1891822 884963 702784
"To Our Daughter" 51243149 0 0 0
"To Our Daughter" 48635732 0 0 0
The data above has five columns: "title", followed by views, likes, dislikes, and comment_count.
How can I filter out the repeated rows? I want to remove the rows that share the same "title" and keep only the row with the highest views.
Answer 1:
The data has 15 columns; I selected the 5 mentioned in the question. 'select_specified_columns' is the relation holding the table shown above. I did it using the following commands:
-- keep only title ($0) and views ($1)
only_likes = foreach select_specified_columns generate $0,$1;
store only_likes into 'test/only_likes';
-- collect all rows with the same title into one bag
group_likes = group only_likes by $0;
store group_likes into 'test/group_likes';
-- take the highest views value in each title's bag
max_likes = foreach group_likes generate group , MAX(only_likes.views);
store max_likes into 'test/max_likes';
-- sort the titles by their maximum views, highest first
result_likes = order max_likes by $1 DESC;
store result_likes into 'test/result_likes';
Answer 2:
If you want to retain all fields of the record with the maximum views for each title, you would have to do something like this:
dataAll = LOAD 'path' USING PigStorage('\t') AS (title:chararray, views:long, likes:long, dislikes:long, comment_count:long);
--group the data by title so that all records belonging to a title fall into a bag in the same record
dataGrouped = GROUP dataAll BY title;
--Using a nested foreach, order the contents of the bag by views and pick the top record
dataDeduped = FOREACH dataGrouped {
    sortedByViews = ORDER dataAll BY views DESC;
    maxViewsRecord = LIMIT sortedByViews 1;
    GENERATE FLATTEN(maxViewsRecord);
}
STORE dataDeduped INTO 'outputPath' USING PigStorage('\t');
A nested FOREACH is very useful in such situations. Read more about it here: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (search for "nested foreach" on that page).
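As a possible alternative to sorting inside a nested FOREACH, Pig's built-in TOP function can select the row with the highest views per title in a single GENERATE. This is only a sketch: it assumes the same placeholder 'path', tab delimiter, and five-column schema as the script above, and it relies on views being the second field (column index 1, counting from 0). If several rows of a title tie on views, only one of them is kept.

```pig
-- Assumed input: same placeholder path, delimiter, and schema as the script above
dataAll = LOAD 'path' USING PigStorage('\t') AS (title:chararray, views:long, likes:long, dislikes:long, comment_count:long);
-- One bag of rows per title
dataGrouped = GROUP dataAll BY title;
-- TOP(n, column_index, bag): keep the 1 tuple with the largest value in column 1 (views)
dataDeduped = FOREACH dataGrouped GENERATE FLATTEN(TOP(1, 1, dataAll));
STORE dataDeduped INTO 'outputPath' USING PigStorage('\t');
```

For the sample in the question, this should leave one row per title, for example the "YouTube Rewind" row with 137843120 views.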
Source: https://stackoverflow.com/questions/49952446/how-to-delete-the-rows-of-data-which-is-repeating-in-pig