The table is:
create table test (
id string,
name string,
age string,
modified string)
Presume the data is like this:
id name age modified
1 a 10 2011-11-11 11:11:11
1 a 11 2012-11-11 12:00:00
2 b 23 2012-12-10 10:11:12
2 b 21 2012-12-10 10:11:12
2 b 22 2012-12-15 10:11:12
2 b 20 2012-12-15 10:11:12
Then the above query returns the following (notice the repeated rows for id 2, name b, which share the same timestamp):
1 a 11 2012-11-11 12:00:00
2 b 22 2012-12-15 10:11:12
2 b 20 2012-12-15 10:11:12
This query runs an additional group by and is less efficient, but it returns a single row per id:
select b.id, collect_set(b.name)[0], collect_set(b.age)[0], b.modified
from
(select id, max(modified) as modified from test group by id) a
left outer join
test b
on
(a.id=b.id and a.modified=b.modified)
group by
b.id, b.modified;
The above query then returns:
1 a 11 2012-11-11 12:00:00
2 b 20 2012-12-15 10:11:12
If we improve the query a little, it runs only one MapReduce job in place of three, still returning one row per id:
select id, collect_set(name)[0], collect_set(age)[0], max(modified)
from test
group by id;
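Note: this will slow down if the group by field produces large groups. Also be aware that collect_set(age)[0] picks an arbitrary age from all rows of an id, not necessarily the age from the row with the latest modified value.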
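If the whole latest row per id is needed, rather than values picked independently per column, a window function is one option. The following is only a minimal sketch, assuming a Hive version with windowing support (0.11 or later); the alias rn and the subquery name t are illustrative and not part of the original queries.

```sql
-- Sketch: keep exactly one row per id, the one with the latest modified value.
-- Assumes Hive 0.11+ (window functions); rn and t are illustrative names.
select id, name, age, modified
from (
  select id, name, age, modified,
         -- number the rows of each id from newest to oldest
         row_number() over (partition by id order by modified desc) as rn
  from test
) t
where rn = 1;
```

When several rows of an id share the same latest modified value, row_number() keeps one of them without a guaranteed choice. Adding a second sort key (for example, order by modified desc, age desc) would make the pick deterministic, keeping in mind that age is declared as a string here and so sorts lexically.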