PostgreSQL GROUP BY for multiple lines

Question

I have a table named hr_holidays_by_calendar. I just want to filter out the rows where the same employee has two leaves on the same day.

Table hr_holidays_by_calendar:

Query I tried (it was nowhere near solving this):

select hol1.employee_id, hol1.leave_date, hol1.no_of_days, hol1.leave_state
from hr_holidays_by_calendar hol1
inner join
    (select employee_id, leave_date 
    from hr_holidays_by_calendar hol1
    group by employee_id, leave_date 
    having count(*)>1)sub
on hol1.employee_id=sub.employee_id and hol1.leave_date=sub.leave_date
where hol1.leave_state != 'refuse'
order by hol1.employee_id, hol1.leave_date

Answer 1:

This returns all rows where a duplicate exists:

SELECT employee_id, leave_date, no_of_days, leave_state
FROM   hr_holidays_by_calendar h
WHERE  EXISTS (
   SELECT                         -- select list can be empty for this
   FROM   hr_holidays_by_calendar
   WHERE  employee_id = h.employee_id
   AND    leave_date = h.leave_date
   AND    leave_state <> 'refuse'
   AND    ctid <> h.ctid
   )
AND    leave_state <> 'refuse'
ORDER  BY employee_id, leave_date;

It's unclear where leave_state <> 'refuse' should be applied. You would have to define requirements. My example ignores rows with leave_state = 'refuse' (and leave_state IS NULL with it!) completely.
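If rows with a NULL leave_state should count as active leave instead of being dropped, one option is the null-safe comparison IS DISTINCT FROM. A sketch of that variant, assuming that is the requirement (the question does not say):

```sql
-- Assumption: a NULL leave_state means "not refused", so such rows are kept.
SELECT employee_id, leave_date, no_of_days, leave_state
FROM   hr_holidays_by_calendar h
WHERE  EXISTS (
   SELECT 1                       -- any select list works inside EXISTS
   FROM   hr_holidays_by_calendar
   WHERE  employee_id = h.employee_id
   AND    leave_date = h.leave_date
   AND    leave_state IS DISTINCT FROM 'refuse'  -- true for NULL as well
   AND    ctid <> h.ctid
   )
AND    leave_state IS DISTINCT FROM 'refuse'
ORDER  BY employee_id, leave_date;
```

Unlike <>, IS DISTINCT FROM never evaluates to NULL, so rows with an unset state pass the filter.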

ctid is a poor man's surrogate for your undeclared (undefined?) primary key.

Related:

How do I (or can I) SELECT DISTINCT on multiple columns?
What is easier to read in EXISTS subqueries?

Answer 2:

I assume you just need to reverse your logic. You could use NOT EXISTS:

select h1.employee_id, h1.leave_date, h1.no_of_days, h1.leave_state
from hr_holidays_by_calendar h1
where h1.leave_state <> 'refuse'
  and not exists (
    select 1
    from hr_holidays_by_calendar h2
    where h1.employee_id = h2.employee_id
      and h1.leave_date = h2.leave_date
    group by h2.employee_id, h2.leave_date
    having count(*) > 1
  )

This will discard every (employee, date) pair where they have more than one row (leave on the same day).

I did not take the number of days into account, since that seems to be wrong anyway: you can't have two leaves on the same day that last for a different number of days. If your application allows it, consider applying additional logic. Also, you shouldn't let these records get into the table in the first place :-)
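One way to keep such records out of the table might be a partial unique index. This is only a sketch: the index name is made up, it assumes hr_holidays_by_calendar is an ordinary table (not a view), and it assumes refused leaves are allowed to overlap with other rows.

```sql
-- Hypothetical index name; allows at most one non-refused leave
-- per employee per day.
CREATE UNIQUE INDEX hr_holidays_one_leave_per_day
ON hr_holidays_by_calendar (employee_id, leave_date)
WHERE leave_state <> 'refuse';
```

Creating the index fails while duplicates still exist, so the existing rows would have to be cleaned up first. Rows with a NULL leave_state do not satisfy the predicate and are therefore not covered by it.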

Answer 3:

I believe a simple GROUP BY can do the job for you:

select hol1.employee_id, hol1.leave_date, max(hol1.no_of_days)
from hr_holidays_by_calendar hol1
where hol1.leave_state != 'refuse'
group by hol1.employee_id, hol1.leave_date

It is not clear what should happen if two rows have different no_of_days.
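To see whether that case occurs at all, a query along these lines could list the (employee, date) pairs whose rows disagree on no_of_days. It is a sketch that keeps the same leave_state filter as above:

```sql
-- Lists pairs where at least two rows carry different no_of_days values.
SELECT employee_id, leave_date,
       min(no_of_days) AS min_days,
       max(no_of_days) AS max_days
FROM   hr_holidays_by_calendar
WHERE  leave_state <> 'refuse'
GROUP  BY employee_id, leave_date
HAVING min(no_of_days) <> max(no_of_days);
```

If this returns no rows, taking max(no_of_days) loses nothing.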

Answer 4:

If you want the complete rows, one method uses window functions:

select hc.*
from (select hc.*, count(*) over (partition by employee_id, leave_date) as cnt
      from hr_holidays_by_calendar hc
     ) hc
where cnt >= 2;

Aggregation is appropriate if you just want the employee id and dates.
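A minimal sketch of that aggregation, returning only the employee and date of each duplicate (the cnt alias is just illustrative, and like the window query it applies no leave_state filter):

```sql
-- One row per (employee, date) pair that appears at least twice.
SELECT employee_id, leave_date, count(*) AS cnt
FROM   hr_holidays_by_calendar
GROUP  BY employee_id, leave_date
HAVING count(*) >= 2;
```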

Source: https://stackoverflow.com/questions/52659545/postgresql-group-by-for-multiple-lines

Tags

sql

postgresql

duplicates

postgresql-9.3