JPA Select latest instance for each item

后端 未结 5 705
悲&欢浪女
悲&欢浪女 2020-12-03 05:09

Let\'s say I have a Meeting entity. Each meeting has a single attendee and a meeting date. Within my meeting table I may have multiple meetings for each attendee, with dif

相关标签:
5条回答
  • 2020-12-03 05:19

    In SQL the solution is very simple - join the table with a subquery, which gives you the most recent meeting for each attendee:

    select * from Meeting ALL
    join ( select max(meetingDate) as newest, attendee
    from Meeting group by attendee ) LATEST
    on ALL.meetingDate = LATEST.newest AND ALL.attendee = LATEST.attendee
    

    This works, and works fast!

    The problem with JPA is that it (or most implementations) won't allow a subquery for a join. After spending several hours trying what will compile in the first place, and then, how slow it is, I decided that I hate JPA. Solutions like the ones above - like EXISTS (SELECT .. ) or IN ( SELECT .. ) - take ages to execute, orders of magnitude slower than they should.

    Having a solution that works meant that I just needed to access that solution from JPA. There are two magic words in SQL that help you do just that:

    CREATE VIEW
    

    and the life becomes so much simpler... Just define such entity and use it. Caution: it's read-only.

    Of course, any JPA purists will look down on you when you do that, so if anyone has a pure JPA solution, please let us both know!

    0 讨论(0)
  • 2020-12-03 05:25

    Well in SQL that would be quite simple I think, so I assume that can be mapped to JPA:

    SELECT m.AttendeeId, MAX(m.MeetingDate) from Meeting m GROUP BY m.AttendeeId
    

    Edit: If you need the messageId itself as well you can do that with a simple subquery that returns the messageId for a message where the other two values are equal. Just make sure you handle the case where there are several messageIds for the same Attendee and Date (eg pick the first result since they should all be equally good - although I'd doubt that such data even makes sense for meetings)

    0 讨论(0)
  • 2020-12-03 05:32

    Plain SQL

    As Bulba has said appropriate way is to join a subquery with group by.

    JPA, JPQL

    The problem is that you can't join a subquery.

    Here is a workaround.

    Lets see what you get in the subquery with group by. You get a list of pairs (attendee_id, max(meeting_date)). This pair is like a new unique id for row with max date you want to join on. Then note that each row in the table forms a pair (attendee_id, meeting_date). So every row has an id as a pair (attendee_id, meeting_date). Lets take a row if only it forms an id that belongs to list received in the subquery.

    For simplicity lets represent this id-pair as a concatenation of attendee_id and meeting_date: concat(attendee_id, meeting_date).

    Then the query in SQL(similarly for JPQL and JPA CriteriaBuilder) would be as follows:

    SELECT * FROM meetings 
    WHERE concat(attendee_id, meeting_date) IN
    (SELECT concat(attendee_id, max(meeting_date)) FROM meetings GROUP BY attendee_id)
    

    Note that there is only one subquery per query, not one subquery for each row like in some answers.

    Afraid of comparing strings?

    We have a special offer for you!

    Lets encode that id-pair to number. It will be a sum of attendee_id and meeting_date but with modifications to ensure uniqueness of code. We can take number representation of date as Unix time. We will fix the value of max date that our code can capture because final code has max value limit (e.g. bigint(int8)<263). Lets take for convenience max date as 2149-06-07 03:00:00. It equals 5662310400 in seconds and 65536 in days. I will assume here that we need precision for date in days(so we ignore hours and below). To construct unique code we can interpret it as a number in a numerical system with base of 65536. The last symbol(number from 0 to 216-1) in or code in such numerical system is number of days. Other symbols will capture attendee_id. In such interpretation code looks like XXXX, where each X is in range [0,216-1] (to be more accurate, first X is in range [0,215-1] because of 1 bit for sign), first three X represents attendee_id and last X represents meeting_date. So the max value of attendee_id our code can capture is 247-1. The code can be computed as attendee_id*65536 + "date in days".

    In postgresql it will be:

    attendee_id*65536 + date_part('epoch', meeting_date)/(60*60*24)
    

    Where date_part returns date in seconds which we convert to days by dividing on constant.

    And final query to get the latest meetings for all attendees:

    SELECT * FROM meetings
    WHERE attendee_id*65536 + date_part('epoch', meeting_date)/(60*60*24)
    IN (SELECT attendee_id*65536 + date_part('epoch', max(meeting_date))/(60*60*24) from meetings GROUP BY attendee_id);
    

    Benchmarking

    I have created a table with stucture as in the question and populated it with 100000 rows randomly selecting attendee_id from [1, 10000] and random date from range [1970-01-01, 2017-09-16]. I have benchmarked (with EXPLAIN ANALYZE) queries with the following techniques:

    1. Correlated subquery

      SELECT * FROM meetings m1 WHERE m1.meeting_date=
      (SELECT max(m2.meeting_date) FROM meetings m2 WHERE m2.attendee_id=m1.attendee_id);
      

      Execution time: 873260.878 ms

    2. Join subquery with group by

      SELECT * FROM meetings m
      JOIN (SELECT attendee_id, max(meeting_date) from meetings GROUP BY attendee_id) attendee_max_date
      ON attendee_max_date.attendee_id = m.attendee_id;</code>
      

      Execution time: 103.427 ms

    3. Use pair (attendee_id, date) as a key

      • Concat attendee_id and meeting_date as strings

        SELECT * FROM meetings WHERE concat(attendee_id, meeting_date) IN
        (SELECT concat(attendee_id, max(meeting_date)) from meetings GROUP BY attendee_id);
        

        Execution time: 207.720 ms

      • Encode attendee_id and meeting_date to a single number(code)

        SELECT * FROM meetings
        WHERE attendee_id*65536 + date_part('epoch',meeting_date)/(60*60*24)
        IN (SELECT attendee_id*65536 + date_part('epoch',max(meeting_date))/(60*60*24) from meetings GROUP BY attendee_id);
        

        Execution time: 127.595 ms

    Here is a git with table scheme, table data (as csv), code for populating table, and queries.

    0 讨论(0)
  • 2020-12-03 05:39

    Try this

    SELECT MAX(m.MeetingDate) FROM Meeting m
    
    0 讨论(0)
  • 2020-12-03 05:42

    I think I've got it with this query.

    select m from Meeting m 
        where m.meetingDate = 
            (select max(m1.meetingDate) 
                from Meeting m1 
                where m1.attendee = m.attendee )
        and not exists 
            (select m2 from Meeting m2 
                where m2.attendee = m.attendee 
                and m2.meetingDate > m.meetingDate)
    
    0 讨论(0)
提交回复
热议问题