I\'d love some some help handling a strange edge case with a paginated API I\'m building.
Like many APIs, this one paginates large results. If you query /foos, you\'
You have several problems.
First, you have the example that you cited.
You also have a similar problem if rows are inserted, but in this case the user get duplicate data (arguably easier to manage than missing data, but still an issue).
If you are not snapshotting the original data set, then this is just a fact of life.
You can have the user make an explicit snapshot:
POST /createquery
filter.firstName=Bob&filter.lastName=Eubanks
Which results:
HTTP/1.1 301 Here's your query
Location: http://www.example.org/query/12345
Then you can page that all day long, since it's now static. This can be reasonably light weight, since you can just capture the actual document keys rather than the entire rows.
If the use case is simply that your users want (and need) all of the data, then you can simply give it to them:
GET /query/12345?all=true
and just send the whole kit.
Option A: Keyset Pagination with a Timestamp
In order to avoid the drawbacks of offset pagination you have mentioned, you can use keyset based pagination. Usually, the entities have a timestamp that states their creation or modification time. This timestamp can be used for pagination: Just pass the timestamp of the last element as the query parameter for the next request. The server, in turn, uses the timestamp as a filter criterion (e.g. WHERE modificationDate >= receivedTimestampParameter
)
{
"elements": [
{"data": "data", "modificationDate": 1512757070}
{"data": "data", "modificationDate": 1512757071}
{"data": "data", "modificationDate": 1512757072}
],
"pagination": {
"lastModificationDate": 1512757072,
"nextPage": "https://domain.de/api/elements?modifiedSince=1512757072"
}
}
This way, you won't miss any element. This approach should be good enough for many use cases. However, keep the following in mind:
You can make those drawbacks less likely by increasing the page size and using timestamps with millisecond precision.
Option B: Extended Keyset Pagination with a Continuation Token
To handle the mentioned drawbacks of the normal keyset pagination, you can add an offset to the timestamp and use a so-called "Continuation Token" or "Cursor". The offset is the position of the element relative to the first element with the same timestamp. Usually, the token has a format like Timestamp_Offset
. It's passed to the client in the response and can be submitted back to the server in order to retrieve the next page.
{
"elements": [
{"data": "data", "modificationDate": 1512757070}
{"data": "data", "modificationDate": 1512757072}
{"data": "data", "modificationDate": 1512757072}
],
"pagination": {
"continuationToken": "1512757072_2",
"nextPage": "https://domain.de/api/elements?continuationToken=1512757072_2"
}
}
The token "1512757072_2" points to the last element of the page and states "the client already got the second element with the timestamp 1512757072". This way, the server knows where to continue.
Please mind that you have to handle cases where the elements got changed between two requests. This is usually done by adding a checksum to the token. This checksum is calculated over the IDs of all elements with this timestamp. So we end up with a token format like this: Timestamp_Offset_Checksum
.
For more information about this approach check out the blog post "Web API Pagination with Continuation Tokens". A drawback of this approach is the tricky implementation as there are many corner cases that have to be taken into account. That's why libraries like continuation-token can be handy (if you are using Java/a JVM language). Disclaimer: I'm the author of the post and a co-author of the library.
It may be tough to find best practices since most systems with APIs don't accommodate for this scenario, because it is an extreme edge, or they don't typically delete records (Facebook, Twitter). Facebook actually says each "page" may not have the number of results requested due to filtering done after pagination. https://developers.facebook.com/blog/post/478/
If you really need to accommodate this edge case, you need to "remember" where you left off. jandjorgensen suggestion is just about spot on, but I would use a field guaranteed to be unique like the primary key. You may need to use more than one field.
Following Facebook's flow, you can (and should) cache the pages already requested and just return those with deleted rows filtered if they request a page they had already requested.
Just to add to this answer by Kamilk : https://www.stackoverflow.com/a/13905589
Depends a lot on how large dataset you are working on. Small data sets do work on effectively on offset pagination but large realtime datasets do require cursor pagination.
Found a wonderful article on how Slack evolved its api's pagination as there datasets increased explaining the positives and negatives at every stage : https://slack.engineering/evolving-api-pagination-at-slack-1c1f644f8e12
I'm not completely sure how your data is handled, so this may or may not work, but have you considered paginating with a timestamp field?
When you query /foos you get 100 results. Your API should then return something like this (assuming JSON, but if it needs XML the same principles can be followed):
{
"data" : [
{ data item 1 with all relevant fields },
{ data item 2 },
...
{ data item 100 }
],
"paging": {
"previous": "http://api.example.com/foo?since=TIMESTAMP1"
"next": "http://api.example.com/foo?since=TIMESTAMP2"
}
}
Just a note, only using one timestamp relies on an implicit 'limit' in your results. You may want to add an explicit limit or also use an until
property.
The timestamp can be dynamically determined using the last data item in the list. This seems to be more or less how Facebook paginates in its Graph API (scroll down to the bottom to see the pagination links in the format I gave above).
One problem may be if you add a data item, but based on your description it sounds like they would be added to the end (if not, let me know and I'll see if I can improve on this).
If you've got pagination you also sort the data by some key. Why not let API clients include the key of the last element of the previously returned collection in the URL and add a WHERE
clause to your SQL query (or something equivalent, if you're not using SQL) so that it returns only those elements for which the key is greater than this value?