Add pagination #279

mgirlich · 2023-08-15T07:14:28Z

Pagination support would be great to have. This PR provides basic support for pagination. It adds a general req_paginate() and three helpers for common patterns:

req_paginate_next_url(): the response contains a next url (usually in the body)
req_paginate_offset(): simple pagination with limit and offset (typically in the query)
req_paginate_next_token(): pagination via a token/cursor

I could test the first two helpers against the public Pokemon API; the last one I could only test against an API that requires authorisation.

# next url ----------------------------------------------------------------

library(tibblify)
responses <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_paginate_next_url(
    next_field = "next",
    results_field = "results",
    limit = in_query("limit", 150L),
    total_field = "count",
    max_pages = 2
  )

spec <- tspec_df(
  tib_chr("name"),
  tib_chr("url"),
)

pokemon_df <- purrr::map(
  responses,
  \(resp_body) {
    tibblify(resp_body, spec)
  }
) %>% 
  purrr::list_rbind()


# offset ------------------------------------------------------------------

responses <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_paginate_offset(
    offset = in_query("offset"),
    results_field = "results",
    limit = in_query("limit", 150L),
    total_field = "count",
    max_pages = 20
  )

pokemon_df <- purrr::map(
  responses,
  \(resp_body) {
    tibblify(resp_body, spec)
  }
) %>% 
  purrr::list_rbind()

This doesn't return the full response, only the body. I guess in most cases the body is more relevant than the full response, and this saves the user some boilerplate code. Also, some pagination patterns require parsing the body anyway.
This currently focuses on a JSON body for simplicity, but that should already cover the most typical case for pagination.
This doesn't support returning intermediate results, e.g. in case of an error (for interactive use it would also be nice to get intermediate results after a user interrupt).

mgirlich · 2023-08-15T07:15:45Z

@hadley Would be great to get feedback on this so that we can add some basic pagination support to httr2 soon 😄

hadley · 2023-08-15T19:05:55Z

Looking at this very quickly, I wonder if we should start a bit lower level so if needed you get greater control over the process. Maybe something like this?

library(tibblify)
req <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_paginate(
    next_field = "next",
    results_field = "results",
    limit = in_query("limit", 150L),
    total_field = "count"
  )

repeat({
  resp <- req_perform(req)
  req <- resp_next_request(resp)
  if (is.null(req)) break
})

Once we have that in place, it would be trivial to build these one-size-fits-most helpers on top.

I'm not sure what req_paginate() should look like, but I suspect it should probably start with a callback that takes a request and a response, so you can modify the request using bits from the response.
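
To make the callback idea concrete, here is a minimal base R sketch of a loop driven by a function that takes the request and the response and returns the next request. Everything in it is an assumption for illustration: requests and responses are plain lists, and `fake_perform()`, `next_request()` and `paginate()` are hypothetical names, not httr2 functions.

```r
# Hypothetical stand-in for an HTTP call: returns a canned "response" per page.
# A real implementation would perform the request over the network instead.
fake_perform <- function(req) {
  pages <- list(
    page1 = list(results = c("bulbasaur", "ivysaur"), `next` = "page2"),
    page2 = list(results = c("venusaur"), `next` = NULL)
  )
  pages[[req$url]]
}

# Callback: takes the request and the response and returns the request for
# the next page, or NULL when there is no next page.
next_request <- function(req, resp) {
  next_url <- resp[["next"]]
  if (is.null(next_url)) {
    return(NULL)
  }
  req$url <- next_url
  req
}

# Generic loop: keep performing requests until the callback returns NULL
# or max_pages responses have been collected.
paginate <- function(req, perform, next_request, max_pages = 20L) {
  responses <- list()
  while (!is.null(req) && length(responses) < max_pages) {
    resp <- perform(req)
    responses[[length(responses) + 1L]] <- resp
    req <- next_request(req, resp)
  }
  responses
}

responses <- paginate(list(url = "page1"), fake_perform, next_request)
```

The one-size-fits-most helpers would then only differ in the `next_request` callback they supply.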

mgirlich · 2023-08-18T11:33:15Z

Good point about the lower-level version. The API now consists of:

req_paginate() which adds pagination policies to the request
req_paginate_next_url(), req_paginate_offset(), req_paginate_next_token() which are helpers for special cases
paginate_next_request() to create the request to the next page
paginate_n_pages() to calculate the total number of pages
paginate_perform() to perform the pagination requests and return a list of responses

There are also three functions needed that specify where in the request a parameter is set:

in_query(): add a parameter as query
in_body(): add a parameter in the body
in_header(): not sure if this is actually needed but I added it for the sake of completeness

They are needed in two ways:

request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_paginate_offset(
    offset = in_query("offset"),
    page_size = in_query("limit", 150L),
    total = "count"
  )

The page size often isn't needed and could also be set before by the user. The advantage of having that in req_paginate() is that the total number of available pages can be calculated (in case the response contains the total number of elements).
The offset needs to be manipulated on every page.
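
As a rough illustration of why the offset has to change on every page, the following sketch derives the offsets from a total and a page size and applies each one to a copy of the request. Requests are modelled as plain lists, the total of 400 is made up, and `page_offsets()` and `set_offset()` are hypothetical helpers rather than part of this PR.

```r
# Offsets of all pages, given the total number of elements and the page size.
page_offsets <- function(total, page_size) {
  if (total <= 0) {
    return(numeric(0))
  }
  seq(from = 0, to = total - 1, by = page_size)
}

# Hypothetical request modifier: stores the offset in the query parameters.
set_offset <- function(req, offset) {
  req$query$offset <- offset
  req
}

# A request modelled as a plain list (assumption, not an httr2 request).
req <- list(url = "https://example.com/items", query = list(limit = 150))

offsets <- page_offsets(total = 400, page_size = 150)
page_requests <- lapply(offsets, function(offset) set_offset(req, offset))
```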

mgirlich · 2023-08-18T11:41:09Z

R/paginate.R

+req_paginate <- function(req,
+                         next_request,
+                         page_size = NULL,
+                         total = NULL) {


Probably the total interface isn't flexible enough. The response might contain a) the number of elements (what I assume in this implementation) or b) the number of pages.
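
One possible way to cover both cases, sketched under the assumption that the caller states which kind of total the API reports. `resolve_n_pages()` and its `total_is` argument are hypothetical and not part of this PR.

```r
# Number of pages, given a total that counts either elements or pages.
resolve_n_pages <- function(total,
                            page_size = NULL,
                            total_is = c("elements", "pages")) {
  total_is <- match.arg(total_is)
  if (total_is == "pages") {
    # The API already reports the number of pages.
    return(as.integer(total))
  }
  if (is.null(page_size)) {
    stop("`page_size` is required when `total` counts elements.", call. = FALSE)
  }
  as.integer(ceiling(total / page_size))
}

from_elements <- resolve_n_pages(1000, page_size = 150)
from_pages <- resolve_n_pages(7, total_is = "pages")
```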

mgirlich · 2023-08-18T11:46:23Z

R/paginate.R

+  check_character(next_url)
+
+  next_request <- function(req, resp) {
+    body_parsed <- resp_body_json(resp)


It would be nice to cache the parsed body. Otherwise the body is usually parsed twice.
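
A minimal sketch of such a cache, assuming each response carries a unique `id` to key on. Responses are plain lists here, and `make_cached_parser()` and `slow_parse()` are hypothetical names.

```r
# Wrap a parser so that each response is parsed at most once.
make_cached_parser <- function(parse) {
  cache <- new.env(parent = emptyenv())
  function(resp) {
    key <- resp$id
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, parse(resp), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Hypothetical parser that counts how often it is actually called.
n_parses <- 0L
slow_parse <- function(resp) {
  n_parses <<- n_parses + 1L
  list(`next` = resp$next_url, count = resp$count)
}

parse_cached <- make_cached_parser(slow_parse)
resp <- list(id = "page-1", next_url = "page-2", count = 300)

# Two callbacks need the parsed body, but the parser only runs once.
next_url <- parse_cached(resp)[["next"]]
total <- parse_cached(resp)$count
```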

mgirlich · 2023-08-18T11:56:43Z

R/paginate.R

+#'
+#' responses <- paginate_perform(req_pokemon)
+paginate_perform <- function(req,
+                             max_pages = 20L,


It probably makes sense to add a callback, data, that is applied to the response and returns the data to store.

hadley · 2023-08-23T20:47:45Z

R/paginate.R

+#' @export
+#'
+#' @examples
+#' in_query("start", 20)


I wonder if we need these helpers. It's not too awful to supply an anonymous function if there's an existing response helper:

page_size = \(resp) resp_url_query(resp, "limit", 150L),
next_page = \(resp) resp_link_url(resp, "next")

Then we'd just need something extra in resp_body_json() that let you drill down to a specific component.

(That makes me wonder if resp_link_url() should actually be resp_header_link())
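
For illustration, drilling down to a specific component of a parsed body could look roughly like this in base R. `pluck_path()` is a hypothetical helper and the body is a made-up nested list. It is not an existing argument of resp_body_json().

```r
# Walk a nested list along a path of names; return NULL if any step is missing.
pluck_path <- function(x, path) {
  for (name in path) {
    if (is.null(x)) {
      return(NULL)
    }
    x <- x[[name]]
  }
  x
}

# Made-up parsed JSON body with the next URL nested two levels deep.
body <- list(
  meta = list(paging = list(`next` = "https://example.com/?page=2"))
)

found <- pluck_path(body, c("meta", "paging", "next"))
missing_value <- pluck_path(body, c("meta", "missing", "next"))
```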

jonthegeek · 2023-08-25T12:50:13Z

This will be sooooooooo useful!

Is there anything I can do to help? If it's ready to kick the tires, I can try it out on a few APIs.

mgirlich · 2023-08-31T11:08:36Z

I removed the in_query() etc. helpers and the API now looks like this:

library(tibblify)


# next url ----------------------------------------------------------------

req_next_url <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_url_query(limit = 150) %>% 
  req_paginate_next_url(
    next_url = \(resp) resp_body_json(resp)[["next"]],
    n_pages = \(resp) {
      calculate_n_pages(
        page_size = 150,
        total = resp_body_json(resp)$count
      )
    }
  )

responses_next_url <- paginate_perform(req_next_url)

req_next_url_header <- request("https://api.github.com/repos/octocat/Spoon-Knife/issues") %>% 
  req_paginate_next_url(
    next_url = \(resp) resp_link_url(resp, "next")
  )
responses_next_url_header <- paginate_perform(req_next_url_header, max_pages = 3)


# offset ------------------------------------------------------------------

req_offset <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_url_query(limit = 150) %>% 
  req_paginate_offset(
    offset = \(req, offset) req_url_query(req, offset = offset),
    page_size = 150,
    n_pages = \(resp) {
      calculate_n_pages(
        page_size = 150,
        total = resp_body_json(resp)$count
      )
    }
  )

responses_offset <- paginate_perform(req_offset)

The last example shows why it can be useful to have the in_query() and friends helpers. The page size was specified 3 times.

As mentioned above by @hadley it would be nice to have a helper to extract some information from the body, e.g. a new argument to resp_body_*(). But I think this should be done in a separate PR.

It would also be nice to cache the parsed body, as the parsing can be relatively expensive for bigger responses. I think this should also be done in a separate PR.

mgirlich · 2023-08-31T13:22:15Z

> This will be sooooooooo useful!
>
> Is there anything I can do to help? If it's ready to kick the tires, I can try it out on a few APIs.

I would be happy if you try it out and provide feedback.

jonthegeek · 2023-08-31T14:09:17Z

> I would be happy if you try it out and provide feedback.

I'll give it a try later today!

hadley · 2023-08-31T20:02:47Z

@mgirlich looking at the interface now, I'd say in a future PR we should add an argument for parsing the body so the interface could look something like this:

req_next_url <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_url_query(limit = 150) %>% 
  req_paginate_next_url(
    body = \(resp) resp_body_json(resp),
    next_url = \(resp, body) body[["next"]],
    n_pages = \(resp, body) ceiling(body$count / page_size)
  )

I'd suggest we don't export calculate_n_pages() for now, and similarly worry about the repeated page_size later.
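
A rough sketch of how such an interface could dispatch its callbacks: the body is parsed once and handed to the other callbacks, and the page size is defined in a single place. Responses are plain lists and `run_callbacks()` is a hypothetical name, so this shows the idea rather than the eventual httr2 implementation.

```r
# Parse the body once, then pass the response and the parsed body to each callback.
run_callbacks <- function(resp, body, next_url, n_pages) {
  parsed <- body(resp)
  list(
    next_url = next_url(resp, parsed),
    n_pages = n_pages(resp, parsed)
  )
}

# The page size is defined once and reused by the callback below.
page_size <- 150

# Made-up response whose "raw" field stands in for an already decoded body.
resp <- list(raw = list(`next` = "page2", count = 1000))

out <- run_callbacks(
  resp,
  body = function(resp) resp$raw,
  next_url = function(resp, body) body[["next"]],
  n_pages = function(resp, body) ceiling(body$count / page_size)
)
```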

hadley · 2023-08-31T20:05:10Z

R/paginate.R

+#' @param set_token A function that applies that applies the new token to the
+#'   request. It takes two arguments: a [request] and the new token.
+#' @param next_token A function that extracts the next token from the [response].
+#' @param n_pages A function that extracts the next token from the [response].


I think this is out of date?

hadley · 2023-08-31T20:05:39Z

R/paginate.R

+  )
+}
+
+#' Perform a paginated request


Should we document this with req_paginate()?

hadley · 2023-08-31T20:06:21Z

R/paginate.R

+#'
+#' responses <- paginate_perform(req_pokemon)
+paginate_perform <- function(req,
+                             max_pages = 20L,


Agreed, but let's do that in the next PR.

hadley · 2023-08-31T20:08:06Z

R/paginate.R

+  if (!req_policy_exists(req, "paginate")) {
+    cli::cli_abort(c(
+      "{.arg req} doesn't have a pagination policy",
+      i = "You can add pagination via `req_paginate()`."
+    ))
+  }


Maybe move this check to paginate_perform?

I thought it makes sense to also export paginate_next_request() so I left the check there but also added one to paginate_req_perform().

hadley · 2023-08-31T20:10:45Z

R/req-body.R

@@ -216,7 +216,8 @@ req_body_apply <- function(req) {
  } else if (type == "raw") {
    req <- req_body_apply_raw(req, data)
  } else if (type == "json") {
-    content_type <- "application/json"
+    # FIXME temporary workaround just for testing purposes. Remove before merging!


Need to fix now?

jonthegeek · 2023-08-31T21:45:43Z

I got it to work with Slack 🎉🎉🎉

Some comments:

It took me a while to find paginate_perform(). If we need a separate perform(), req_perform() should ideally tell me that I have unused pagination info and point me to the proper performer.
The n_pages help is the next_token help. I thought it might be my issue at first and wasn't sure what was supposed to go there.
Multipage response helpers would be helpful, but I'm still wrapping my head around whether they make sense (since the way to combine results probably depends a lot on the particular API).

Overall this worked great, it just could use some documentation tweaks. I'm very happy to see this in action!

mgirlich · 2023-09-01T07:41:37Z

> It took me a while to find paginate_perform(). If we need a separate perform(), req_perform() should ideally tell me that I have unused pagination info and point me to the proper performer.

Informing the user in req_perform() sounds nice, but it would also require an extra argument to req_perform() to switch off the message in case you actually want to use req_perform(). For now I simply pointed to paginate_req_perform() more clearly in the documentation.
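
The trade-off can be sketched as follows: a warning when a request with a pagination policy is performed directly, plus an argument to switch the warning off. Requests are plain lists with a `policies` element, and `perform_checked()` and its `warn_paginated` argument are hypothetical, not the behaviour of req_perform().

```r
# Perform a single request, warning if it carries an unused pagination policy.
perform_checked <- function(req, perform, warn_paginated = TRUE) {
  if (warn_paginated && !is.null(req$policies$paginate)) {
    warning(
      "`req` has a pagination policy; only the first page is requested.",
      call. = FALSE
    )
  }
  perform(req)
}

# Made-up request and performer for illustration.
req <- list(url = "page1", policies = list(paginate = list(max_pages = 20L)))
fake_perform <- function(req) list(status = 200L)

# With the default, the warning is raised; here its message is captured.
msg <- tryCatch(
  perform_checked(req, fake_perform),
  warning = function(w) conditionMessage(w)
)

# Opting out performs the request silently.
quiet <- perform_checked(req, fake_perform, warn_paginated = FALSE)
```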

> Multipage response helpers would be helpful, but I'm still wrapping my head around whether they make sense (since the way to combine results probably depends a lot on the particular API).

Let's tackle that in a separate PR. I guess at least adding an extra callback for processing the response would make sense.
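
A minimal sketch of what such a processing callback might look like, assuming responses are plain lists whose `results` field holds one record per element. `collect_pages()` and its `data` argument are hypothetical names, and how the pages are combined is left to the caller because it depends on the API.

```r
# Apply a user-supplied callback to every response and return the results.
collect_pages <- function(responses, data = identity) {
  lapply(responses, data)
}

# Made-up responses for two pages.
responses <- list(
  list(results = list(
    list(name = "bulbasaur", url = "u1"),
    list(name = "ivysaur", url = "u2")
  )),
  list(results = list(
    list(name = "venusaur", url = "u3")
  ))
)

# Callback: turn the records of one page into a data frame.
page_to_df <- function(resp) {
  rows <- lapply(resp$results, as.data.frame, stringsAsFactors = FALSE)
  do.call(rbind, rows)
}

# Combining the pages is API specific; here the rows are simply stacked.
pages <- collect_pages(responses, data = page_to_df)
combined <- do.call(rbind, pages)
```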

mgirlich added 2 commits August 15, 2023 06:45

Add workaround for req_body_json() and content type

75de903

Add req_paginate()

24997de

mgirlich force-pushed the paginate branch from edaacf6 to 24997de Compare August 15, 2023 07:14

mgirlich changed the title ~~Pagination support~~ Draft: Pagination support Aug 15, 2023

Fix workaround

8d44aec

mgirlich added 3 commits August 17, 2023 05:27

WIP

f327c33

Make req_paginate() more lower level

43f350a

Quick documentation

66fd240

Export in_query(), in_header(), and in_body()

1fe5359

Refactor

134b27e

hadley reviewed Aug 23, 2023

mgirlich added 2 commits August 31, 2023 08:35

Change interface to anonymous functions

a940b0b

Fix documentation

daf53c6

Actually check arguments in check_function2()

76b08b1

mgirlich added 2 commits August 31, 2023 13:24

Add standalone cli

600c9da

No need for standalone cli

277a4dc

This was referenced Aug 31, 2023

Consider renaming resp_link_url() to resp_header_link() #296

Closed

Add argument args to check_function() r-lib/rlang#1652

Open

Add some basic tests

c2d708d

hadley approved these changes Aug 31, 2023

mgirlich added 15 commits September 1, 2023 05:03

Remove calculate_n_pages()

24de724

Improve documentation for req_paginate()

92b38dd

Rename to paginate_req_perform()

ec09637

Link to *_req_perform() from req_perform()

61a33ae

Export paginate_next_request()

aa2d77f

Check for pagination policy in paginate_req_perform()

ba3c937

Simplify req_paginate_offset()

f0063ed

Store offset in request

01a509e

Fix example for paginate_req_perform()

39252aa

Rename to req_paginate_token()

a35ee5e

Kind of support an infinite amount of pages

e5209cf

Add more tests

b3aef39

Fix test

e54a904

Avoid modern R syntax

1427759

More documentation tweaks

ddef0a6

mgirlich added 3 commits September 1, 2023 07:44

Add pagination to pkgdown yaml

f4b1766

Remove workaround

cf1f435

Fix pkgdown

9d87912

mgirlich merged commit d9696b0 into main Sep 1, 2023
12 checks passed

mgirlich changed the title ~~Draft: Pagination support~~ Add pagination Sep 1, 2023

mgirlich deleted the paginate branch September 1, 2023 08:12

This was referenced Sep 1, 2023

Add argument body to req_paginate() #297

Closed

Pagination #8

Closed

asadow mentioned this pull request Oct 10, 2023

Add warning to paginated request that was performed by req_perform() and not paginate_req_perform() #336

Closed
