Add pagination #279

Merged
merged 32 commits into from
Sep 1, 2023

mgirlich
Collaborator

Pagination support would be great to have. This PR provides basic support for pagination. It adds a general req_paginate() and three helpers for common patterns:

  • req_paginate_next_url(): the response contains a next url (usually in the body)
  • req_paginate_offset(): simple pagination with limit and offset (typically in the query)
  • req_paginate_next_token(): pagination via a token/cursor

I could test the first two helpers via the public Pokemon API, the last one only via an API that needs authorisation.
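Because the token/cursor pattern cannot be shown against a public API here, the following is only an offline simulation of the idea in base R. `fake_api()` and `collect_by_token()` are hypothetical names made up for this illustration and are not part of httr2 or of this PR.

```r
# Offline simulation of token/cursor pagination; no HTTP is involved.
# `fake_api()` is a hypothetical stand-in for a real endpoint: it returns one
# page of results plus the token for the next page (NULL on the last page).
fake_api <- function(token = NULL) {
  pages <- list(
    start = list(results = c("a", "b"), next_token = "t1"),
    t1    = list(results = c("c", "d"), next_token = "t2"),
    t2    = list(results = "e",         next_token = NULL)
  )
  key <- if (is.null(token)) "start" else token
  pages[[key]]
}

# Keep requesting pages, feeding the token from each page into the next call,
# until there is no token left or `max_pages` is reached.
collect_by_token <- function(max_pages = 10L) {
  results <- character()
  token <- NULL
  for (i in seq_len(max_pages)) {
    page <- fake_api(token)
    results <- c(results, page$results)
    token <- page$next_token
    if (is.null(token)) break
  }
  results
}

all_results <- collect_by_token()
```

The `max_pages` cap plays the same role as in the examples below: it bounds the loop even if the server keeps handing out tokens.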

# next url ----------------------------------------------------------------

library(httr2)
library(tibblify)
responses <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_paginate_next_url(
    next_field = "next",
    results_field = "results",
    limit = in_query("limit", 150L),
    total_field = "count",
    max_pages = 2
  )

spec <- tspec_df(
  tib_chr("name"),
  tib_chr("url"),
)

pokemon_df <- purrr::map(
  responses,
  \(resp_body) {
    tibblify(resp_body, spec)
  }
) %>% 
  purrr::list_rbind()


# offset ------------------------------------------------------------------

responses <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_paginate_offset(
    offset = in_query("offset"),
    results_field = "results",
    limit = in_query("limit", 150L),
    total_field = "count",
    max_pages = 20
  )

pokemon_df <- purrr::map(
  responses,
  \(resp_body) {
    tibblify(resp_body, spec)
  }
) %>% 
  purrr::list_rbind()

  • this doesn't return the full response, only the body. I guess in most cases the body is more relevant than the full response, and returning it saves the user some boilerplate. Also, some pagination patterns require parsing the body anyway.
  • this currently focuses on a JSON body for simplicity, which should already cover the most typical case for pagination.
  • this doesn't support returning intermediate results, e.g. in case of an error (for interactive use it would also be nice to get intermediate results after a user interrupt).
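
The last note mentions intermediate results. A minimal base R sketch of one way to keep the pages fetched so far when a page fails or the user interrupts could look like this; `collect_pages()` and `fetch_page` are hypothetical names, not functions from this PR.

```r
# Collect pages one by one and keep whatever was fetched if a page fails or
# the user interrupts. `fetch_page` is a hypothetical function that returns
# the (non-NULL) data for page `i` or signals an error.
collect_pages <- function(fetch_page, n_pages) {
  pages <- vector("list", n_pages)
  done <- 0L
  problem <- NULL
  tryCatch(
    {
      for (i in seq_len(n_pages)) {
        pages[[i]] <- fetch_page(i)
        done <- i
      }
    },
    # Both handlers record what went wrong instead of rethrowing, so the
    # function still returns the pages collected before the problem.
    error = function(cnd) problem <<- conditionMessage(cnd),
    interrupt = function(cnd) problem <<- "interrupted"
  )
  list(pages = pages[seq_len(done)], problem = problem)
}

# A fake fetcher that fails on the third page.
flaky <- function(i) {
  if (i == 3L) stop("server error on page 3")
  paste("page", i)
}

out <- collect_pages(flaky, n_pages = 5L)
```

Here `out$pages` holds the two pages fetched before the failure and `out$problem` holds the error message, so the caller can decide whether to retry or to work with the partial result.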

@mgirlich
Collaborator Author

@hadley Would be great to get feedback on this so that we can add some basic pagination support to httr2 soon 😄

@mgirlich mgirlich changed the title Pagination support Draft: Pagination support Aug 15, 2023
@hadley
Member

hadley commented Aug 15, 2023

Looking at this very quickly, I wonder if we should start a bit lower level so if needed you get greater control over the process. Maybe something like this?

library(httr2)
library(tibblify)
req <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_paginate(
    next_field = "next",
    results_field = "results",
    limit = in_query("limit", 150L),
    total_field = "count"
  )

repeat({
  resp <- req_perform(req)
  req <- resp_next_request(resp)
  if (is.null(req)) break
})

Once we have that in place, it would be trivial to build these one-size-fits-most helpers on top.

I'm not sure what req_paginate() should look like, but I suspect it should probably start with a callback that takes a request and a response, so you can modify the request using bits from the response.
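
To make the callback idea concrete, here is a small offline sketch in base R. Requests and responses are plain lists, and `paginate_with_callback()`, `perform()`, and `fake_server` are hypothetical stand-ins invented for this illustration; none of this is httr2's actual implementation.

```r
# A "request" here is just a list with a url; a "response" is a list with a
# body. `perform` is a hypothetical stand-in for req_perform().
paginate_with_callback <- function(req, perform, next_request,
                                   max_pages = 20L) {
  responses <- list()
  while (!is.null(req) && length(responses) < max_pages) {
    resp <- perform(req)
    responses[[length(responses) + 1L]] <- resp
    # The callback sees both the request and the response and returns the
    # next request, or NULL when there are no more pages.
    req <- next_request(req, resp)
  }
  responses
}

# Two fake pages keyed by url; the body of each page names the next url.
fake_server <- list(
  "/items?page=1" = list(body = list(items = 1:2, next_url = "/items?page=2")),
  "/items?page=2" = list(body = list(items = 3:4, next_url = NULL))
)

perform <- function(req) fake_server[[req$url]]

next_request <- function(req, resp) {
  next_url <- resp$body$next_url
  if (is.null(next_url)) return(NULL)
  req$url <- next_url
  req
}

responses <- paginate_with_callback(
  list(url = "/items?page=1"),
  perform,
  next_request
)
```

Because the callback receives the request as well as the response, the same loop can serve next-url, offset, and token styles; only `next_request` changes.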

@mgirlich
Collaborator Author

Good point with the lower level version. The api now consists of:

  • req_paginate() which adds pagination policies to the request
  • req_paginate_next_url(), req_paginate_offset(), req_paginate_next_token() which are helpers for special cases
  • paginate_next_request() to create the request to the next page
  • paginate_n_pages() to calculate the total number of pages
  • paginate_perform() to perform the pagination requests and return a list of responses

There are also three functions needed that specify where in the request a parameter is set:

  • in_query(): add a parameter as query
  • in_body(): add a parameter in the body
  • in_header(): not sure if this is actually needed but I added it for the sake of completeness

They are needed in two ways:

request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_paginate_offset(
    offset = in_query("offset"),
    page_size = in_query("limit", 150L),
    total = "count"
  )

The page size often isn't needed and could also be set beforehand by the user. The advantage of passing it to req_paginate() is that the total number of pages can then be calculated (provided the response contains the total number of elements).
The offset has to be updated on every page.
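
As a worked illustration of the arithmetic behind these two points, the sketch below derives the page count and the offset for each page from a total and a page size. `page_offsets()` is a hypothetical helper and the total of 1000 is a made-up number.

```r
# Given the total number of elements and the page size, compute the offset to
# send with each page request: 0, page_size, 2 * page_size, ...
page_offsets <- function(total, page_size) {
  n_pages <- ceiling(total / page_size)
  (seq_len(n_pages) - 1L) * page_size
}

# Made-up total of 1000 elements with 150 elements per page.
offsets <- page_offsets(total = 1000, page_size = 150)
```

With these inputs the last page is only partially filled, which is why the page count uses `ceiling()` rather than integer division.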

R/paginate.R Outdated
req_paginate <- function(req,
                         next_request,
                         page_size = NULL,
                         total = NULL) {
Collaborator Author

Probably the total interface isn't flexible enough. The response might contain a) the number of elements (what I assume in this implementation) or b) the number of pages.
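
One possible way to cover both cases is to let the caller say what the reported total counts. This is only a sketch; `n_pages_from_total()` and its `total_is` argument are hypothetical and not part of the PR.

```r
# Turn a "total" reported by an API into a number of pages. The API may
# report either the number of elements or the number of pages directly.
n_pages_from_total <- function(total, page_size = NULL,
                               total_is = c("elements", "pages")) {
  total_is <- match.arg(total_is)
  if (total_is == "pages") {
    return(as.integer(total))
  }
  if (is.null(page_size)) {
    stop("`page_size` is required when `total` counts elements.")
  }
  as.integer(ceiling(total / page_size))
}

from_elements <- n_pages_from_total(301, page_size = 150)
from_pages <- n_pages_from_total(7, total_is = "pages")
```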

R/paginate.R Outdated
check_character(next_url)

next_request <- function(req, resp) {
  body_parsed <- resp_body_json(resp)
Collaborator Author

It would be nice to cache the parsed body. Otherwise the body is usually parsed twice.
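
A rough sketch of what such caching could look like, using a closure over an environment. Everything here is an assumption for illustration: `make_cached_parser()` is a hypothetical helper, responses are plain lists, and each response is assumed to carry a unique character `id`.

```r
# Wrap an expensive parser so that each response is parsed at most once.
# `parse` is a hypothetical stand-in for something like resp_body_json().
make_cached_parser <- function(parse) {
  cache <- new.env(parent = emptyenv())
  function(resp) {
    key <- resp$id
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, parse(resp), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# A fake parser that counts how often it is actually called.
n_parses <- 0L
slow_parse <- function(resp) {
  n_parses <<- n_parses + 1L
  list(count = nchar(resp$raw))
}

parse_once <- make_cached_parser(slow_parse)
resp <- list(id = "resp-1", raw = "{\"count\": 3}")

first <- parse_once(resp)   # parses the body
second <- parse_once(resp)  # served from the cache
```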

R/paginate.R Outdated
#'
#' responses <- paginate_perform(req_pokemon)
paginate_perform <- function(req,
                             max_pages = 20L,
Collaborator Author

It probably makes sense to add a callback, data, that is applied to each response and returns the data to store.

Member

Agreed; but let's do that in the next PR.
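
For reference, the shape of such a `data` callback can be sketched offline in a few lines. `perform_and_extract()` and `fake_perform()` are hypothetical names for this illustration, and the "requests" are just strings.

```r
# Perform each request and store only what the `data` callback returns.
# By default the full response is kept.
perform_and_extract <- function(reqs, perform, data = identity) {
  lapply(reqs, function(req) data(perform(req)))
}

# A fake performer that returns a response-like list for a page id.
fake_perform <- function(req) {
  list(status = 200L, body = list(results = paste0(req, "-item"), count = 2L))
}

bodies <- perform_and_extract(
  c("p1", "p2"),
  fake_perform,
  data = function(resp) resp$body$results
)
```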

R/paginate.R Outdated
#' @export
#'
#' @examples
#' in_query("start", 20)
Member

I wonder if we need these helpers. It's not too awful to supply an anonymous function if there's an existing response helper:

page_size = \(resp) resp_url_query(resp, "limit", 150L),
next_page = \(resp) resp_link_url(resp, "next")

Then we'd just need something extra in resp_body_json() that let you drill down to a specific component.
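
What "drilling down" into a parsed body might look like, independent of any httr2 API: the helper below walks a nested list by name. `pluck_path()` is a hypothetical name, and the sketch only handles character names, not positions.

```r
# Walk a nested list along a character vector of names and return the value
# found there, or `default` if any step is missing.
pluck_path <- function(x, path, default = NULL) {
  for (name in path) {
    if (!is.list(x) || is.null(x[[name]])) return(default)
    x <- x[[name]]
  }
  x
}

# A made-up parsed JSON body.
body <- list(
  meta = list(paging = list(next_url = "https://example.com/items?page=2")),
  results = list()
)

next_url <- pluck_path(body, c("meta", "paging", "next_url"))
missing_value <- pluck_path(body, c("meta", "cursor", "token"), default = NA)
```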

Member

(That makes me wonder if resp_link_url() should actually be resp_header_link())

@jonthegeek
Contributor

This will be sooooooooo useful!

Is there anything I can do to help? If it's ready to kick the tires, I can try it out on a few APIs.

@mgirlich
Collaborator Author

mgirlich commented Aug 31, 2023

I removed the in_query() etc helpers and the api now looks like this:

library(httr2)
library(tibblify)


# next url ----------------------------------------------------------------

req_next_url <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_url_query(limit = 150) %>% 
  req_paginate_next_url(
    next_url = \(resp) resp_body_json(resp)[["next"]],
    n_pages = \(resp) {
      calculate_n_pages(
        page_size = 150,
        total = resp_body_json(resp)$count
      )
    }
  )

responses_next_url <- paginate_perform(req_next_url)

req_next_url_header <- request("https://api.github.com/repos/octocat/Spoon-Knife/issues") %>% 
  req_paginate_next_url(
    next_url = \(resp) resp_link_url(resp, "next")
  )
responses_next_url_header <- paginate_perform(req_next_url_header, max_pages = 3)


# offset ------------------------------------------------------------------

req_offset <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_url_query(limit = 150) %>% 
  req_paginate_offset(
    offset = \(req, offset) req_url_query(req, offset = offset),
    page_size = 150,
    n_pages = \(resp) {
      calculate_n_pages(
        page_size = 150,
        total = resp_body_json(resp)$count
      )
    }
  )

responses_offset <- paginate_perform(req_offset)

The last example shows why it can be useful to have the in_query() and friends helpers. The page size was specified 3 times.

As mentioned above by @hadley it would be nice to have a helper to extract some information from the body, e.g. a new argument to resp_body_*(). But I think this should be done in a separate PR.

It would also be nice to cache the parsed body, as the parsing can be relatively expensive for bigger responses. I think this should also be done in a separate PR.

@mgirlich
Collaborator Author

> This will be sooooooooo useful!
>
> Is there anything I can do to help? If it's ready to kick the tires, I can try it out on a few APIs.

I would be happy if you try it out and provide feedback.

@jonthegeek
Contributor

> I would be happy if you try it out and provide feedback.

I'll give it a try later today!

@hadley
Member

hadley commented Aug 31, 2023

@mgirlich looking at the interface now, I'd say in a future PR we should add an argument for parsing the body so the interface could look something like this:

req_next_url <- request("https://pokeapi.co/api/v2/pokemon") %>% 
  req_url_query(limit = 150) %>% 
  req_paginate_next_url(
    body = \(resp) resp_body_json(resp),
    next_url = \(resp, body) body[["next"]],
    n_pages = \(resp, body) ceiling(body$count / page_size)
  )

I'd suggest we don't expect calculate_n_pages() for now, and similarly worry about the repeated page_size later.
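
A minimal offline sketch of how a body-parsing callback could feed the other callbacks, so the body is parsed once per response. `next_page_info()` and `parse_body` are hypothetical names, the response is a plain list, and `page_size` is defined explicitly here because the sketch above leaves it open.

```r
# Parse the body once, then hand both the response and the parsed body to the
# callbacks that extract the next url and the number of pages.
next_page_info <- function(resp, parse_body, next_url, n_pages) {
  body <- parse_body(resp)
  list(
    next_url = next_url(resp, body),
    n_pages = n_pages(resp, body)
  )
}

# A fake response whose "parsed JSON" is already a list.
resp <- list(
  parsed_json = list(count = 301, "next" = "https://example.com/?offset=150")
)

page_size <- 150
info <- next_page_info(
  resp,
  parse_body = function(resp) resp$parsed_json,
  next_url = function(resp, body) body[["next"]],
  n_pages = function(resp, body) ceiling(body$count / page_size)
)
```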

R/paginate.R Outdated
#' @param set_token A function that applies that applies the new token to the
#' request. It takes two arguments: a [request] and the new token.
#' @param next_token A function that extracts the next token from the [response].
#' @param n_pages A function that extracts the next token from the [response].
Member

I think this is out of date?

)
}

#' Perform a paginated request
Member

Should we document this with req_paginate()?

R/paginate.R Outdated
Comment on lines 124 to 129
if (!req_policy_exists(req, "paginate")) {
  cli::cli_abort(c(
    "{.arg req} doesn't have a pagination policy",
    i = "You can add pagination via `req_paginate()`."
  ))
}
Member

Maybe move this check to paginate_perform?

Collaborator Author

I thought it makes sense to also export paginate_next_request() so I left the check there but also added one to paginate_req_perform().
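
One way to run the same check from two exported functions without duplicating it is a small shared helper. The sketch below uses base R only; `check_has_pagination()` is a hypothetical name, and storing the policy under `req$policies$paginate` is an assumption about the request structure made for this illustration.

```r
# Shared check: error if the request carries no pagination policy.
# `call_name` is the name of the user-facing function, used in the message.
check_has_pagination <- function(req, call_name) {
  if (is.null(req$policies$paginate)) {
    stop(
      sprintf(
        "`%s()` needs a request with a pagination policy; add one with `req_paginate()`.",
        call_name
      ),
      call. = FALSE
    )
  }
  invisible(req)
}

# Plain lists standing in for requests without and with a policy.
req_plain <- list(url = "https://example.com", policies = list())
req_paged <- list(
  url = "https://example.com",
  policies = list(paginate = list(max_pages = 3L))
)

msg <- tryCatch(
  check_has_pagination(req_plain, "paginate_next_request"),
  error = function(cnd) conditionMessage(cnd)
)
```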

R/req-body.R Outdated
@@ -216,7 +216,8 @@ req_body_apply <- function(req) {
  } else if (type == "raw") {
    req <- req_body_apply_raw(req, data)
  } else if (type == "json") {
    content_type <- "application/json"
    # FIXME temporary workaround just for testing purposes. Remove before merging!
Member

Need to fix now?

@jonthegeek
Contributor

I got it to work with Slack 🎉🎉🎉

Some comments:

  • It took me a while to find paginate_perform(). If we need a separate perform(), req_perform() should ideally tell me that I have unused pagination info and point me to the proper performer.
  • The n_pages help is the next_token help. I thought it might be my issue at first and wasn't sure what was supposed to go there.
  • Multipage response helpers would be helpful, but I'm still wrapping my head around whether they make sense (since the way to combine results probably depends a lot on the particular API).

Overall this worked great, it just could use some documentation tweaks. I'm very happy to see this in action!

@mgirlich
Collaborator Author

mgirlich commented Sep 1, 2023

> • It took me a while to find paginate_perform(). If we need a separate perform(), req_perform() should tell me that I have unused pagination info, ideally, and point me to the proper performer.

Informing the user in req_perform() sounds nice, but it would also require an extra argument to req_perform() to switch the message off in case you actually want to use req_perform(). For now I simply point to paginate_req_perform() more clearly in the documentation.

> • Multipage response helpers would be helpful, but I'm still wrapping my head around whether they make sense (since the way to combine results probably depends a lot on the particular API).

Let's tackle that in a separate PR. I guess at least adding an extra callback for processing the response would make sense.

@mgirlich mgirlich merged commit d9696b0 into main Sep 1, 2023
12 checks passed
@mgirlich mgirlich changed the title Draft: Pagination support Add pagination Sep 1, 2023
@mgirlich mgirlich deleted the paginate branch September 1, 2023 08:12