http caching for metadata #1222

Merged — melange396 merged 5 commits into dev on Jul 10, 2023
Conversation

@melange396 (Collaborator) commented Jun 28, 2023

allow and enable HTTP caching of the metadata endpoint via the "Cache-Control" header (plus related headers)
see: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control

this means clients don't have to keep making requests for the metadata when it's not going to change.

the R lib doesn't seem to consider the "Age" header, but it appears it will use the date instead, so it might sometimes use metadata that's a smidge stale, which shouldn't be a big deal.
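
For context, a rough sketch of what setting the header on the response could look like in a Flask-style endpoint; the function name, constant, and value here are illustrative, not the exact ones in this changeset:

from flask import Response

# hypothetical constant; the PR's actual name and value may differ
METADATA_MAX_AGE = 60 * 60  # seconds

def metadata_response(payload: str) -> Response:
    # "public" lets shared caches store the response; "max-age" tells clients how
    # long they may reuse it before asking the server again
    return Response(
        payload,
        mimetype="application/json",
        headers={"Cache-Control": f"max-age={METADATA_MAX_AGE}, public"},
    )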

@melange396 melange396 marked this pull request as ready for review June 28, 2023 23:34
@melange396 (Collaborator, Author):

much of this changeset is really just about piping HTTP header values through from the core endpoint work to where the Response object is created.

as is the case with a lot of the api server code, what was there already was kinda gnarly... i tried to improve much of it without breaking the evaluation order and operational flow (by improving variable names and reducing the scope of the generator to something more minimal), but it can still be improved a bunch more (particularly the logic in the generator method).

@krivard (Contributor) left a comment:

one required, couple discussions

if not metadata_list:
    # the db table has a row, but there is no metadata about any signals in it
    get_structured_logger('server_api').warning("empty entry in covidcast_meta cache")
    return printer(_nonerator())
@krivard (Contributor):

Optional: be nice to robot

if we wanted to not make sonarLint nervous we could replace this with

Suggested change
-return printer(_nonerator())
+return printer(x for x in [])

via experimentation:

>>> def foo():
...     return
...     yield
... 
>>> type(foo())
<class 'generator'>
>>> type(x for x in [])
<class 'generator'>
>>> list(foo())
[]
>>> list(x for x in [])
[]

@krivard (Contributor):

alternately (since _nonerator is used on both L52 and L59) we could refactor to pull out the printer instead of just the generator:

def _null_printer():
    return printer(x for x in [])

...
    if not metadata_list:
        # the db table has a row, but there is no metadata about any signals in it
        get_structured_logger('server_api').warning("empty entry in covidcast_meta cache")
        return _null_printer()

@melange396 (Collaborator, Author):

maybe i'm missing something, but i can't find where sonar flagged this...?

to convey intent, i prefer a named item over a weird empty comprehension just hanging out there. to that end, the _null_printer() could work. we might even want to put it into the printer module.

@krivard (Contributor) commented Jul 7, 2023:

I added an exception so it would build. the condition it fails is "no unreachable code" and the yield is unreachable.

@melange396 (Collaborator, Author):

as i said in another comment, it's reachable enough for python to consider this a generator... but regardless, i changed it to the alternative empty generator.

Comment on lines 61 to 63
standard_age = 60 * 60 # expected metadata regeneration interval, in seconds (currently 60 mins)
# TODO: get this ^ from a config var? ideally, it should be set to the time between runs of
# src/acquisition/covidcast/covidcast_meta_cache_updater.py
@krivard (Contributor):

Strongly recommended: change clarifying comment

I think this code is mixing up the metadata cache age (when did we last write to the metadata cache table) with the cache-control max-age (how long should the user hold on to this response before fetching again). We haven't cared about metadata cache age since late fall 2020, and should remove the associated checks.

For the cache-control max-age, we should consider the following factors:

  • the rate limiter resets every 1 hour
  • the meta cache is updated every 3 hours
  • new data is added only 7-8 times per day
  • the amount of new data added each day is small
  • since we are not in a public health emergency, there is no great need for up-to-the-minute accuracy, though of course it's nice to see the latest data, especially if you're trying to debug something

A 1-hour max-age is the shortest period that prevents unexpected 429s from otherwise-compliant use of the clients, given the above. It doesn't match the update cadence of the metadata cache table.

I don't expect us to need to adjust max-age until we are adding new data and recomputing metadata dozens of times a day, or we enter another public health emergency, and likely not until both are true. In any of those scenarios, making a change via a new release instead of a via a config variable is completely reasonable.

Suggested change
-standard_age = 60 * 60 # expected metadata regeneration interval, in seconds (currently 60 mins)
-# TODO: get this ^ from a config var? ideally, it should be set to the time between runs of
-# src/acquisition/covidcast/covidcast_meta_cache_updater.py
+# recommended rerequest period, in seconds (currently 60 minutes)
+## this should be updated if a stale cache will have undue impact on user activities, such as
+## if we start updating the metadata table much more frequently and having up-to-the-minute
+## metadata accuracy becomes important to users once more.
+standard_age = 60 * 60

@melange396 (Collaborator, Author):

it's not mixing up the ages: the client should hold on to the response until a subsequent fetch can be expected to return a new value, which is the time of the last write to the table plus the write interval.

it is independent of the rate limiting window, as the metadata we serve is quite literally a cache and this is just decorating it with http directives to reveal accurate timing info to the client.

fwiw, the metadata is updated every 2h (search epidata_data.logger:"compute_covidcast_meta" or epidata_data.logger:"metadata_cache_updater" in elastic), and thus standard_age should actually be set to that. new commit coming shortly...

i do like your "this should be updated..." text though, and i'll include that.
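
For illustration, a rough sketch of the header math being described; the variable names and the helper are assumptions, not necessarily what the PR uses:

import time

REGEN_INTERVAL = 2 * 60 * 60  # metadata recomputation cadence, in seconds (the 2h mentioned above)

def cache_headers(last_cache_write_epoch: float) -> dict:
    # seconds since the metadata cache table was last written
    age = int(time.time() - last_cache_write_epoch)
    # the response stays "fresh" until a re-fetch could plausibly return new metadata,
    # i.e. until (time of last write) + (regeneration interval) has passed
    return {
        "Cache-Control": f"max-age={REGEN_INTERVAL}, public",
        "Age": f"{age}",
    }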

src/server/endpoints/covidcast_meta.py (outdated comment thread, resolved)
filter_fields(cache_entry_gen()),
headers={
    "Cache-Control": f"max-age={standard_age}, public",
    "Age": f"{age}",
@krivard (Contributor):

Issue: this won't work as-is

We don't want this -- Age gets subtracted from Cache-Control.max-age when clients compute whether the response can be reused or not.

Suggested change
-    "Age": f"{age}",

Reference: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control#:~:text=Note%20that%20max,freshness%20lifetime.

@melange396 (Collaborator, Author):

i disagree. i think it will in fact work, especially because (as i mentioned above) the R lib doesn't even seem to respect the "Age" header.

and we do want this -- when the time elapsed since the resource was cached plus Age exceeds max-age, there is very likely to be newly calculated metadata waiting to be retrieved (presuming max-age is set to the metadata regeneration interval).

alternatively, we could use "Cache-Control": f"max-age={max(standard_age-age, 1)}, public" to achieve a similar effect without using the "Age" header.
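
To make the disagreement concrete, this is roughly how a standards-compliant HTTP cache decides whether a stored response is still fresh (a simplified sketch of the RFC 9111 freshness model, not code from either library):

def response_is_fresh(max_age: int, age_header: int, seconds_in_local_cache: int) -> bool:
    # current_age starts at the server-supplied Age and grows while the response
    # sits in the client's cache; reuse without revalidation is allowed only while
    # current_age < max_age, so a nonzero Age shortens the reuse window
    current_age = age_header + seconds_in_local_cache
    return current_age < max_age

Under melange396's reading that shortened window is exactly the point; under krivard's reading it risks clients re-fetching sooner than intended.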

@krivard (Contributor):

as i mentioned above

can you link me to this? i can't seem to find it.

@melange396 (Collaborator, Author):

use the link from the opening comment ( https://github.com/r-lib/httr/blob/HEAD/R/cache.R ) and search for "age"

@melange396 (Collaborator, Author):

PTAL, added some wiggle room into the cacheability window that may assuage your concerns without compromising too much on behaving like a proper cache.
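
The commit itself has the exact change; as a purely hypothetical illustration of what "wiggle room" could look like, one option is to pad max-age past the regeneration interval so compliant clients don't re-fetch the instant the cache is due to roll over:

# hypothetical numbers for illustration only; see the actual commit for real values
REGEN_INTERVAL = 2 * 60 * 60  # assumed metadata recomputation interval, in seconds
WIGGLE_ROOM = 10 * 60         # extra slack before clients must re-fetch

cache_control_header = f"max-age={REGEN_INTERVAL + WIGGLE_ROOM}, public"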

# empty generator that never yields
def _nonerator():
    return
    yield
@krivard (Contributor):

this line is unreachable and makes Sonar nervous

@melange396 (Collaborator, Author):

it's reachable enough for python to consider this a generator ¯\_(ツ)_/¯
we can probably replace return;yield with return (x for x in []), similar to what you suggested before. let me try that out.
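
That replacement would look something like this (a sketch of the idea being discussed, assuming the helper keeps its name):

# empty generator without the unreachable yield: return a generator expression
# that yields nothing, so printer() still receives a generator
def _nonerator():
    return (x for x in [])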

@melange396 (Collaborator, Author):

@dsweber2 and i successfully tested this PR and cmu-delphi/covidcast#645 together in our dev environments -- the R client does not make a new request on subsequent calls to covidcast_meta() when the Cache-Control header is provided by the api server.
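
(For anyone wanting to spot-check the header outside the R client, something like the following against a dev server would do it; the URL is a placeholder, not the real deployment path:)

import requests

# placeholder URL for a local dev instance of the api server; adjust as needed
resp = requests.get("http://localhost:5000/covidcast_meta/")
print(resp.headers.get("Cache-Control"))  # e.g. "max-age=7200, public" if the 2h interval is used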

sonarcloud bot commented Jul 10, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 4 (rating A)

Coverage: no coverage information
Duplication: 0.0%

@krivard (Contributor) left a comment:

🚀

@melange396 melange396 merged commit 59385b9 into dev Jul 10, 2023
@melange396 melange396 deleted the http_caching_metadata branch July 10, 2023 21:13
dmytrotsko pushed a commit that referenced this pull request Jul 26, 2023
allow http caching of metadata endpoint (plus a small margin), and removed staleness logging