add caching #153

Merged 24 commits into dev on Sep 6, 2023
Conversation

@dsweber2 (Contributor)

closes #29
Still needs docs and tests, but it successfully loads. Exactly how to handle the persistent cache is tricky; going off of the R Packages textbook, I used an environment.

@brookslogan (Contributor)

I forget what the epidatr committee said on this, but to me, caching smartly in epidatr is going to be super hard. That might make this not a priority, or something that needs more discussion. (Maybe @dshemetov can remind us of the nightmare this could become.)

@capnrefsmmat (Contributor)

I'd think the right way to do this is at the HTTP layer: have the Epidata server use Cache-Control headers to specify how long data is valid, and check these before re-requesting the same data. We did this with covidcast_meta over in this PR, but it requires cooperation from the server. It would also be hard to make this persistent.

Note also the CRAN policy:

Packages should not write in the user’s home filespace (including clipboards), nor anywhere else on the file system apart from the R session’s temporary directory (or during installation in the location pointed to by TMPDIR: and such usage should be cleaned up). Installing into the system’s R installation (e.g., scripts to its bin directory) is not allowed.

Limited exceptions may be allowed in interactive sessions if the package obtains confirmation from the user.

For R version 4.0 or later (hence a version dependency is required or only conditional use is possible), packages may store user-specific data, configuration and cache files in their respective user directories obtained from tools::R_user_dir(), provided that by default sizes are kept as small as possible and the contents are actively managed (including removing outdated material).

A possible non-persistent cache would store the httr response objects, and when a duplicate request is made, use rerequest() to have httr check the cache headers and re-fetch it as appropriate.
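
For illustration, here is a minimal sketch of that idea; the wrapper function and its cache are hypothetical, not epidatr code:

    library(httr)

    # Keep responses in an environment for the life of the session, keyed by URL.
    response_cache <- new.env(parent = emptyenv())

    fetch_cached <- function(url) {
      if (exists(url, envir = response_cache, inherits = FALSE)) {
        # rerequest() consults the cached response's caching headers and only
        # goes back to the server if the cached copy is stale.
        resp <- rerequest(get(url, envir = response_cache, inherits = FALSE))
      } else {
        resp <- GET(url)
      }
      assign(url, resp, envir = response_cache)
      resp
    }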

But I guess we'd have to consider the use case. Are people repeatedly running the same script that fetches the same data, in different R sessions? Or is the concern repeated fetches in a single session?

@dsweber2 (Contributor, Author)

Reading through #29, it looks like y'all were trying to make a bespoke caching system, which significantly increases complexity. cachem handles most of that.

Currently by default the cache lives in here::here(".epidatr_cache"). Would that be covered by the "R session's temporary directory"?

@dsweber2 (Contributor, Author)

I certainly agree that the right way to do this involves the server telling the client when the cache is invalid. Our clearing interval currently defaults to 7 days, which should be good enough for a first pass; any server/client changes to caching shouldn't involve major changes to the API, just background behavior, so figuring that out later, if we want it, should be reasonable.

@capnrefsmmat (Contributor)

Currently by default the cache lives in here::here(".epidatr_cache"). Would that be covered by the "R session's temporary directory"?

No, the session temporary directory is the directory given by tempdir(). This is usually in /tmp or the OS equivalent, and isn't persistent between sessions. That's why they added the tools::R_user_dir() thing in R 4.0.
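
For concreteness, a sketch of the two locations (using "epidatr" as the R_user_dir package name is my assumption about how this would be wired up):

    # Session-only: wiped when R exits.
    file.path(tempdir(), "epidatr_cache")

    # Persistent and CRAN-sanctioned, but requires R >= 4.0.
    cache_dir <- tools::R_user_dir("epidatr", which = "cache")
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)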

Our clearing interval currently defaults to 7 days, which should be good enough for a first pass

Wouldn't that be a problem if I'm requesting a signal that gets frequent backfill? The most conservative rule would seem to be "cache until the next time the pipeline is scheduled to run, in case it changes anything"; but for pipelines that don't do backfill, you could have a different rule.

@dsweber2 (Contributor, Author) commented Aug 25, 2023

No, the session temporary directory is the directory given by tempdir(). This is usually in /tmp or the OS equivalent, and isn't persistent between sessions. That's why they added the tools::R_user_dir() thing in R 4.0.

Dang, I guess we either have to plead for an exemption to use here::here(), or set the default to tempdir() and throw a warning if people don't configure their own location. Since you're the one likely to deal with The CRAN Folks, I'll defer to your decision on the default location.

The most conservative rule would seem to be "cache until the next time the pipeline is scheduled to run, in case it changes anything"; but for pipelines that don't do backfill, you could have a different rule.

I'm not sure the client has good ways to figure out the update behavior for the server's pipeline. My plan for the first version of caching is that it's probably not something you want to use for anything younger than a month or so (the default is no caching). I figure we can eventually add in better tools to guarantee that it stays accurate on current data, and that this won't actually change the interface, just backend behavior.

@capnrefsmmat (Contributor)

I'm not sure the client has good ways to figure out the update behavior for the server's pipeline. My plan for the first version of caching is that it's probably not something you want to use for anything younger than a month or so (the default is no caching). I figure we can eventually add in better tools to guarantee that it stays accurate on current data, and that this won't actually change the interface, just backend behavior.

Right, that's why I like the idea of getting the server to send appropriate Cache-Control headers, so we can rely on those to give the right dates. That requires work on the delphi-epidata side, but ensures the cache doesn't cause problems when a user accidentally uses stale data.

Relying on the user to set caching options manually makes caching a footgun, where the wrong setting accidentally causes you to use bad data, or makes your results non-reproducible because they depend on a stale cached version of the data. And unfortunately non-Delphi users will have no idea what an appropriate cache lifetime is, because they don't know how often data is updated or backfilled, and won't be able to evaluate how a particular query might or might not be affected by backfill.

So I'll return to my previous question: What's the use case for caching? Specifically, who do we expect to need it, and what kinds of tasks are they performing that benefit from caching? That'll help decide which approach is best and how to present the feature so users don't mess it up.

@melange396 (Contributor)

I agree with @capnrefsmmat, caching support on the server is a decent way to approach this and way more extensible than something in a specific client library.

Adding HTTP Cache-Control headers to responses isn't hard in itself, but determining appropriate cache lifetimes is not so simple. We could assume things get updated daily, split the difference to give everything a 12-hour lifetime (or something arbitrary and shorter), and just hope that works well enough without being an inconvenience that keeps people from fresh data. That feels like a dirty hack. Another way we could do it is by keeping timing information in the server code/config that mirrors our job schedule and maps it to endpoints, but keeping that in sync isn't much fun and is hacky too. These are also both blanket approaches; the value of an old data point for a specific day in 2021 will probably never change again, but the cache timing will still apply to the whole dataset/endpoint/table/whatever, and it'll need to be re-requested when the cache expires.

Another option is the ETag header and its partner If-None-Match. There's support for ETag in httr. It would work thusly: whenever an acquisition job updates a table in our database, it generates a new ETag value associated with that table (a nonce or a hashed value, which is stored in some other shared table). Whenever an endpoint serves data from a table, it includes the table's ETag value as a header in the response, EXCEPT if the request included an If-None-Match header equal to the current ETag, in which case the endpoint simply returns code 304 Not Modified and an empty body (the endpoint doesn't even query the DB for the data that was requested). If an endpoint serves data from more than one table simultaneously, it could return a concatenation or sum of the constituents' ETags.
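
The client half of that scheme would look roughly like this with httr (a sketch; the stored prev response and the URL handling are illustrative, not epidatr internals):

    library(httr)

    conditional_fetch <- function(url, prev = NULL) {
      if (is.null(prev)) return(GET(url))
      # Send the cached ETag back; the server replies 304 if nothing changed.
      resp <- GET(url, add_headers(`If-None-Match` = headers(prev)[["etag"]]))
      if (status_code(resp) == 304L) prev else resp
    }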

Using ETag has potential complications though:

  • It still requires that the client reach out across the network, so it's really only efficient if the query takes a long time to process server-side, or if the result set is large and bandwidth is scarce.
  • It adds a blocking round trip between the API server and the DB server per re-request.
  • We may or may not want to exempt 304 responses from rate-limiting counts (maybe a somewhat difficult decision, but easy to implement).
  • It is again a kind of blanket approach, for the same reasons as above. As an aside: for a non-blanket approach, we could make the ETag a hash of the maximum of the issue dates of the rows that would be returned by the request... but, naively, that basically means doing the query already, which defeats the purpose. There's another trick we could do involving a pre-query based only on indexes and only if the If-None-Match header is in the request, but now this is getting complicated, and it may not even add up to any time savings in the end.

The If-Modified-Since header exists too, but we probably want to avoid it so we can support POST for long requests.


Before we go into developing all of that, we might want to ask what good it will do:

  • How often does a person make an identical request within some relatively short period of time?
  • When they do make a duplicate request, will it actually save them much time? The mean and median server processing times for non-browser requests are 40ms and 10ms respectively, and those times are nigh imperceptible to [most?] people (transmission and traversal of the internet will add some time to these numbers, but that's dependent on things outside of our control, and modern networks are pretty fast).
  • Unless the user is using a caching proxy (which is probably quite unlikely), the cached resources are only good for the lifetime of their program's execution. If they're running a script over and over again, it's going to lose state and have to re-request with each new run.
  • Small changes can cause cache misses. Requests with overlapping parameters don't benefit from each other, and duplicated data will have to be re-transmitted. Perhaps a user in an interactive session makes a large and slow query across all counties in the nation with a wide time window, which takes 2 minutes. Then they realize they need to add an additional day to the beginning and end of the window. With or without caching available, I'm somewhat ashamed to admit that I would probably just end up willfully reissuing a new query with a wider window and waiting 2 more minutes, instead of issuing a query for only the missing days and reassembling them all. In fact, with a little luck, the new wider query might only take a fraction of the 2 minutes thanks to disk cache on the DB server.

Additionally, what is the cost of maintaining a cache on the client side? Many requests will NOT be duplicated, so those cached values will be sitting somewhere on the user's machine (in memory or on disk), taking up space for no good reason. If someone makes a call for 300MB of data from us, it will take up something like ~600MB on their machine: 300MB in the cache, and 300MB in the dataframe they're working with. If they make 50 different calls (one for each state, perhaps) for 300MB each, intending to process each chunk piecemeal so that it only occupies 300MB of memory at a time and then throw it away before moving on to the next, the cache will actually take up 15GB behind their back. Those numbers may not be an accurate estimate of real usage, and disk and RAM are relatively cheap/plentiful these days, but this could be a concern for some. To make things worse, as far as I can tell, httr never deletes from its cache.

We added Cache-Control to the metadata endpoint because it solved a problem and made us compatible with CRAN. On the other hand, this endeavor sounds like a premature optimization. I think it's probably not worth it unless there is a demonstrable need -- unless, of course, it's purely academic and someone wants to do it for the challenge or the fun of it (and they promise to be responsible for maintaining it in the future 😜). You might start by looking for evidence of efficacy by pulling (source IP, request URL, timestamp, processing time, content length) from server logs and analyzing it, but I'm going to guess that's more work for small potential gains.

Since I did write this novel of a comment, I think I will create an issue in delphi-epidata regarding caching support and link back to here. Composing all this also made me want to redo the existing metadata Cache-Control stuff using ETag, but it's very unnecessary and probably even a waste of time -- what we have now "works".


For an off-the-wall alternative that involves no code changes to our clients or server: I think it should be possible for a user to set up a local squid caching proxy such that it intercepts calls to our API, caches them for a configurable period of time (even without caching headers coming from our server), persists between program executions, removes expired entries, enforces a limit on the size of the cache, and even saves logs about cache hits and misses to analyze its effectiveness... But the incantation for that is not in my current spellbook, and is left as an exercise for the reader.

@brookslogan (Contributor) commented Aug 28, 2023

My use cases:

  • I want to have access to all history data [for a particular signal x geoset, probably all time values]. I want old history to be cached to disk, and new history to be requested regularly and added to the cache. This involves custom caching code, and in the past seemed easiest to do external to epidatr, in epiprocess or another package. [But we can't detect longer-back revision events without server help. Really it'd be nice to have a tri-temporal DB and be able to request updates since the last system time that data was returned for. Or maybe a hack towards that: (i) queries return something like a system time alongside the data, (ii) we can query a coarse description of changes, like something-like-system-time x endpoint x signal x first issue affected x last issue affected x first time value affected x last time value affected.]
  • I want to write a snippet that queries some data and does stuff, and re-execute that snippet a bunch of times as I tweak and add onto it. I want the query to be cached to RAM [NOT disk]. I'll do the same exact query every time. Might be a "regular" query or a history query depending on the snippet. [I'd probably be fine with the cache lasting the entire session. But it would be nice if it were reprex-compatible.]

[Since the updating-history use case involves more trouble to work around, it seems like it would provide more value.]

@dsweber2 (Contributor, Author)

Just to have these all in a persistent place, here are some comments about potential use cases from Slack.
From Ryan:

For me, it's a one-R-session kind of thing. If I'm thinking carefully and writing something that would be worth revisiting in (say) a week, then I'd probably manually implement caching (or use an R notebook with caching turned on).
So if I'm just playing around a bit in a given R session, then this is where I'd imagine caching is most useful. I would say that caching when as_of is actually set (to something other than NULL) is probably the only thing that makes sense, since otherwise you might be falsely caching and not actually returning up-to-date data.

But just to reiterate something I mentioned in the meeting, I don’t think we should hold off the first release to implement this.

From Dan:

My use case would be quite similar. One session should be sufficient, and even then, it would be convenient, but not crucial, and rare. So I don’t think it should hold up the release.

@dsweber2 (Contributor, Author) commented Aug 28, 2023

I think Ryan's suggestion of only caching requests with as_of or issues explicitly set, combined with making caching off by default, shaves off 95% of the potential foot-guns here without requiring maintenance on the server side. As far as I can tell, the only things that interfere with those are faults in the record system.

I'm a bit surprised by people wanting the cache in memory rather than on disk, but I suppose that's because I'm thinking about production use cases.

@dshemetov (Contributor) commented Aug 28, 2023

FWIW, I have needed a persistent, non-memory cache. My use cases have been:

  1. in production forecasting, calibrating forecasters need the history of another forecaster's predictions, which requires a lot of as_of data, so there we store signal = x, as_of = t, geo_value = *, time_value = * as a file and can query it any time we need
  2. in production forecasting, since our pipeline still uses evalcast and data downloading is done per-forecaster, caching saves us time by not downloading the same data again for every forecaster
  3. in evaluation, we need as_of data as in case (1); this cache needs to persist across runs across multiple days
  4. in debugging forecasting, partial pipeline runs are common, where a later forecaster fails, so data caching helps speed up iterations without needing to rewrite the script; this cache needs to persist across runs across multiple days

Our previous implementation of caching in evalcast is here.

  • it's opt-in, off by default
  • it stores and queries cache files based on a tag of signal, as_of, geo_type; we encourage the user to make the first cache call with geo_value=*, time_value=*, so further calls can subset on complete data (see the sketch after this list)
  • it meets my use cases above
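
A minimal sketch of that tag-based scheme (not the actual evalcast code; fetch_fun stands in for the real API call):

    # Cache files are keyed on the (signal, as_of, geo_type) tag.
    cache_path <- function(dir, signal, as_of, geo_type) {
      file.path(dir, sprintf("%s_%s_%s.rds", signal, as_of, geo_type))
    }

    fetch_or_cache <- function(dir, signal, as_of, geo_type, fetch_fun) {
      path <- cache_path(dir, signal, as_of, geo_type)
      if (file.exists(path)) return(readRDS(path))  # cache hit
      result <- fetch_fun(signal, as_of, geo_type)  # cache miss: query the API
      saveRDS(result, path)
      result
    }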

@brookslogan (Contributor)

@dshemetov My first use case is your 1, and it'd be nice for any solution to solve your 1-3. Before evalcast's caching, it seemed like this would be best done with specialized archives, since if your data source doesn't have a lot of revisions, you could save on space/bandwidth by using issues= queries instead of as_of queries.

My second use case plus Dmitry's 4 point to wanting to be able to configure whether the cache is RAM or disk. Maybe by just specifying a cachem cache?

@dsweber2 (Contributor, Author)

cachem has both a cache_mem and a cache_disk function; it wouldn't be too hard to add that as a config option, though the two functions differ somewhat in their options (see the sketch below). In the interest of simplicity, does the fact that the cache has a max size and will have to be set at a global level address your concerns about having it on disk rather than in memory?
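
For reference, the two constructors side by side (illustrative values; the defaults chosen in this PR may differ):

    library(cachem)

    # RAM-backed: per-session, reprex-friendly.
    ram_cache <- cache_mem(max_size = 512 * 1024^2)

    # Disk-backed: persists between sessions, with size and age limits.
    disk_cache <- cache_disk(dir = "~/.epidatr_cache",
                             max_size = 1024 * 1024^2,
                             max_age = 7 * 24 * 3600)

    disk_cache$set("some_query_key", data.frame(x = 1))
    disk_cache$get("some_query_key")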

At the moment, I'm thinking only calls where either as_of or issues is specified (and is not * in the case of issues) are cached, with results stored either in a temp directory or in a global, user-specified location.

@brookslogan (Contributor)

(Migrating some notes from the meeting:

  1. I think a single cache plus a limit on cache size does solve my concerns.
  2. We might want to check what happens when a single object exceeds the cache size.
  3. as_of and issues queries can also be invalidated by new issues arriving or by patches.
  4. A short cache invalidation time might help avoid issues.
  5. Caching should be opt-in.)

One situation from 3. to perhaps mention in the docs... Suppose I write make_forecast = function(forecast_date) { ... <query using as_of = forecast_date> ... }, and do make_forecast(Sys.Date()). When using as_of = <today>, just like with regular current-version queries, more data can show up later in the day. So:

  • if a user wants to keep running with the same set of data, they need to make sure their cache entries stick around (long invalidation period, large/unlimited cache size) or are saved externally;
  • if a user wants to run with the very latest available, they need to avoid caching or make sure their cache entries are cleared.

The short version might be: "if you are querying data as_of = <today>, think carefully about whether you want to use a cache and what the cache settings should be" (see the toy example below).
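
Here is a self-contained toy illustrating that trap; query_server() stands in for a real epidata request whose result grows during the day, and nothing here is epidatr code:

    library(cachem)

    cache <- cache_mem()
    query_server <- local({
      n <- 0
      function() {
        n <<- n + 1
        seq_len(n)  # each real call "sees" more data than the last
      }
    })

    cached_query <- function(key) {
      hit <- cache$get(key)
      if (!is.key_missing(hit)) return(hit)  # cached snapshot wins
      res <- query_server()
      cache$set(key, res)
      res
    }

    cached_query("as_of_2023_08_28")  # morning run: fetches and caches one value
    cached_query("as_of_2023_08_28")  # evening run: cache hit, stale morning snapshot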

From R/cache.R:
#' The guts of caching; it's interposed between fetch and the specific fetch methods. Internal method only.
#'
#' @param call the `epidata_call` object
#' @inheritParams fetch
A reviewer (Contributor) suggested this change:

    - #' @inheritParams fetch
    + #' @inheritParams fetch
    + #' @keywords internal

@dsweber2 (Contributor, Author) replied:

Oh, on second thought, it shouldn't be inheriting from fetch; that was before you made fetch_args_list.

@dsweber2 (Contributor, Author) commented Sep 5, 2023

OK, so I think this is ready: caching is off by default, and only works for calls containing as_of or issues; if the user tries to cache young data (e.g. something like as_of = today()), they get a warning. The cache is a bit dumb and stores every unique call separately, but the global cache does have a maximum memory footprint.

For details, you may want to read the docs for set_cache; it houses the general description of caching behavior.
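
A hypothetical opt-in for flavor (the argument name is an assumption; ?set_cache on this branch is authoritative):

    library(epidatr)

    # Caching stays off until the user opts in explicitly:
    set_cache(cache_dir = "~/.epidatr_cache")
    # Only calls that pin a version, i.e. set as_of= or issues=, are cached,
    # and an as_of near today triggers the warning described above.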

@dshemetov marked this pull request as ready for review September 5, 2023 22:08
@dshemetov merged commit 1e45282 into dev Sep 6, 2023
@dshemetov deleted the dsweber2/caching branch September 6, 2023 20:08
Successfully merging this pull request may close these issues:

Consider adding download caching to the client (#29)