add caching #153
Conversation
I forget what the epidatr committee said on this, but to me, caching smartly in epidatr is going to be super hard, which might make this not a priority, or something that needs more discussion. (Maybe @dshemetov can remind us of the nightmare this could become.)
I'd think the right way to do this is at the HTTP layer: have the Epidata server use `Cache-Control` headers to specify how long data is valid, and check these before re-requesting the same data. We did this with `covidcast_meta` over in this PR, but it requires cooperation from the server. It would also be hard to make this persistent. Note also the CRAN policy: packages should not write in the user's home filespace, nor anywhere else on the file system apart from the R session's temporary directory, with limited exceptions only when the package obtains confirmation from the user.
A possible non-persistent cache would store the httr response objects and, when a duplicate request is made, use the stored response instead of re-requesting. But I guess we'd have to consider the use case. Are people repeatedly running the same script that fetches the same data, in different R sessions? Or is the concern repeated fetches in a single session?
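A minimal sketch of that non-persistent approach, assuming httr as the HTTP client; `response_cache` and `cached_GET()` are illustrative names, not anything in epidatr:

```r
# Illustrative in-session cache of httr response objects, keyed by request URL.
# Nothing persists past the R session, so the CRAN file-system policy is not an issue.
library(httr)

response_cache <- new.env(parent = emptyenv())

cached_GET <- function(url) {
  hit <- get0(url, envir = response_cache, inherits = FALSE)
  if (!is.null(hit)) {
    return(hit)  # duplicate request within this session: reuse the stored response
  }
  resp <- GET(url)
  assign(url, resp, envir = response_cache)
  resp
}
```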
Reading through #29, it looks like y'all were trying to make a bespoke caching system, which significantly increases complexity. For the most part, `cachem` handles this for us. Currently, by default, the cache lives in the session temporary directory.
I certainly agree that the right way to do this involves the server telling the client when the cache is invalid. Our clearing interval currently defaults to 7 days, which should be good enough for a first pass; any server/client changes to caching shouldn't involve major changes to the API, just background behavior, so figuring that out later, if we want it, should be reasonable.
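For reference, a sketch of what a 7-day expiry looks like with cachem, assuming that is the layer enforcing the clearing interval; the directory name is illustrative:

```r
# Disk cache under the session temporary directory; entries expire after 7 days.
cache <- cachem::cache_disk(
  dir = file.path(tempdir(), "epidatr-cache"),
  max_age = 7 * 24 * 60 * 60  # seconds
)
```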
No, the session temporary directory is the directory given by `tempdir()`.
Wouldn't that be a problem if I'm requesting a signal that gets frequent backfill? The most conservative rule would seem to be "cache until the next time the pipeline is scheduled to run, in case it changes anything"; but for pipelines that don't do backfill, you could have a different rule.
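For concreteness, a sketch of that "cache until the next scheduled run" rule, assuming, purely for illustration, a daily pipeline finishing at 16:00 UTC; the server would translate the result into a `Cache-Control: max-age=...` header:

```r
# Seconds until the next scheduled pipeline run (assumed daily at 16:00 UTC).
# The server could emit this value as Cache-Control: max-age=<n>.
seconds_until_next_run <- function(now = Sys.time()) {
  now_utc <- as.POSIXct(format(now, tz = "UTC"), tz = "UTC")
  todays_run <- as.POSIXct(format(now_utc, "%Y-%m-%d 16:00:00"), tz = "UTC")
  next_run <- if (now_utc < todays_run) todays_run else todays_run + 86400
  ceiling(as.numeric(difftime(next_run, now_utc, units = "secs")))
}
```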
Dang, I guess we either have to plead an exemption to use a persistent cache directory, or keep the cache inside the session temporary directory.
I'm not sure the client has good ways to figure out the update behavior of the server's pipeline. My plan for the first version of caching is that it's probably not something you want to use for anything younger than a month or so (the default is no caching). I figure we can eventually add better tools to guarantee that it stays accurate on current data, and that this won't actually change the interface, just backend behavior.
Right, that's why I like the idea of getting the server to send appropriate `Cache-Control` headers, so we can rely on those to give the right dates. That requires work on the delphi-epidata side, but ensures the cache doesn't cause problems when a user accidentally uses stale data.

Relying on the user to set caching options manually makes caching a footgun, where the wrong setting accidentally causes you to use bad data, or makes your results non-reproducible because they depend on a stale cached version of the data. And unfortunately non-Delphi users will have no idea what an appropriate cache lifetime is, because they don't know how often data is updated or backfilled, and won't be able to evaluate how a particular query might or might not be affected by backfill.

So I'll return to my previous question: What's the use case for caching? Specifically, who do we expect to need it, and what kinds of tasks are they performing that benefit from caching? That'll help decide which approach is best and how to present the feature so users don't mess it up.
I agree with @capnrefsmmat: caching support on the server is a decent way to approach this, and way more extensible than something in a specific client library. Adding HTTP `Cache-Control` headers would let any well-behaved HTTP client handle expiry for us. Another option is the `ETag` header, where the server labels each response with an identifier for that version of the data and the client revalidates with `If-None-Match`, getting back a `304 Not Modified` when nothing has changed. Using `ETag` has potential complications, though.
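A sketch of what `ETag` revalidation could look like from R with httr; the endpoint URL is illustrative, and the server does not currently send these headers:

```r
# Revalidate a previously fetched response using its ETag. On 304 Not Modified,
# reuse the cached body instead of re-downloading it.
library(httr)

url <- "https://api.delphi.cmu.edu/epidata/covidcast/"  # illustrative endpoint
prev <- GET(url)
etag <- headers(prev)[["etag"]]

resp <- GET(url, add_headers(`If-None-Match` = etag))
if (!is.null(etag) && status_code(resp) == 304) {
  resp <- prev  # data unchanged; the cached response is still valid
}
```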
Before we go into developing all of that, we might want to ask what good it will do.
Additionally, what is the cost of maintaining a cache on the client side? Many requests will NOT be duplicated, so those cached values will be sitting somewhere on the user's machine (in memory or on disk), taking up space for no good reason. If someone makes a call for 300MB of data from us, it will take up something like ~600MB on their machine: 300MB in the cache, and 300MB in the dataframe they're working with. If they make 50 different calls (one for each state, perhaps) for 300MB each, intending to process each chunk piecemeal so that it only occupies 300MB of memory at a time and then throw it away before moving on to the next, the cache will actually accumulate 15GB behind their back. Those numbers may not be an accurate estimation of real usage, and disk and RAM are relatively cheap/plentiful these days, but this could be a concern for some.

To make things worse, as far as I can tell, … We added `Cache-Control` headers to the `covidcast_meta` endpoint in a previous PR.

Since I did write this novel of a comment, I think I will create an issue in delphi-epidata regarding caching support and link back to here. Composing all this also made me want to redo the existing metadata caching.

For an off-the-wall alternative that involves no code changes to our clients or server, I think it should be possible for a user to set up a local squid caching proxy such that it intercepts calls to our API, caches them for a configurable period of time (even without caching headers coming from our server), persists between program executions, removes expired entries, enforces a limit on the size of the cache, and even saves logs about cache hits and misses to analyze its effectiveness... But the incantation for that is not in my current spellbook, and is left as an exercise to the reader.
My use cases:
[Since the updating-history use case involves more trouble to work around, it seems like it would provide more value.]
Just to have these all in a persistent place, here are some comments about potential use-cases from Slack.
From Dan:
I think Ryan's suggestion of only caching requests with `as_of` or `issues` specified is the right starting point. I'm a bit surprised by people wanting the cache in memory rather than storage, but I suppose that's because I'm thinking about production use-cases.
FWIW, I have needed a persistent, non-in-memory cache. My use cases have been:
Our previous implementation of caching in evalcast is here.
@dshemetov My first use case is your 1, and it'd be nice for any solution to solve your 1–3. Pre-evalcast-caching, it seemed like this would be best done with specialized archives, since if your data source doesn't have a lot of revisions, you could save on space/bandwidth by using `issues=` queries instead of `as_of` queries. My second use case plus Dmitry's 4 point to wanting to be able to configure whether the cache lives in RAM or on disk. Maybe by just specifying a cachem cache?
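That could look something like the sketch below, assuming the cachem package; `set_cache()` is a hypothetical configuration hook here, not a settled API:

```r
# Let the user choose where the cache lives by handing epidatr a cachem cache.
library(cachem)

# In RAM, capped at 512 MB, gone when the session ends:
set_cache(cache_mem(max_size = 512 * 1024^2))

# Or persistent on disk, capped at 1 GB:
set_cache(cache_disk(dir = "~/.epidatr-cache", max_size = 1024^3))
```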
At the moment, I'm thinking only calls where either `as_of` or `issues` is specified (and is not `*`) will be cached.
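A sketch of that eligibility check; the `epidata_call` field names are assumed for illustration:

```r
# Cache only calls pinned to a snapshot: `as_of` or `issues` must be present,
# and neither may be the wildcard "*", whose results can keep changing.
is_cachable <- function(call) {
  params <- call$params
  pinned_as_of <- !is.null(params$as_of) && !identical(params$as_of, "*")
  pinned_issues <- !is.null(params$issues) && !identical(params$issues, "*")
  pinned_as_of || pinned_issues
}
```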
Force-pushed from a32dd08 to 7b5f470.
(Migrating some notes from a meeting.) One situation from 3. to perhaps mention in the docs... Suppose I write
R/cache.R (outdated)
#' The guts of caching; it's interposed between fetch and the specific fetch methods. Internal method only.
#'
#' @param call the `epidata_call` object
#' @inheritParams fetch
Suggested change:
#' @inheritParams fetch
#' @keywords internal
Oh, on second thought, it shouldn't be inheriting from `fetch`; that was before you made `fetch_args_list`.
Force-pushed from b9ac762 to f4d0584.
only cache when the call includes an `as_of` or `issues`, neither of which can be `*`.
Force-pushed from 0a91142 to f7a46b3.
Ok, so I think this is ready; caching is off by default, and only works for calls containing `as_of` or `issues`. For details, you may want to read the docs for the new caching functions.
closes #29
Still needs docs and tests, but it successfully loads. Exactly how to handle the persistent cache is tricky; going off of the R Packages textbook, I used an environment.
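For readers unfamiliar with that pattern, a minimal sketch of package-level state held in an environment, in the style the R Packages book recommends; `cache_environ` is an illustrative name:

```r
# A package-private environment created when the package is built; it lives for
# the duration of the loaded package and holds the cache handle as mutable state.
cache_environ <- new.env(parent = emptyenv())
cache_environ$epidatr_cache <- NULL  # NULL means caching is disabled

# Enabling caching later swaps in a real cachem cache, e.g.:
# cache_environ$epidatr_cache <- cachem::cache_disk(dir = tempdir())
```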