
[RFC] Provide an explicit caching API #947

Closed
wants to merge 26 commits

Conversation

bonjourmauko
Member

@bonjourmauko bonjourmauko commented Feb 8, 2020

Connected to #887
Connected to #891

What's inside?

A possible implementation of #891, trying to keep the current contracts as they are.

This is a proposal to separate concerns between caching and storage logic: move all caching-related tasks from Holder to Cache, and all storage logic from everywhere else to MemoryStorage and DiskStorage.

Assuming those low-level classes were meant to be private, the new Cache class respects the previous classes' contracts, and warns the user about deprecated methods that will eventually be removed.

Technical changes

  • Create MemoryStorage and move storage logic here.
  • Create DiskStorage and move storage logic here.
  • Extract eternalness checks somewhere else.
  • Merge current caching implementations into a single Cache class.
  • Add warning messages for deprecated methods.
  • Keep current contracts by dispatching to the new ones.
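A minimal sketch of what the storage half of this split might look like. The class name follows the bullet points above, but the exact method signatures are assumptions, not the PR's actual code:

```python
import numpy


class MemoryStorage:
    """In-memory backend: keeps computed vectors in a plain dict.

    Keys are expected to be hashable (e.g. (variable, period) tuples);
    values are NumPy arrays, as elsewhere in Core.
    """

    def __init__(self):
        self._arrays = {}

    def put(self, key, value: numpy.ndarray) -> None:
        self._arrays[key] = value

    def get(self, key):
        # Returns None on a cache miss rather than raising.
        return self._arrays.get(key)

    def delete(self, key) -> None:
        self._arrays.pop(key, None)
```

A DiskStorage counterpart would expose the same interface but persist each array to a file, which is what lets the two backends be swapped behind Cache.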

What's next?

  • Somehow get rid of the eternal / non-eternal variable check.
  • Move caching logic out of Holder into Cache, notably the lookup.
  • Preserve the contracts of calculate, set_input*,
  • One simulation, one cache.
  • Cache.lookup(function(variable, period))
  • Get rid of Holder.

Thanks for your feedback!

@bonjourmauko bonjourmauko added the kind:refactor Refactoring and code cleanup label Feb 8, 2020
@bonjourmauko bonjourmauko requested a review from a team February 8, 2020 17:31
@bonjourmauko bonjourmauko force-pushed the add-cache-api branch 3 times, most recently from 479b878 to b1eea4b Compare February 9, 2020 19:09
@bonjourmauko bonjourmauko changed the title [WIP] Provide an explicit caching API Provide an explicit caching API Feb 10, 2020
@bonjourmauko bonjourmauko added kind:discovery Issue requires discovery: value, ux and tech and removed kind:refactor Refactoring and code cleanup labels Feb 10, 2020
@bonjourmauko
Member Author

@Morendil gave it another try after your initial feedback :) any further advice would be highly appreciated.

@bonjourmauko bonjourmauko requested a review from a team February 10, 2020 22:52
@bonjourmauko bonjourmauko force-pushed the add-cache-api branch 2 times, most recently from 3195a15 to 1cad5b7 Compare February 10, 2020 23:01
@Morendil
Contributor

This still isn't the right tack IMO. It leaves the existing relationships between instances untouched, with the Variable -> Holder -> Cache linkage roughly reproducing the Variable -> Holder -> Storage scheme.

To sketch it out briefly, I believe the Cache belongs to the Simulation (one Cache per Simulation), and functions as a "two-dimensional dictionary" - or to put it another way, a dictionary whose keys are tuples, specifically (variable, period) tuples; the values are Numpy arrays as usual. The logic to deal with eternal values will likely be a thin adapter in front of this dictionary. To preserve the functionality afforded by OnDiskStorage we want to be able to persist and reload this dictionary but this is a relatively trivial side-feature, not part of the core logic.
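The "two-dimensional dictionary" described above could be sketched like this. The ETERNITY sentinel and the method names are assumptions for illustration, not existing Core API:

```python
import numpy

# Sentinel period for values that never change (eternal variables).
ETERNITY = "eternity"


class Cache:
    """One per Simulation: a dict keyed by (variable, period) tuples."""

    def __init__(self):
        self._values = {}

    def put(self, variable: str, period: str, value: numpy.ndarray) -> None:
        self._values[(variable, period)] = value

    def get(self, variable: str, period: str):
        # Thin adapter for eternal values: stored once under ETERNITY,
        # returned whatever period is asked for.
        if (variable, ETERNITY) in self._values:
            return self._values[(variable, ETERNITY)]
        return self._values.get((variable, period))
```

Persisting this cache then reduces to serializing one dict of arrays, which keeps the OnDiskStorage feature a side concern rather than core logic.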

The Cache functionality should be relatively compact, in LOC terms, at parity with existing functionality; where it starts getting potentially more complex is when we add masking.

So the way I'd approach this work would be to take the point of view of the Simulation, and pretend Holders and Storage and all this stuff don't exist at all. From the perspective of the Simulation, all I need is a way to answer "do I already have a value for this variable at this period that I can use, or do I need to compute it", and of course a way to register "I've just computed a value for this variable at this period so I'll store it to avoid computing it again".

Separately, test that the Cache state of a simulation can be persisted and loaded again. This might be as simple as serializing and deserializing the array; while we're still not clear on how that feature is used in practice and exactly what needs it supports, I suspect the persistence medium is an implementation detail, and the current implementation doesn't best reflect those needs. So shunting that feature "to the side" gives us more flexibility.

Since this is all essentially a matter of side-effects mocks would make some sense to me in testing this; request the same computation twice on a Simulation and check that only the first request causes a call to calculate for instance, with the second instead resulting in a Cache lookup.
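The mock-based test described above might look like this. The Simulation stub is hypothetical; in a real test you would patch the actual formula:

```python
from unittest.mock import Mock

import numpy


# Hypothetical simulation that consults its cache before computing.
class Simulation:
    def __init__(self, formula):
        self.formula = formula  # the costly computation, mocked below
        self.cache = {}

    def calculate(self, variable, period):
        key = (variable, period)
        if key not in self.cache:
            self.cache[key] = self.formula(variable, period)
        return self.cache[key]


formula = Mock(return_value=numpy.array([1.0, 2.0]))
simulation = Simulation(formula)

simulation.calculate("salary", "2020-01")
simulation.calculate("salary", "2020-01")

assert formula.call_count == 1  # the second request hit the cache
```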

@bonjourmauko bonjourmauko changed the title Provide an explicit caching API [WIP] Provide an explicit caching API Feb 11, 2020
@bonjourmauko
Member Author

bonjourmauko commented Feb 11, 2020

Thanks a lot, it's starting to look like a roadmap.

So the way I'd approach this work would be to take the point of view of the Simulation, and pretend Holders and Storage and all this stuff don't exist at all. From the perspective of the Simulation, all I need is a way to answer "do I already have a value for this variable at this period that I can use, or do I need to compute it", and of course a way to register "I've just computed a value for this variable at this period so I'll store it to avoid computing it again".

If calculate is, for example:

calculate: Callable[[Variable, Period], ndarray]
cache: Callable[[Callable[[Variable, Period], ndarray]], Callable[[Variable, Period], ndarray]]

Separately, test that the Cache state of a simulation can be persisted and loaded again. This might be as simple as serializing and deserializing the array; while we're still not clear on how that feature is used in practice and exactly what needs it supports, I suspect the persistence medium is an implementation detail, and the current implementation doesn't best reflect those needs. So shunting that feature "to the side" gives us more flexibility.

DiskStorage always persists on disk, and you can dump and restore as expected; it works quite nicely (there's a test for that™). What happens is that all stored files are destroyed once the cache object is garbage collected (__del__(self)), unless you tell the cache not to do so. Again, most of this caching is variable caching; entities and so on are handled somewhere else ...

As for array serializing and deserializing, I think it's done like:

numpy.dump(enum.possible_values)
EnumArray(file, enum)

In my humble perspective, it would be simpler to just:

hashed_array = tuple(map(tuple, array))
array = numpy.asarray(hashed_array)

### in case you'd like to have

{(variable, period, hashed_mask): numpy.ndarray}

One additional test would be to do a restore from a set of fixture files that are independent of the test case.


In other words, extract holders from calculate, for example, and wrap calculate with a HOF that does the job. At that point, if you suppress the cache, things will continue to work, just without caching, as there won't be a single line of caching code inside the functions themselves.
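The higher-order-function idea could be sketched as follows: calculate stays free of caching code, and the wrapper supplies it. All names here are illustrative, not the actual Core API:

```python
import functools

import numpy


def cached(calculate):
    """HOF: wrap a (variable, period) -> ndarray function with a cache."""
    state = {}

    @functools.wraps(calculate)
    def wrapper(variable, period):
        key = (variable, period)
        if key not in state:
            state[key] = calculate(variable, period)
        return state[key]

    return wrapper


@cached
def calculate(variable, period):
    # Stand-in for the real computation; note: no caching code in here.
    return numpy.array([100, 200, 500])
```

Suppressing the cache is then just a matter of removing the decorator; calculate itself is unchanged.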

@bonjourmauko
Member Author

bonjourmauko commented Feb 12, 2020

Here's what I've been thinking about masks; I'm not sure whether I truly understood the idea, @Morendil and @sandcha:

In [95]: v = 1

In [96]: p = 2

In [97]: m = tuple(map(bool, numpy.array([True, False, True])))

In [98]: key = (v, p, m)

In [99]: k = numpy.array([100, 200, 500])

In [100]: state = {key: k}

In [101]: numpy.array(key[2]) * state[key]
Out[101]: array([100,   0, 500])

In this example, you only need the key to be hashable; for instance, if you want the mask vector to be part of the key, you first transform it into a tuple.

@bonjourmauko
Member Author

bonjourmauko commented Feb 12, 2020

Another example :

import numpy

from typing import NamedTuple


class Key(NamedTuple):
    variable: int = 2
    period: int = 3
    mask: numpy.ndarray = numpy.array([True, True, False, True])


class Cache:

    state: dict

    def __init__(self):
        self.state = {}

    def put(self, key: NamedTuple, value: numpy.ndarray) -> None:
        self.state[(key.variable, key.period, tuple(map(bool, key.mask)))] = value

    def get(self, key: NamedTuple) -> numpy.ndarray:
        return self.state[(key.variable, key.period, tuple(map(bool, key.mask)))]


cache = Cache()
value = numpy.array([10, 20, 30, 40])
cache.put(Key(), value)
result = cache.get(Key())

print(cache.state)
print(result)
print(value)

>>> {(2, 3, (True, True, False, True)): array([10, 20, 30, 40])}
>>> [10 20 30 40]
>>> [10 20 30 40]

@Morendil
Contributor

@maukoquiroga With respect to masking see the prior discussion which outlined the algorithm.

When we spoke yesterday I implied that the mask would be part of the cache key but that was an oversimplification, so it's possible we'll end up dealing with just (variable, period) as the cache key and have to overlay the masking logic on top of that. Sorry for being misleading.

That doesn't diminish the value of simplifying the Holders/Storage tangle.

Also one important thing missing from that roadmap, which should have high priority, is figuring out how we're going to remove all the calls to get_holder from the France model (which as a side-effect will answer the question "how do we advise other country package maintainers deal with the breaking change of removing Holders").

There are two clusters: one of them around age computations (which is why I worked on the age-related PR in France in the first place), the other around "inversions", such as the Reform that computes gross salary based on net.

As @sandcha noted, the way of dealing with ages that I suggested has some downsides, but it helps get rid of get_holder calls; @sandcha has proposed that we implement masking so that we can perform the age computations in France without the awkwardness, but implementing masking requires having that Cache API, which requires getting rid of holders: we can't allow circular dependencies in our Mikado map! So, perhaps a better plan is to temporarily change the age computations to that awkward method and later make them more flexible again.

I would suggest that inversions should be removed entirely from France (and explicitly documented as unsupported) for the time being as they are not properly "reforms" (speculative encodings of counterfactual law) but explorations of an alternative calculate protocol. Inversions require fine control of the computation cache, since the entire point of them is to repeatedly calculate the same target value based on different inputs - which invalidates the choice of (variable, period) as a cache key. They are a feature which Core lacks proper support for at the moment but will become easier to support as a result of two things, a cache API and a way to compute dependencies statically.

@Morendil
Contributor

there won't be a single line of caching code inside the functions

I don't think of that as a goal; it would certainly be nice. In practice, the caching logic is currently interleaved with the Spiral detection logic in Simulation::_calculate and I'm not quite convinced we can cleanly separate the two yet.

@bonjourmauko bonjourmauko marked this pull request as draft March 31, 2021 08:15
@bonjourmauko bonjourmauko changed the title [WIP] Provide an explicit caching API Provide an explicit caching API Mar 31, 2021
@bonjourmauko bonjourmauko added the policy:rfc Request For Comments: chime in! label Apr 1, 2021
@bonjourmauko bonjourmauko changed the title Provide an explicit caching API [WIP/RFC] Provide an explicit caching API Apr 1, 2021
@bonjourmauko
Member Author

Thanks a lot @Morendil for your feedback. I'm closing this, as the RFC showed there might be a better strategy to deal with caching and getting rid of holders 😃

@bonjourmauko bonjourmauko changed the title [WIP/RFC] Provide an explicit caching API [RFC] Provide an explicit caching API Apr 10, 2021
@bonjourmauko bonjourmauko deleted the add-cache-api branch September 29, 2021 12:05