
[RFC] Provide an explicit caching API #947

Closed
wants to merge 26 commits

Conversation

bonjourmauko
Member

@bonjourmauko bonjourmauko commented Feb 8, 2020

Connected to #887
Connected to #891

What's inside?

A possible implementation of #891, trying to keep the current contracts as they are.

This is a proposal to separate concerns between caching and storage logic: move all caching-related tasks from Holder to Cache, and all storage logic from everywhere else to MemoryStorage and DiskStorage.

Assuming those low-level classes were meant to be private, the new Cache class respects the previous classes' contracts, and warns the user about deprecated methods that will eventually be removed.

Technical changes

  • Create MemoryStorage and move storage logic here.
  • Create DiskStorage and move storage logic here.
  • Extract eternalness checks somewhere else.
  • Merge current caching implementations into a single Cache class.
  • Add warning messages for deprecated methods.
  • Keep current contracts by dispatching to the new ones.
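A minimal sketch of what the storage half of this split might look like. The class name follows the bullet points above, but the exact method signatures are assumptions, not the PR's actual code:

```python
import numpy


class MemoryStorage:
    """In-memory backend: keeps computed vectors in a plain dict.

    Keys are expected to be hashable (e.g. (variable, period) tuples);
    values are NumPy arrays, as elsewhere in Core.
    """

    def __init__(self):
        self._arrays = {}

    def put(self, key, value: numpy.ndarray) -> None:
        self._arrays[key] = value

    def get(self, key):
        # Returns None on a cache miss rather than raising.
        return self._arrays.get(key)

    def delete(self, key) -> None:
        self._arrays.pop(key, None)
```

A DiskStorage counterpart would expose the same interface but persist each array to a file, which is what lets the two backends be swapped behind Cache.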

What's next?

  • Somehow get rid of the eternal / non-eternal variable check.
  • Move caching logic out of Holder into Cache, notably the lookup.
  • Preserve the contracts of calculate, set_input*,
  • One simulation, one cache.
  • Cache.lookup(function(variable, period))
  • Get rid of Holder.

Thanks for your feedback!

@bonjourmauko bonjourmauko added the kind:refactor Refactoring and code cleanup label Feb 8, 2020
@bonjourmauko bonjourmauko requested a review from a team February 8, 2020 17:31
@bonjourmauko bonjourmauko force-pushed the add-cache-api branch 3 times, most recently from 479b878 to b1eea4b Compare February 9, 2020 19:09
@bonjourmauko bonjourmauko changed the title [WIP] Provide an explicit caching API Provide an explicit caching API Feb 10, 2020
@bonjourmauko bonjourmauko added kind:discovery Issue requires discovery: value, ux and tech and removed kind:refactor Refactoring and code cleanup labels Feb 10, 2020
@bonjourmauko
Member Author

@Morendil gave it another try after your initial feedback :) any further advice would be highly appreciated.

@bonjourmauko bonjourmauko requested a review from a team February 10, 2020 22:52
@bonjourmauko bonjourmauko force-pushed the add-cache-api branch 2 times, most recently from 3195a15 to 1cad5b7 Compare February 10, 2020 23:01
@Morendil
Contributor

This still isn't the right tack IMO. It leaves the existing relationships between instances untouched, with the Variable -> Holder -> Cache linkage roughly reproducing the Variable -> Holder -> Storage scheme.

To sketch it out briefly, I believe the Cache belongs to the Simulation (one Cache per Simulation), and functions as a "two-dimensional dictionary" - or to put it another way, a dictionary whose keys are tuples, specifically (variable, period) tuples; the values are Numpy arrays as usual. The logic to deal with eternal values will likely be a thin adapter in front of this dictionary. To preserve the functionality afforded by OnDiskStorage we want to be able to persist and reload this dictionary but this is a relatively trivial side-feature, not part of the core logic.
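The "two-dimensional dictionary" described above could be sketched like this. The ETERNITY sentinel and the method names are assumptions for illustration, not existing Core API:

```python
import numpy

# Sentinel period for values that never change (eternal variables).
ETERNITY = "eternity"


class Cache:
    """One per Simulation: a dict keyed by (variable, period) tuples."""

    def __init__(self):
        self._values = {}

    def put(self, variable: str, period: str, value: numpy.ndarray) -> None:
        self._values[(variable, period)] = value

    def get(self, variable: str, period: str):
        # Thin adapter for eternal values: stored once under ETERNITY,
        # returned whatever period is asked for.
        if (variable, ETERNITY) in self._values:
            return self._values[(variable, ETERNITY)]
        return self._values.get((variable, period))
```

Persisting this cache then reduces to serializing one dict of arrays, which keeps the OnDiskStorage feature a side concern rather than core logic.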

The Cache functionality should be relatively compact, in LOC terms, at parity with existing functionality; where it starts getting potentially more complex is when we add masking.

So the way I'd approach this work would be to take the point of view of the Simulation, and pretend Holders and Storage and all this stuff don't exist at all. From the perspective of the Simulation, all I need is a way to answer "do I already have a value for this variable at this period that I can use, or do I need to compute it", and of course a way to register "I've just computed a value for this variable at this period so I'll store it to avoid computing it again".

Separately, test that the Cache state of a simulation can be persisted and loaded again. This might be as simple as serializing and deserializing the array; while we're still not clear on how that feature is used in practice and exactly what needs it supports, I suspect the persistence medium is an implementation detail, and the current implementation doesn't best reflect those needs. So shunting that feature "to the side" gives us more flexibility.

Since this is all essentially a matter of side-effects mocks would make some sense to me in testing this; request the same computation twice on a Simulation and check that only the first request causes a call to calculate for instance, with the second instead resulting in a Cache lookup.
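The mock-based test described above might look like this. The Simulation stub is hypothetical; in a real test you would patch the actual formula:

```python
from unittest.mock import Mock

import numpy


# Hypothetical simulation that consults its cache before computing.
class Simulation:
    def __init__(self, formula):
        self.formula = formula  # the costly computation, mocked below
        self.cache = {}

    def calculate(self, variable, period):
        key = (variable, period)
        if key not in self.cache:
            self.cache[key] = self.formula(variable, period)
        return self.cache[key]


formula = Mock(return_value=numpy.array([1.0, 2.0]))
simulation = Simulation(formula)

simulation.calculate("salary", "2020-01")
simulation.calculate("salary", "2020-01")

assert formula.call_count == 1  # the second request hit the cache
```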

@bonjourmauko bonjourmauko changed the title Provide an explicit caching API [WIP] Provide an explicit caching API Feb 11, 2020
@bonjourmauko
Member Author

bonjourmauko commented Feb 11, 2020

Thanks a lot, it's starting to look like a roadmap.

So the way I'd approach this work would be to take the point of view of the Simulation, and pretend Holders and Storage and all this stuff don't exist at all. From the perspective of the Simulation, all I need is a way to answer "do I already have a value for this variable at this period that I can use, or do I need to compute it", and of course a way to register "I've just computed a value for this variable at this period so I'll store it to avoid computing it again".

If calculate is, for example:

calculate: Callable[[Variable, Period], ndarray]
cache: Callable[[Callable[[Variable, Period], ndarray]], Callable[[Variable, Period], ndarray]]

Separately, test that the Cache state of a simulation can be persisted and loaded again. This might be as simple as serializing and deserializing the array; while we're still not clear on how that feature is used in practice and exactly what needs it supports, I suspect the persistence medium is an implementation detail, and the current implementation doesn't best reflect those needs. So shunting that feature "to the side" gives us more flexibility.

DiskStorage always persists on disk, and you can dump and restore as expected; it works quite nicely (there's a test for that™). What happens is that all stored files are destroyed once the cache object is garbage collected (__del__(self)), unless you tell the cache not to do so. Again, most of this caching is variable caching; entities and so on are handled somewhere else ...

As for array serializing and deserializing, I think it's done like:

numpy.dump(enum.possible_values)
EnumArray(file, enum)

In my humble perspective, it would be simpler to just:

hashed_array = tuple(map(tuple, array))
array = numpy.asarray(hashed_array)

### in case you'd like to have

{(variable, period, hashed_mask): numpy.ndarray}

One additional test would be to do a restore from a set of fixture files that are independent of the test case.


In other words, extract holders from calculate, for example, and wrap calculate with a HOF that does the job. At that point, if you suppress the cache, things will continue to work, just without caching, as there won't be a single line of caching code inside the functions themselves.
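The higher-order-function idea could be sketched as follows: calculate stays free of caching code, and the wrapper supplies it. All names here are illustrative, not the actual Core API:

```python
import functools

import numpy


def cached(calculate):
    """HOF: wrap a (variable, period) -> ndarray function with a cache."""
    state = {}

    @functools.wraps(calculate)
    def wrapper(variable, period):
        key = (variable, period)
        if key not in state:
            state[key] = calculate(variable, period)
        return state[key]

    return wrapper


@cached
def calculate(variable, period):
    # Stand-in for the real computation; note: no caching code in here.
    return numpy.array([100, 200, 500])
```

Suppressing the cache is then just a matter of removing the decorator; calculate itself is unchanged.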

@bonjourmauko
Member Author

bonjourmauko commented Feb 12, 2020

Here's what I've been thinking about masks; I'm not sure whether I truly understood the idea, @Morendil and @sandcha:

In [95]: v = 1

In [96]: p = 2

In [97]: m = tuple(map(bool, numpy.array([True, False, True])))

In [98]: key = (v, p, m)

In [99]: k = numpy.array([100, 200, 500])

In [100]: state = {key: k}

In [101]: numpy.array(key[2]) * state[key]
Out[101]: array([100,   0, 500])

In this example, you only need the key to be hashable; for instance, if you want the mask vector to be part of the key, you first transform it into a tuple.

@bonjourmauko
Member Author

bonjourmauko commented Feb 12, 2020

Another example :

import numpy

from typing import NamedTuple


class Key(NamedTuple):
    variable: int = 2
    period: int = 3
    mask: numpy.ndarray = numpy.array([True, True, False, True])


class Cache:

    state: dict

    def __init__(self):
        self.state = {}

    def put(self, key: NamedTuple, value: numpy.ndarray) -> None:
        self.state[(key.variable, key.period, tuple(map(bool, key.mask)))] = value

    def get(self, key: NamedTuple) -> numpy.ndarray:
        return self.state[(key.variable, key.period, tuple(map(bool, key.mask)))]


cache = Cache()
value = numpy.array([10, 20, 30, 40])
cache.put(Key(), value)
result = cache.get(Key())

print(cache.state)
print(result)
print(value)

>>> {(2, 3, (True, True, False, True)): array([10, 20, 30, 40])}
>>> [10 20 30 40]
>>> [10 20 30 40]

@Morendil
Contributor

@maukoquiroga With respect to masking see the prior discussion which outlined the algorithm.

When we spoke yesterday I implied that the mask would be part of the cache key but that was an oversimplification, so it's possible we'll end up dealing with just (variable, period) as the cache key and have to overlay the masking logic on top of that. Sorry for being misleading.

That doesn't diminish the value of simplifying the Holders/Storage tangle.

Also one important thing missing from that roadmap, which should have high priority, is figuring out how we're going to remove all the calls to get_holder from the France model (which as a side-effect will answer the question "how do we advise other country package maintainers deal with the breaking change of removing Holders").

There are two clusters: one of them around age computations (which is why I worked on the age-related PR in France in the first place), the other around "inversions", such as the Reform that computes gross salary based on net.

As @sandcha noted, the way of dealing with ages that I suggested has some downsides, but it helps get rid of get_holder calls; @sandcha has proposed that we implement masking so that we can perform the age computations in France without the awkwardness, but implementing masking requires having that Cache API, which requires getting rid of holders: we can't allow circular dependencies in our Mikado map! So, perhaps a better plan is to temporarily change the age computations to that awkward method and later make them more flexible again.

I would suggest that inversions should be removed entirely from France (and explicitly documented as unsupported) for the time being as they are not properly "reforms" (speculative encodings of counterfactual law) but explorations of an alternative calculate protocol. Inversions require fine control of the computation cache, since the entire point of them is to repeatedly calculate the same target value based on different inputs - which invalidates the choice of (variable, period) as a cache key. They are a feature which Core lacks proper support for at the moment but will become easier to support as a result of two things, a cache API and a way to compute dependencies statically.

@Morendil
Contributor

there won't be a single line of caching code inside the functions

I don't think of that as a goal; it would certainly be nice. In practice, the caching logic is currently interleaved with the Spiral detection logic in Simulation::_calculate and I'm not quite convinced we can cleanly separate the two yet.

@bonjourmauko bonjourmauko marked this pull request as draft March 31, 2021 08:15
@bonjourmauko bonjourmauko changed the title [WIP] Provide an explicit caching API Provide an explicit caching API Mar 31, 2021
@bonjourmauko bonjourmauko added the policy:rfc Request For Comments: chime in! label Apr 1, 2021
@bonjourmauko bonjourmauko changed the title Provide an explicit caching API [WIP/RFC] Provide an explicit caching API Apr 1, 2021
@bonjourmauko
Member Author

Thanks a lot @Morendil for your feedback. I'm closing this, as the RFC showed there might be a better strategy to deal with caching and getting rid of holders 😃

@bonjourmauko bonjourmauko changed the title [WIP/RFC] Provide an explicit caching API [RFC] Provide an explicit caching API Apr 10, 2021
@bonjourmauko bonjourmauko deleted the add-cache-api branch September 29, 2021 12:05