Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix axes expand mecanism #933

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Fix axes expand mecanism #933

wants to merge 2 commits into from

Conversation

guillett
Copy link
Member

@guillett guillett commented Jan 3, 2020

* suffixes are axe indexes instead of just element index in expanded array
assert simulation_builder.get_ids('households') == ['housea0', 'houseb1', 'housea2', 'houseb3']
assert simulation_builder.get_input('rent', '2018-11') == approx([0, 0, 3000, 0])
assert simulation_builder.get_ids('households') == ['housea0', 'houseb0', 'housea1', 'houseb1']
assert simulation_builder.get_input('rent', '2018-11') == approx([0, 0, 3000, 3000])
Copy link
Member Author

@guillett guillett Jan 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjello your insight would be appreciated. What would you expect in given this input data? (You will have to expand the visible code snippet to get the full context)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never used it in that case (or i do not remember). I think I would stick to your convention.

Copy link
Member Author

@guillett guillett Jan 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjello 👌 :)
@Morendil do you recall why you pick that expectation for get_input('rent', '2018-11')? Many thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to say because this was quite some time ago, but it's safest to assume that this was to reproduce the behaviour of the existing implementation, as I understood it (perhaps poorly).

The problem here is that the YAML specification for axes has lots of ambiguity. When using axes, we replicate a seed simulation N times along one or more axes. It's easy to say what this means when we have one person in the seed simulation and we say "vary this along an axis of salary between 0 and 3000", or one household and we say "vary the rent between 0 and 3000".

It becomes more complicated when there are two households, because then a "rent" value for instance is no longer a scalar but a vector. So what does it mean to say "replicate this two-household seed simulation varying the rent from 0 to 3000" ? This "0" is in fact a shorthand for the vector [0, 0]. But the target value is ambiguous. It could be the vector [3000,0] or the vector [3000,3000].

As usual with OpenFisca the problem is not choosing a behaviour. There is not always a "right" or "wrong" answer and sometimes many ways could make sense. The problem is when you change something that was not formally specified but that other people have come to rely on, mentally supplying their own informal specification which is not compatible with your way. So now it works for you but it breaks for others.

So my best advice is to be careful, and to give as much explicit information as you can about why you would pick one way rather than the other. Document why you think it should work that way, with real-world examples. Compare with other users' examples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @adrienpacifico uses this as well and could give us some insight as how he understands the behaviour and some real-world examples?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maukoquiroga Yes, insofar as it's an example of something you might actually want to calculate where axes play a role; but also no, as it's not an example of what's being discussed here i.e. how specifically we expect a seed situation with more than one household to be expanded along an axis. So you couldn't really conclude anything from that specific example that motivates a decision in this present case.

@@ -146,4 +146,4 @@ def test_simulation_with_axes(simulation_builder):
data = yaml.safe_load(input_yaml)
simulation = simulation_builder.build_from_dict(tax_benefit_system, data)
assert simulation.get_array('salary', '2018-11') == approx([0, 0, 0, 0, 0, 0])
assert simulation.get_array('rent', '2018-11') == approx([0, 0, 3000, 0])
assert simulation.get_array('rent', '2018-11') == approx([0, 0, 3000, 3000])
Copy link
Member Author

@guillett guillett Jan 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here @benjello. (You will have to expand the visible code snippet to get the full context.)

Many thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@bonjourmauko bonjourmauko added kind:feat A feature request, a feature deprecation and removed kind:feat A feature request, a feature deprecation labels Jan 3, 2020
@@ -45,7 +45,7 @@ def test_add_axis_with_group(simulation_builder, persons):
simulation_builder.add_parallel_axis({'count': 2, 'name': 'salary', 'min': 0, 'max': 3000, 'period': '2018-11'})
simulation_builder.expand_axes()
assert simulation_builder.get_count('persons') == 4
assert simulation_builder.get_ids('persons') == ['Alicia0', 'Javier1', 'Alicia2', 'Javier3']
assert simulation_builder.get_ids('persons') == ['Alicia0', 'Javier0', 'Alicia1', 'Javier1']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is prudent; if two people have the same name then ids will be non-unique.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That non-uniqueness would be consistent with the input data.
IMHO, the axe expand mecanism should not try to mitigate non-uniqueness issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding non-uniqueness, SimulationBuilder already ignores duplicated people before, as you can see with this test:

def test_ignores_duplicated_people(simulation_builder, persons):
    simulation_builder.add_person_entity(persons, {'Alicia': {}, 'Alicia': {}, 'Alicia': {}})
    assert simulation_builder.get_count('persons') == 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After taking a look at the code, I guess this change won't have an impact on simulations as those ids are used solely in the context of axes expansion.

@Morendil
Copy link
Contributor

Morendil commented Jan 4, 2020

@guillett What leads you to conclude that the currently implemented behaviour is wrong ?

I cannot say I am 100% certain that the existing code is correct, it took me quite a bit of effort at the time but it was long ago and I don't remember the context well enough; but I'd advise you to be careful as it's a somewhat complex feature.

@guillett
Copy link
Member Author

guillett commented Jan 4, 2020

Thanks @Morendil, that PR is an refactor extracted from #934.

The expected array get_input('rent', '2018-11') == approx([0, 0, 3000, 0]) isn't what I expect from a parallel axis.

Plus, I don't understand why/how array[axis_index:: axis_entity_step_size] = [...]] was supposed to work.

@benjello
Copy link
Member

benjello commented Mar 5, 2021

@guillett @maukoquiroga : ce serais bien de voir si ce point est toujours à régler ?

@guillett
Copy link
Member Author

guillett commented Mar 5, 2021

Ce problème est toujours à régler mais on a de temps disponible et c'est un problème qui n'apparait que dans des cas assez particuliers et en particulier sur https://betagouv.github.io/mes-aides-changent/ avec plusieurs membres dans une famille.

@bonjourmauko bonjourmauko added the kind:fix Bugs are defects and failure demand. label Apr 1, 2021
@bonjourmauko
Copy link
Member

Hi @benjello and @guillett ! I've been working on axes this weekend in order to better understand what is expected and how this proposal could be just a bug fix or a new feature. As it has been already mentioned, it is a bit an archeological work of trial, error, and documentation.

What do I know up to this point:

  • The expanding behaviour proposed by @guillett seems to be the one we would expect for expanding a group entity, that is given the scenario of expanding rent over two households, I would expect either of the following:
[0, 0, 3000, 3000]
[0, 3000, 0, 3000]

The current behaviour of [0, 0, 3000, 0] seems a bug to me.

  • I was curious to know how that would be applied in the scenario of a group entity expanded over a parallel and a perpendicular axis, and here's what I've found:

I've tried an got passing in master the following:

def test_add_perpendicular_axes_on_group(simulation_builder, persons, group_entity):
    simulation_builder.add_person_entity(persons, {'Alicia': {}, 'Javier': {}})
    simulation_builder.add_group_entity('persons', ['Alicia', 'Javier', 'Tom'], group_entity, {
        'housea': {'parents': ['Alicia']},
        'houseb': {'parents': ['Tom'], 'children': ['Javier']},
        })
    simulation_builder.register_variable('rent', group_entity)
    simulation_builder.register_variable('taxes', group_entity)
    simulation_builder.add_parallel_axis({'count': 3, 'name': 'rent', 'min': 0, 'max': 3000, 'period': '2018-11'})
    simulation_builder.add_perpendicular_axis({'count': 2, 'name': 'taxes', 'min': 0, 'max': 2000, 'period': '2018-11'})
    simulation_builder.expand_axes()
    assert simulation_builder.get_input('rent', '2018-11') == approx([0, 0, 1500, 0, 3000, 0, 0, 0, 1500, 0, 3000, 0])
    assert simulation_builder.get_input('taxes', '2018-11') == approx([0, 0, 0, 0, 0, 0, 2000, 0, 2000, 0, 2000, 0])
    assert simulation_builder.get_ids('households') == ['housea0', 'houseb1', 'housea2', 'houseb3', 'housea4', 'houseb5', 'housea6', 'houseb7', 'housea8', 'houseb9', 'housea10', 'houseb11']

It would probably make sense to rather have it this way:

[0, 0, 0, 0, 1500, 1500, 1500, 1500, 3000, 3000, 3000, 3000]
[0, 0, 0, 0, 0, 0, 2000, 2000, 2000, 2000, 2000, 2000]

I've tried that test with the changes proposed in this pull request and it fails:

ValueError: could not broadcast input array from shape (6) into shape (12)
  • Regarding ids, currently they're not meaningful in terms of dimensions: we do not have any way to know over which axis we're expanding each household.

So your proposal makes sense, but opens the question I guess posed already by @Morendil: if by meaningful we mean expanded entities should be named per axis count and not per cell count, How should we name them when we're expanding them over several axes? (in this case a parallel and a perpendicular one).


What's next?

I think the current documentation about how axes are supposed to work is incomplete, and I would be in favour of making those contracts explicit with tests and documentation —I've started that effort here #996. This will be a breaking change, but as a counterpart, give users a clear way of understanding how axes are being expanded.

To be continued.

@bonjourmauko
Copy link
Member

How should we name them when we're expanding them over several axes? (in this case a parallel and a perpendicular one)

Maybe that one is actually easy, something like this for two parallel with 2 steps and one perpendicular with 3 steps axes?

HouseholdA[(0, 0)][0]
HouseholdB[(0, 0)][0]
HouseholdA[(1, 0)][0]
HouseholdB[(1, 0)][0]
HouseholdA[(2, 0)][0]
HouseholdB[(2, 0)][0]
HouseholdA[(0, 1)][0]
HouseholdB[(0, 1)][0]
HouseholdA[(1, 1)][0]
HouseholdB[(1, 1)][0]
HouseholdA[(2, 1)][0]
HouseholdB[(2, 1)][0]

...

And so on.

@bonjourmauko bonjourmauko added this to the Add support for weeks milestone Sep 17, 2024
@bonjourmauko bonjourmauko mentioned this pull request Sep 17, 2024
@bonjourmauko bonjourmauko mentioned this pull request Sep 25, 2024
32 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:fix Bugs are defects and failure demand.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants