Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FB] [PI-3478] Lenient metadata #3739

Merged
merged 46 commits into from
Jul 3, 2020

Conversation

bjlittle
Copy link
Member

@bjlittle bjlittle commented Jun 15, 2020

This PR introduces a new lenient client/service infrastructure and also extends the common metadata API.

This is the second PR that underpins issue #3478 and builds on PR #3583.

The lenient infrastructure offers a new workflow to iris that allows us (and users) to easily control the strictness of operations. This has been applied to metadata equality, combination and difference operations in this PR, but in principle lenient operations could easily be extended to other areas of iris e.g., merge and concatenate etc

The common metadata API has been extended to also support equality, inequality, combination and difference operations in both strict and lenient modes.

The iris.common.metadata module also supports its own logging, which includes a root logger for iris. Ideally, I'd like to overhaul the obscene use of warnings within Iris with replace their use with a logging based infrastructure instead - but that's for another day, and certainly not in this PR.

In addition, it is now possible to assign a partial Mapping, or metadata from another iris instance to any other e.g., the following

cube.metadata = coord.metadata
cube.metadata = dict(standard_name="air_temperature", units="K")

just works 😄


Required to git cherry-pick the following PRs from master to resolve known travis-ci test failure issues:


TBD:

  • remove WIP PR title and take out of draft
  • rebase with upstream/master to fix travis-ci
  • iris.common.metadata._NamedTupleMeta change token usage to explicit "_members"
  • iris.common.lenient.Lenient.register_client(..., append=False)
  • complete lenient.py test coverage
  • complete metadata.py test coverage
    • BaseMetadata
      • combine
      • difference
      • equal
    • AncillaryVariableMetadata
    • CellMeasureMetadata
    • CoordMetadata
    • Cubemetadata
  • check doc-strings
  • remove lenient/strict debug
    • adopt logging?
  • Lenient.context ephemeral __context client services append to active client services?
  • Limited public Lenient API veneer to _Lenient

Open questions:

  • decide on Lenient.context client services - clobber vs append?
  • decide on the default lenient member behaviour for named tuple metadata __eq__, compare and difference operations
  • decide on the difference result of equivalent namedtuple metadata instances - None vs instance?

Follow-up PRs:

  • support @lenient_service(enable=<boolean>) and Lenient.register_service(<service, enable=<boolean>)
  • implement Lenient.__dict__ clients and services namespace separation
  • cache namedtuple metadata instances for use by the MetadataManager.values property
  • implement iris logging infrastructure

@bjlittle bjlittle added this to the v3.0.0 milestone Jun 15, 2020
@bjlittle bjlittle changed the title Lenient metadata [FB] [PI-3478] [WIP] Lenient metadata Jun 15, 2020
lib/iris/common/metadata.py Outdated Show resolved Hide resolved
@pp-mo
Copy link
Member

pp-mo commented Jun 15, 2020

Only briefly scanned so far.
A couple of simple, structural points stuck out for me:

(1)
from what I can tell, the general principle of switchable operation settings, decorators and context blocks is independent of the particular implementation of the 'lenient' operations that it is used for -- in fact, it's pretty similar to the old "Future" class usage.
So that logic could be separated out from the operations themselves ?
(might that even be mostly just a rename, removing the references to 'lenient' operations per se ?)

(2)
I was surprised to see so much code repeated throughout the different Metadata subclasses.
There seems to be huge commonality in the different combine/compare operations : they basically differ only in what contained property is comnpared.
Also, each Metadata subclass overrides all of __eq__, combine, difference and equal,
but using identical code in each case.
Surely that can be tidied ?
( I guess that might mean more metaclasses, but possibly not : if we have a "standard set" of methods then a data structure can handle it. )

@pp-mo
Copy link
Member

pp-mo commented Jun 29, 2020

Like @lbdreyer (I think?), I find the existing "LENIENT.context" API rather unintuitive : hard to use, and hard to understand.
Naively, I'd expect this to have a form more like with LENIENT.context(switch_thing=True) : that would also make it much easier to make something explictly strict, just by saying "=False".
(Note : this is even more similar to the existing FUTURE usage).

However, with current code, I think a basic problem with that is that we can't use service identifiers as keywords, even though we have standard string forms for them, because the strings are not valid keywords !
So, I think we could fix that just by modifying the qualname function, so we can use qualnames as keywords.
I think this approach might also make it much easier + more obvious to have convenience "combination" settings of the sort that @lbdreyer is suggesting : They could be provided as extra keywords to "context".
So on the whole, I feel that a "context" API that uses keywords rather than arguments to connect clients + services would be more flexible and easier to use.

@pp-mo
Copy link
Member

pp-mo commented Jun 29, 2020

(and, probably related to above point...)
I'm a bit concerned that the current form of the control statements (@lenient_client and LENIENT.context) is very tightly coupled to our current model of what we want to do :
So, it is totally designed to "turn on a boolean option", but it's not so obvious to turn something off, and not at all to have options that might be non-binary.

I do appreciate that the binary lenient/strict choice is a really useful simplification in the existing proposals, but I still think we might want to leave space to extend this in future.
Again, the future needs of merge+concatenate bother me here : Just having "lenient" or "strict" might not be sufficiently useful there.

E.G. you might want a form that allows you have specify "load with a merge which is lenient on attributes but strict on names" ?

  • instead of with LENIENT.context(services=(service_A, service_B), active=Cube.__eq__):
  • something more like with LENIENT.context(Cube_eq={'Service_A':True, 'service_B':False}):

N.B. Sorry this isn't at all a clear suggestion yet
I think "merge" is a good candidate for something we may need more control over,
but we aren't yet at all clear what it might need

@bjlittle bjlittle closed this Jun 30, 2020
@pp-mo
Copy link
Member

pp-mo commented Jul 1, 2020

I've crossed over with this.
I have made some concrete suggestions towards changing the Lenient.context API, in the ways I was thinking about above.
All modified code + tests (passing!)
See : pp-mo#62

@pp-mo
Copy link
Member

pp-mo commented Jul 1, 2020

A couple of notes about the testing, from work on pp-mo#62

  1. refactoring there was unnecessarily hard because the tests in unit/common/lenient all rely on testing LENIENT.__dict__, though that isn't really a public interface (which is why I felt I could refactor it)
  2. we are anyway missing any functional integration tests for the behaviour of clients and services. I think we could really use some
    * E.G. I wrote something a bit like it at https://github.com/pp-mo/iris/blob/d67034ff071b1ee837ee48f5131ba5d8b8737a3f/lib/iris/tests/unit/common/lenient/test_Lenient.py#L668
    Note, the purpose of all that is frankly still a bit confused. But it does positively check the actual behaviour of a real client

@bjlittle bjlittle reopened this Jul 2, 2020
Copy link
Member

@lbdreyer lbdreyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have some tests to review, but I thought I'd submit the comments I have so far.

I have a few minor comments about the iris.common.lenient module, but most of my more pertinent comments are about iris.comon.metadata, so my preference is to get those resolved first if possible

lib/iris/common/lenient.py Outdated Show resolved Hide resolved
lib/iris/common/lenient.py Outdated Show resolved Hide resolved
lib/iris/common/lenient.py Show resolved Hide resolved
lib/iris/common/lenient.py Show resolved Hide resolved
lib/iris/common/metadata.py Show resolved Hide resolved
result = (
self.coord_system == other.coord_system
and self.climatological == other.climatological
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the plan was to be lenient with coord_system but strict with climatological?
So for:

cube_a.coord_system = Mercator
cube_b.coord_system = None

Should compare as True but this code would return False.


However, if we do want to be strict with both, could you rework this to use _members? It may save us in future when we add an extra member to CoordMetadata; saves us having to remember to also change it here?
Something like:

result = all([self.field == other.field for field in self._members])

Copy link
Member Author

@bjlittle bjlittle Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this is getting interesting, and to the heart of the question "what should be lenient?"

Personally, I think that climatological should be compared strictly in lenient mode - I guess simply because it's boolean, and mixing in leniency doesn't really help matters, so keep it simple.

For coord_system I flipped on this a few times. But at the end of the day I think that it's a pretty important core piece of metadata that makes a massive difference to interpreting the associated spatial coordinates. I think that it's so important that users should always specify it and not rely on defaults or assumptions. We've certainly not made any statements regarding the default coord_system if it's not specified. So in that regard I erred on the side of comparing strictly in lenient mode - it seems like a non-controversial, safe stance to take.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went explicit when comparing, but happy to be more generic as you suggest. I've applied that pattern elsewhere, so why not here 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK: almost any time that a cube coordinate is used, the coordinate system will make important differences (certainly in plotting, for example), so it makes sense to always check equality strictly. As @bjlittle says, the fact that we don't have a default coordinate system means we should also treat None as explicitly different to a specified coordinate system (IMHO).

value = (
self.cell_methods
if self.cell_methods == other.cell_methods
else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit undecided about this. What if we have:

cube_a.cell_methods = [cell_method_a, cell_method_b]
cube_b.cell_methods = [cell_method_a]

So they both have cell_method_a, but the first cube also has another cell_method which the second cube doesn't have.
Currently this just throws both away, but should we consider keeping the common cell_method?

Copy link
Member Author

@bjlittle bjlittle Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm it's a good question... again a case where I'm not sure so erred on the side of being strict in lenient mode.

I feel like it kinda, sorta depends on the operation the user is performing, and it's not possible to guess the context at this particular lower level in the code. My naive philosophy here was that if the cell methods are the same then it's pretty obvious to know what to do, otherwise don't even attempt to guess what to do (as it might be half right, or even worse completely wrong) so leave the user to apply their intelligence and experience to the metadata.

As a concrete example, consider adding a 3d cube (t, y, x) and a 2d (y, x) cube. They both have cell methods, but the 3d cube has two cell methods on the t and y dimensions, whereas the 2d cube has a single cell method on the y dimension. Now the result of the addition operation is a 3d cube (thru the wonders of broadcasting), and you may argue that the resultant cube should also have two cell methods - having one cell method would be misleading. In actual fact, the context of the operation lives with the addition operation, it can make a reasonable assertion of what the cell methods should be, and as it happens it throws away the cell methods, and I kinda agree with that. Even so, the user can easily add them back, if they care and it makes sense to them.

So, after revisiting this and coming full circle again, I'm back to the same conclusion that what I've proposed in this PR seems reasonable, and a safe stance to take (and don't ask me tomorrow, as I may change my mind 😄)

Anyways, I'm kinda glad that your finding it "non-obvious" and grappling with the same conundrums that I found myself pondering over 😉

I think we just need to be a fair mix of reasonable and opinionated here.... not unless I'm missing completely obvious.

I've no idea if this has been of any help, whatsoever to you...

)

@lenient_service
def combine(self, other, lenient=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have you decided to make these public?

Copy link
Member Author

@bjlittle bjlittle Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what do you mean? Make what public? The method? Or the lenient keyword argument?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that CubeMetadata.combine CubeMetadata.difference and CubeMetadata.equal are a part of the public API, so users could happily use them, and expect that in upcoming minor releases their code would still work?

I am possibly being a bit overcautious with what we add to the public API, since this is for a major release, so it would be a long time before we could make changes to anything public.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original intention was always for these methods to be public.

I've always seen them as offering stable behaviour (i.e., they're not going to fundamentally change from as is now) and that they offer real, tangible benefit and value not only to us for internal iris use, but also to users - they now have some rich behaviour to go along with the metadata overhaul that we've banked so far through the common metadata API.

So for me adding them completes this whole common metadata API that we're now offering, and I don't associate any risk with that (certainly no more so that introducing anything else new into iris).

I think that it would be a mistake not to make them public.

Copy link
Member

@lbdreyer lbdreyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments, specifically relating to metadata testing.

I still have to review the lenient.py testing but I'm hoping to do that tomorrow

def test_cannot_compare(self):
other = CubeMetadata(*(None,) * len(CubeMetadata._fields))
result = self.metadata.__eq__(other)
self.assertIs(NotImplemented, result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a more descriptive name for this test. Maybe something lik
test_cannot_compare_diff_class

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can imagine, I was running low on "creative test naming juice"... Happy to update it 👍

rmetadata = self.cls(**right)

self.assertTrue(lmetadata._compare_lenient(rmetadata))
self.assertTrue(rmetadata._combine_lenient(lmetadata))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_combine_lenient -> _compare_lenient

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoa... freaky test pass 😱

Nice spot, I'll fix that 👍


result = lmetadata._combine_lenient(rmetadata)
expected = list(left.values())
self.assertEqual(expected, result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that for most tests you check that the operation is commutative, for example in the next test (test_strict_units_different) you do :
result = lmetadata._combine_lenient(rmetadata)
as well as
result = rmetadata._combine_lenient(lmetadata)

For this test, however, you only do the one way,
result = lmetadata._combine_lenient(rmetadata)
I suppose it might be a bit overkil to do both? Did you only do it the one way when you were comparing the metadata objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I was rolling through the testing (while took a looooong time) my testing approach slowly matured and I progressed.

The commutative testing angle came along mid-way through, and I've forgotten to apply it retrospectively - nice spot 👀

I defo think it's worthwhile doing. So I'll sweep back through the tests and apply this where it might be applicable.

Copy link
Member Author

@bjlittle bjlittle Jul 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that I latched onto the commutative pattern of testing post BaseMetadata testing. So I'll add some similar coverage there now where it's missing... thanks.

Copy link
Contributor

@trexfeathers trexfeathers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to bank my comments on the functional code in case I don't finish looking at the tests (not really keeping up with everyone else!).

The implementation makes sense and is follow-able 👍 I have some hang-ups about the generality of some of the error messages, which I commented on in the code, but I wouldn't consider these deal-breakers.

lib/iris/common/mixin.py Outdated Show resolved Hide resolved
lib/iris/common/lenient.py Outdated Show resolved Hide resolved
lib/iris/common/lenient.py Outdated Show resolved Hide resolved
lib/iris/common/lenient.py Outdated Show resolved Hide resolved
lib/iris/common/lenient.py Outdated Show resolved Hide resolved
lib/iris/common/lenient.py Outdated Show resolved Hide resolved
lib/iris/common/lenient.py Outdated Show resolved Hide resolved
lib/iris/common/metadata.py Show resolved Hide resolved
result = (
self.coord_system == other.coord_system
and self.climatological == other.climatological
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK: almost any time that a cube coordinate is used, the coordinate system will make important differences (certainly in plotting, for example), so it makes sense to always check equality strictly. As @bjlittle says, the fact that we don't have a default coordinate system means we should also treat None as explicitly different to a specified coordinate system (IMHO).

lib/iris/common/metadata.py Outdated Show resolved Hide resolved
@lbdreyer
Copy link
Member

lbdreyer commented Jul 3, 2020

Okay I'm finished reviewing the tests now and I don't have any outstanding issues I want to raise.

So that means... 🥁 (drumroll) 🥁 ... I am going to merge this in! Just in time for the Friday deadline 🙌

Lots of hard work has gone in to this PR! 💯 💯 A big thank to @bjlittle and everyone who helped to review!!

@lbdreyer lbdreyer merged commit fc5f8f0 into SciTools:cube-arithmetic Jul 3, 2020
@bjlittle
Copy link
Member Author

bjlittle commented Jul 4, 2020

@lbdreyer Awesome! Thanks for pushing this PR over the line, much appreciated... this now unblocks me for the next and last PR in this feature branch, which will finally fix cube arithmetic 😄

bjlittle added a commit to bjlittle/iris that referenced this pull request Aug 14, 2020
* add lenient infra-structure

* add metadata lenient __eq__ support

* complete __eq__, combine and difference support

* explicit inherited lenient_service + support equal convenience

* fix attributes difference + lenient kwargs

* make lenient public + minor tidy

* rename MetadataManagerFactory to metadata_manager_factory

* extend lenient_client decorator to support services registration

* add lenient test coverage

* purge qualname usage in metadata.py

* support global enable for lenient services

* support partial mapping metadata assignment

* purge Lenient.__setattr__ from api

* add BaseMetadata compare test coverage

* metadata rationalisation

* add BaseMetadata difference test coverage

* added context manager ephemeral comment clarification

* add BaseMetadata __ne__ test coverage

* standardise lenient decorator closure names

* add BaseMetadata equal test coverage

* half dunder context

* add AncillaryVariableMetadata test coverage

* add additional AncillaryVariableMetadata test coverage

* add CellMeasureMetadata test coverage

* Clarify lenient_service operation + simplify code.

* add CoordMetadata test coverage

* add CubeMetadata test coverage

* metadata tests use self.cls

* fix typo

* fix context manager ephemeral services

* add logging

* Pin pillow to make graphics tests work again. (SciTools#3630)

* Fixed tests since Numpy 1.18 deprecation of non-int num arguments for linspace. (SciTools#3655)

* Switched use of datetime.weekday() to datetime.dayofwk. (SciTools#3687)

* New image hashes for mpl 3x2 (SciTools#3682)

* New image hash for iris.test.test_plot.TestSymbols.test_cloud_cover with matplotlib 3.2.0.

* Further images changes for mpl3x2.

* Yet more updated image results.

* fix sentinel uniqueness test failure

* remove redundant cdm mapping test

* difference returns None for no difference

* protect Lenient and LENIENT private

* privitise lenient framework and add API veneer

* add explicit maths feature default

* review actions

* review actions

* trexfeathers review actions

* stephenworsley review actions

Co-authored-by: Patrick Peglar <[email protected]>
Co-authored-by: Martin Yeo <[email protected]>
abooton pushed a commit that referenced this pull request Aug 14, 2020
* PI-3478: Common metadata API (#3583)

* common metadata api

* rationalise _cube_coord_common into common

* move state into metadata

* MetadataFactory test coverage

* temporarily pin back iris-grib

* test coverage for iris.common.metadata._BaseMeta

* test coverage for iris.common.mixin.LimitedAttributeDict

* remove temporary iris-grib pin

* review actions

* Update lib/iris/tests/unit/common/metadata/test_BaseMetadata.py

Co-Authored-By: lbdreyer <[email protected]>

* [FB] [PI-3478] Lenient metadata (#3739)

* add lenient infra-structure

* add metadata lenient __eq__ support

* complete __eq__, combine and difference support

* explicit inherited lenient_service + support equal convenience

* fix attributes difference + lenient kwargs

* make lenient public + minor tidy

* rename MetadataManagerFactory to metadata_manager_factory

* extend lenient_client decorator to support services registration

* add lenient test coverage

* purge qualname usage in metadata.py

* support global enable for lenient services

* support partial mapping metadata assignment

* purge Lenient.__setattr__ from api

* add BaseMetadata compare test coverage

* metadata rationalisation

* add BaseMetadata difference test coverage

* added context manager ephemeral comment clarification

* add BaseMetadata __ne__ test coverage

* standardise lenient decorator closure names

* add BaseMetadata equal test coverage

* half dunder context

* add AncillaryVariableMetadata test coverage

* add additional AncillaryVariableMetadata test coverage

* add CellMeasureMetadata test coverage

* Clarify lenient_service operation + simplify code.

* add CoordMetadata test coverage

* add CubeMetadata test coverage

* metadata tests use self.cls

* fix typo

* fix context manager ephemeral services

* add logging

* Pin pillow to make graphics tests work again. (#3630)

* Fixed tests since Numpy 1.18 deprecation of non-int num arguments for linspace. (#3655)

* Switched use of datetime.weekday() to datetime.dayofwk. (#3687)

* New image hashes for mpl 3x2 (#3682)

* New image hash for iris.test.test_plot.TestSymbols.test_cloud_cover with matplotlib 3.2.0.

* Further images changes for mpl3x2.

* Yet more updated image results.

* fix sentinel uniqueness test failure

* remove redundant cdm mapping test

* difference returns None for no difference

* protect Lenient and LENIENT private

* privitise lenient framework and add API veneer

* add explicit maths feature default

* review actions

* review actions

* trexfeathers review actions

* stephenworsley review actions

Co-authored-by: Patrick Peglar <[email protected]>
Co-authored-by: Martin Yeo <[email protected]>

* [FB] [PI-3478] Lenient cube arithmetic (#3774)

* initial cube arithmetic

* support in-place cube resolve

* fix non in-place broadcasting

* remove temporary resolve scenario test

* lenient/strict support for attributes dicts with numpy arrays

* lenient/strict treatment of scalar coordinates

* strict points/bounds matching

* lenient/strict prepare local dim/aux/scalar coordinates

* support extended broadcasting

* always raise exception on points/bounds mismatch

* ignore scalar points/bounds mismatches, lenient only

* remove todos

* tidy logger debugs

* qualify src/tgt cube references in debug

* Numpy rounding fix (#3758)

ensure rounding is numpy like (maintains type)

* avoid unittest.mock.sentinel copy issue

* fast load np.int32

* fix cube maths doctest

* fix iris.common.resolve logging configuration

* fix prepare points/bounds + extra metadata cml

* support mapping reversal based on free dims

* var_name fix for lenient equality

* add support for DimCoordMetadata

* fix circular flag + support CoordMetadata and DimCoordMetadata exchange

* fix circular issue for concatenate DimCoord->AuxCoord demotion

* fix concatenate._CubeSignature sorted

* minor tweaks

* keep lenient_client private in maths

* tidy maths

* tidy iris.analysis.maths.IFunc

* refactor IFunc test

* polish in-place support

* tidy metadata_resolve

Co-authored-by: stephenworsley <[email protected]>

* rebase master fix-up for cube arithmetic

* add missing new dependency to readthedocs.yml requirements

Co-authored-by: lbdreyer <[email protected]>
Co-authored-by: Patrick Peglar <[email protected]>
Co-authored-by: Martin Yeo <[email protected]>
Co-authored-by: stephenworsley <[email protected]>
@bjlittle bjlittle deleted the lenient-metadata branch February 15, 2021 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants