
it should be possible to have pyff stop a pipeline unless all load statements succeed #62

Open
leifj opened this issue Apr 28, 2015 · 11 comments

Comments

@leifj
Contributor

leifj commented Apr 28, 2015

This is related to #61

@leifj
Contributor Author

leifj commented Jul 2, 2015

Thinking about how to implement this in the load pipeline, it looks like it will be quite involved because of the complex error handling in that function. My thought is instead to add a separate "check" pipe that can be used as a kind of assert-like function: it checks that a set of conditions holds in the current active store and fails appropriately otherwise. For instance it could look something like this:

- load:
   - source1 as ONE
   - source2 as TWO
- check:
   - entityID_1 in ONE # true iff entityID_1 is in collection ONE
   - any in entities   # entities == the full collection, true iff there is at least 1 entity in the store
   - entityID_2 in entities # true iff entityID_2 is in any collection

This covers all cases where a fetch fails completely (e.g. the eduGAIN MDS is down or a signature validation fails). The second type of failure is that a number of entities are filtered from a source (e.g. because of syntax problems). From the check pipe it would be relatively simple to make the parse errors available, so you could do something like:

- check not:  # let's put all negation on the outside of the parentheses for simplicity
   - any in ERROR  # true iff at least one entity has been filtered

or

- check not: 
   - ONE in ERROR  # true iff at least one entity from ONE has been filtered

or

- check not: 
   - entityID_1 in ERROR  # true iff entityID_1 has been filtered

The latter form is perhaps less useful since it is equivalent to "entityID_1 in any".

I'm not 100% sure about the syntax and would appreciate feedback! When check fails it would simply interrupt the pipeline with an exception. For pyffd this means returning a 500 which probably will result in metadata clients retrying later. This is probably the correct behavior.
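To make the proposed semantics concrete, here is a rough sketch of how such a check pipe could evaluate conditions (hypothetical code, not pyff's actual implementation; the store is modelled as a plain dict of collections, and the `check`/`CheckFailed` names are made up for illustration):

```python
# Hypothetical sketch of a "check" pipe (not pyff's actual code).
# The store is modelled as a dict mapping collection names to sets of
# entityIDs; "entities" would be the full collection, "ERROR" the
# filtered ones.
class CheckFailed(Exception):
    pass

def check(store, conditions, negate=False):
    for cond in conditions:
        lhs, _, coll = cond.partition(" in ")
        members = store.get(coll.strip(), set())
        # "any in X" is true iff X is non-empty; otherwise test membership
        holds = bool(members) if lhs.strip() == "any" else lhs.strip() in members
        if holds == negate:  # assertion failed (inverted for "check not")
            raise CheckFailed(cond)

store = {"ONE": {"entityID_1"}, "entities": {"entityID_1", "entityID_2"}}
check(store, ["entityID_1 in ONE", "any in entities"])  # passes silently
check(store, ["any in ERROR"], negate=True)             # "check not": passes, ERROR is empty
```

Raising an exception is all the pipe needs to do; the pipeline machinery turns that into an aborted run (or a 500 in pyffd).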

@salaun-urennes1

Hi Leif,

The "check" pipeline you describe would certainly be useful, allowing one to implement checks on the content of loaded SAML metadata.

The issue #47 was referring to lower-level checks like "the metadata file/URL is accessible" or "the SAML metadata is proper XML and could be loaded". I think these types of checks should be hard-coded in pyff, because if one of the defined metadata inputs is missing for one of these reasons, the result of the pipeline cannot be trusted anymore.

@leifj
Contributor Author

leifj commented Jul 17, 2015

Yeah, pyff won't expose invalid metadata or invalid URLs (obviously) via the repository. The only thing we're talking about is filtered EntityDescriptor elements. The 'check' pipe is meant to allow you to stop processing unless certain critical entities are present, e.g. your "own" metadata.

@pmeulen
Contributor

pmeulen commented Oct 8, 2015

Discussed this with @salaun-renater. The desired behaviour would be for pyff to stop when it encounters a problem loading metadata. With "stop" I mean exiting pyff with a non-zero exit code. This way a problem in the metadata (pipeline) can't be missed. The use case here is using pyff as one of the tools in a larger metadata processing setup.

After working through the pyff pipeline code I found several places where errors (exceptions) are logged, but do not cause termination of the pipeline:

Default behaviour is to filter (ignore) invalid entities: https://github.com/leifj/pyFF/blob/master/src/pyff/mdrepo.py#L650 This behaviour can already be disabled using filter_invalid.

What I propose is to:

  • Add two options to load to control its error handling: filter_invalid and fail_on_error. This is in addition to the existing validate, max_workers and timeout options.
  • With fail_on_error set to True, the exception in the three cases above is propagated instead of logged.
  • With filter_invalid set to False, an exception is thrown when invalid metadata is found.

From the discussion above I understand that there are other use cases for which this stop behaviour is not desired, so any new options should keep the current error behaviour intact.

So a load with the new error behaviour would look like this; leaving the options out gives the current load behaviour:

- load fail_on_error True filter_invalid False:
   - source1 as ONE
   - source2 as TWO
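In pseudo-Python, the proposed option semantics would look roughly like this (the fetch helpers and entity shape here are placeholders for illustration, not pyff's real internals):

```python
# Sketch of the proposed load options (placeholder helpers, not pyff's
# real internals): fail_on_error re-raises instead of logging and
# continuing; filter_invalid=False turns an invalid entity into an error.
import logging

log = logging.getLogger("load")

def load(sources, fail_on_error=False, filter_invalid=True):
    loaded = []
    for fetch in sources:  # each source is a callable placeholder
        try:
            entity = fetch()
            if entity.get("invalid"):
                if filter_invalid:
                    log.warning("filtered invalid entity %s", entity["id"])
                    continue
                raise ValueError("invalid entity %s" % entity["id"])
            loaded.append(entity)
        except Exception:
            if fail_on_error:
                raise  # abort the pipeline -> non-zero exit code
            log.exception("error loading source, continuing")
    return loaded
```

With both options left at their defaults this behaves like today's load: errors are logged and invalid entities are silently filtered.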

@leifj: I can make a PR for this. Do you see any issues implementing this?

@leifj
Contributor Author

leifj commented Oct 8, 2015

What's wrong with the check approach? I ask because I'm hoping to make load much more asynchronous in the future to support differentiated loading of resources. So... yeah, your approach would work, but the check pipeline would keep working even if load gets completely refactored. On the other hand, maybe your approach is a good complement.

I would need test-cases though

@pmeulen
Contributor

pmeulen commented Oct 8, 2015

The check approach could certainly work. It's more verbose to write, and I'm not sure it is easier to read. I like the idea of checking for the presence of an entity; you could add other assertions as well, like running an xpath and testing its value.

The equivalent of my example above would be:

- load:
   - source1 as ONE
   - source2 as TWO
- check:
   - any in ONE # Check that MD file ONE was loaded
   - any in TWO # Check that MD file TWO was loaded
- check not:
   - any in ERROR # Check for filtered items

Allowing the not to be in front of the expression would simplify things a bit:

...
- check:
   - any in ONE # Check that MD file ONE was loaded
   - any in TWO # Check that MD file TWO was loaded
   - not any in ERROR # Check for filtered items

I think implementing the option approach I suggested can be done in 20-30 lines. Implementing the check approach seems much more involved to me, with the required plumbing, bookkeeping and parsing to make it work. It does offer more than what we need right now, though. So with getting the desired result with the minimum amount of effort in mind, I suggested the option approach, basically building on what was already present in the code.

Another difference between the approaches is that with the option approach you abort at the moment an error occurs, so the problem can typically be found at the end of the log. With the check approach you detect the problem after the fact and generally have less state available.

@leifj
Contributor Author

leifj commented Oct 8, 2015

Right, but what happens if load is just scheduling stuff for a worker/queue type of setup? Then you might not catch an error immediately but only when that worker finishes, which might be before or after something else in the load list.

I don't think I'm against a "fail-on-error" but just want us to be clear what the reasonable expectations are as load evolves over time.

@ghost

ghost commented Oct 8, 2015

Right, but what happens if load is just scheduling stuff for a worker/queue type of setup? Then you might not catch an error immediately but only when that worker finishes, which might be before or after something else in the load list.

Load indeed uses worker threads for (down)loading files (is that what you are referring to?). This parallelism does complicate processing, because more things can be going on at the same time, and that can make finding errors in a log harder. The load implementation can rethrow the exception coming from the thread. That way you still get a nice stack trace out of it, including the part that happened in the thread.

Additionally the number of workers can be set to 1 by adding the max_workers 1 option to load.
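For illustration, this rethrow behaviour is what Python's concurrent.futures gives you: an exception raised in a worker thread is stored on the future and re-raised, traceback included, when result() is called in the main thread. A minimal standalone example (not pyff's code; the fetch function and URLs are made up):

```python
# Exceptions raised inside worker threads are stored on the future and
# re-raised in the calling thread when result() is invoked, so the main
# thread sees the worker's stack trace.
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    if url.startswith("bad"):
        raise IOError("failed to fetch %s" % url)
    return "<EntitiesDescriptor/>"

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {url: pool.submit(fetch, url) for url in ["good://one", "bad://two"]}

results, errors = {}, {}
for url, f in futures.items():
    try:
        results[url] = f.result()  # re-raises the worker's IOError here
    except IOError as e:
        errors[url] = str(e)
```

A fail_on_error implementation would simply not catch the exception at the result() call and let it abort the pipeline.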

I don't think I'm against a "fail-on-error" but just want us to be clear what the reasonable expectations are as load evolves over time.

Well, I don't know what the future is going to bring... Is there additional load functionality that you plan to add soon? A reasonable expectation from pyff users would be that the same pipe (e.g. loading three local MD files from disk) will continue to behave the same way with regard to error handling. New ways of using (new) functionality in pyff may demand other ways of error handling.

@leifj
Contributor Author

leifj commented Oct 9, 2015

I did plan to refactor load a bit, yeah, and add better parallelism than we already have. However, it should be possible to keep the interface.

Specifically, what I'm worried about is this. Let's say we have the following load:

- load:
   - foo
   - bar
   - baz

And bar fails. That would imply that both foo and baz, or neither of them, may get loaded before bar fails. If that is OK, we should be fine with your proposed solution.

@pmeulen
Contributor

pmeulen commented Oct 9, 2015

Yes, I want an error => exit scenario and don't care what has or hasn't been loaded at the point of an error. It's going to exit anyway. This is useful when using pyff as a tool to generate metadata controlled from another script, makefile or whatever.

For the scenario of keeping pyffd running as a daemon to continuously serve fresh metadata, setting the fail_on_error option won't be very useful IMO.

I'll start working on a PR to implement this.

@leifj
Contributor Author

leifj commented Oct 9, 2015

right - okdoki then

pmeulen added a commit to pmeulen/saml-metadata-tools that referenced this issue Jan 15, 2016
…equires recent (unreleased) pyff: IdentityPython/pyFF#62

Allow scons substitutions in pyff builder arguments