Measurement of prevalence in the shacl-report #51
Hi @yum-yab. This would be fairly easy to implement, not in pySHACL itself but in a client application. You can compare the validation report to the input data and generate precisely the custom metrics your application requires.
Hi @ashleysommer, thank you for your quick reply. I understand the focus on the standard, but even one of the authors of the SHACL spec implemented such a feature in RDFUnit (unfortunately it has no support for SHACL-AF), so I am not really convinced that the vast majority is uninterested in an error/success rate for the individual tests (which, I agree, they could calculate themselves, if they knew the number of focus nodes that succeeded). Maybe we are missing some "shortcut" or other trick, but it does not seem easy at all to get the number of focus nodes which succeeded for a given test without implementing large parts of the SHACL standard and basically writing another (incomplete) SHACL engine.

What would help would be, e.g., an option to enable a "logging level" which would also include successful focus nodes in the report, or to return some triples (maybe even in a separate file) giving the number of successful focus nodes for every shape/test; for large-scale analysis the latter would be the better option. If you see other easy options to count the number of successful nodes on the client side, please let us know.

P.S.: I could also imagine letting @yum-yab make a PR for this (in case it is possible without major changes to core routines), but we are looking for a sustainable solution, so if you don't see this as a meaningful feature for pySHACL, that would not make sense for us.

P.P.S.: If you are aware of any other free or open SHACL engine which would satisfy this requirement, that would also help us. I appreciate any hints and pointers you can provide.
Hi @JJ-Author. Re-reading the original post, it seems there are two different requests here. One is quite easy, one is very hard.
i) As you alluded to, pySHACL does not keep a record of passed tests. The SHACL specification states that a SHACL engine should apply a constraint test against a collected set of focus nodes and generate a validation report item (i.e. a failure) for every focus node which fails the test. There is no mechanism described for keeping a record of focus nodes which do conform to the shape constraint. Adding this feature would be non-trivial and would affect a large portion of the pySHACL codebase (every constraint-type evaluator method would need new parameters passed in and out, etc.).

ii) A collected set of focus nodes is not necessarily the result of an explicit target declaration such as sh:targetClass. This introduces a lot of requirement solicitation into this feature request. For example, should implicit class target focus nodes count in the pass/fail ratio, even though they were not an explicit target? What about subjects-of and objects-of target-type focus nodes? And that is only discussing Node Shapes. Should Property Shapes be included in the count too? They don't have a targetClass and cannot map cleanly to any class for the kind of metrics required.

There would need to be some important decisions made, and every individual's picture of the ideal feature set would be different. There is no specification for how a feature like this should behave, so it would essentially be a bespoke solution, made up as we go. If this feature were implemented in pySHACL, it would likely generate different metrics output than RDFUnit provides, if that is what it is compared against.

I am currently the sole maintainer of pySHACL, and I don't have much time available to spend on building and maintaining new features. If @yum-yab would like to attempt building the feature, please go ahead, but as stated above, due to the current architecture of pySHACL it would be non-trivial.
Thank you very much @ashleysommer for the explanations. We are sorry for the confusion; the first message was trying to explain what we would like to achieve using one example, but we knew that there are several ways to specify targets and that it would not be straightforward to derive the number of focus nodes passed to a test from the SHACL definition alone. At https://github.com/RDFLib/pySHACL/blob/master/pyshacl/shape.py#L456 (and also at https://github.com/RDFLib/pySHACL/blob/master/pyshacl/shape.py#L470) we could hook in to write how many focus_nodes are passed per constraint (test) to a separate file. This should be consistent for all target types and comparable between different implementations.
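The hook proposed above could be prototyped by wrapping the method that enumerates focus nodes and tallying how many each shape receives. The sketch below uses a stand-in `Shape` class because `pyshacl.shape.Shape.focus_nodes` is an internal, unstable API; the decorator itself is the point, and all names and data here are invented for illustration.

```python
# Sketch of the proposed hook: wrap a focus-node-enumerating method and
# record how many focus nodes each shape receives per call. In pySHACL one
# would wrap pyshacl.shape.Shape.focus_nodes (internal API, may change);
# the Shape class below is a stand-in for demonstration only.
import functools
from collections import Counter

focus_node_counts = Counter()

def count_focus_nodes(method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        nodes = method(self, *args, **kwargs)
        # Tally under the shape's identifier before returning unchanged.
        focus_node_counts[self.name] += len(nodes)
        return nodes
    return wrapper

class Shape:  # stand-in for pyshacl.shape.Shape
    def __init__(self, name, nodes):
        self.name, self._nodes = name, nodes

    @count_focus_nodes
    def focus_nodes(self, data_graph=None):
        return self._nodes

Shape("ex:PersonShape", ["ex:a", "ex:b", "ex:c"]).focus_nodes()
print(dict(focus_node_counts))  # {'ex:PersonShape': 3}
```

The counts could then be serialized to a separate file after validation, as suggested, without touching the report format itself.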
@JJ-Author
Hello,
is there any way to measure the prevalence of executed SHACL tests, such as getting the total number of instances of the sh:targetClass, or a percentage, e.g. 0.95 of the instances of the given sh:targetClass fulfill the restrictions? If not, I think it would be nice to have.
Best Regards