On-chain RDF Canonicalization and Schema Validation #222

aaronc · 2021-01-20T15:55:17Z

aaronc
Jan 20, 2021
Maintainer

The most widely accepted method for canonicalizing RDF is URDNA2015 and SHACL is the only official W3C recommendation for RDF schema validation.

I have a few concerns about trying to use these models on-chain mainly related to computational complexity.

Take a look at the golang code for canonicalization: https://github.com/piprate/json-gold/blob/master/ld/api_normalize.go. It's fairly complex, though maybe that's not a deal breaker. This relatively simple RDF takes about 36 microseconds to canonicalize. Either way, we would need to audit the implementation and possibly customize it for gas and/or efficiency.

An alternative would be to restrict graphs to acyclic ones. For instance, a graph like the one above is basically the acyclic JSON structure:

{
  "name": "The Empire State Building",
  "description": "The Empire State Building is a 102-story landmark in New York City.",
  "image": "http://www.civil.usherbrooke.ca/cours/gci215a/empire-state-building.jpg",
  "geo": {
    "latitude": "40.75",
    "longitude": "73.98"
  }
}

It should be possible to verify that an acyclic graph has proper blank node canonicalization in linear time. So I think we'd be talking about an order of magnitude improvement in performance and significantly simpler code. The downsides of this approach are that 1) not all RDF graphs are acyclic (even common graphs like OWL ontologies generally have cycles) and 2) there are no existing client implementations, whereas URDNA2015 isn't super widespread but has some level of adoption.

Regarding schemas, I can get into more details later, but basically the same arguments apply. A validation language that applies to generalized RDF graphs (with cycles) like SHACL will be more computationally complex than a schema for an acyclic graph (such as JSON Schema), which can probably be evaluated in linear time. Again, there is the same issue in that the RDF community is trying to adopt SHACL, and deviating from that would require different client implementations. I believe, however, that implementing all of SHACL on chain would be quite a stretch and that we would likely implement a subset at best, or maybe something similar that covers our use cases.

Thoughts?

clevinson · 2021-01-21T01:17:26Z

clevinson
Jan 21, 2021
Maintainer

Do we have any concrete use cases for cyclic graphs? I admit I don't have a comprehensive understanding of all the ways in which we would be using RDF here (thinking in the context of a practical use case or application), but I struggle to think of use cases that require cyclic graphs for us.

Blockchain is indeed a different context than W3C is normally used to operating in, so I do believe that we probably are operating under a different set of constraints than the standards bodies here.... that much doesn't surprise me.

But I'm also feeling wary of us brewing our own everything... We are already down the path of doing our own protobuf code generation (which I think sounds like the right decision), but I would love for us to be able to actually leverage open source tooling if we're adopting open source standards (such as RDF generally)...

The nice thing about supporting only a subset of RDF, though, is that the implementation of tooling around it can be more customized and better suited to our actual needs (and ideally be smaller and more maintainable as well).

@aaronc can you give some examples of OWL ontologies or other RDF graphs that we would want to store on Regen Ledger that have cycles? What are the end-user tradeoffs of not supporting cyclic graphs? If there aren't any real use cases for cyclic graphs in Regen's use case, then it's really a question of evaluating the engineering costs of writing / maintaining such tooling.

1 reply

aaronc Jan 21, 2021
Maintainer Author

I can't think of use cases. But simply not allowing cycles limits graphs significantly beyond the core RDF model...

Also, I should note that when I say no cycles, I also mean only one node can reference any other node... which is true in simple graphs but is another restriction...
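To make the two restrictions concrete (no cycles, and each node referenced by at most one other node), here is a minimal sketch of how such a check could run in time linear in the number of edges. The `Edge` type and `CheckForest` function are hypothetical names made up for illustration, not part of any existing Regen Ledger code, and literal-valued triples are assumed to have been filtered out beforehand.

```go
package main

import "fmt"

// Edge is a hypothetical, simplified triple whose subject and object are
// both node identifiers. Literal-valued triples are assumed to be excluded.
type Edge struct {
	Subject string
	Object  string
}

// CheckForest returns an error unless the edges form a forest: every node is
// referenced by at most one other node and no node can reach itself.
// Each edge and each node is visited a bounded number of times.
func CheckForest(edges []Edge) error {
	// parent records the single node allowed to reference each object.
	parent := make(map[string]string, len(edges))
	for _, e := range edges {
		if p, ok := parent[e.Object]; ok && p != e.Subject {
			return fmt.Errorf("node %q is referenced by both %q and %q", e.Object, p, e.Subject)
		}
		parent[e.Object] = e.Subject
	}

	const (
		visiting = 1
		done     = 2
	)
	state := make(map[string]int, len(parent))
	for start := range parent {
		// Walk up the parent chain until reaching a root or a settled node.
		var path []string
		n := start
		for state[n] != done {
			if state[n] == visiting {
				return fmt.Errorf("cycle detected through node %q", n)
			}
			state[n] = visiting
			path = append(path, n)
			p, ok := parent[n]
			if !ok {
				break
			}
			n = p
		}
		for _, v := range path {
			state[v] = done
		}
	}
	return nil
}
```

With this sketch, the Empire State Building example passes (the root node references a single `geo` node), while a graph in which two nodes point at the same `geo` node, or in which `geo` points back at the root, is rejected.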

blushi · 2021-01-21T12:20:07Z

blushi
Jan 21, 2021
Maintainer

As @clevinson pointed out, I believe we first need to evaluate whether or not we wanna support cyclic graphs. I'd also be curious to hear about such examples. I'm not feeling super confident in working on our own implementation restricted to acyclic graphs given our resources and other projects at Regen, although I don't have a clear estimation on the work needed in this case. Could you speak a bit more to that?

That being said, SHACL doesn't seem to have an implementation in Go, so we might need to implement our own anyway.

If we go for more computationally complex canonicalization and schema validation, then doing some benchmarking would be useful.
Also, could part of this be handled off-chain?

3 replies

aaronc Jan 21, 2021
Maintainer Author

I can't think of us having use cases for cycles. Maybe @blushi you could post some hypothetical data related to registry projects?

We could compromise and use URDNA2015 for canonicalization and use something with the complexity of JSON Schema instead of SHACL for validation. I'll try to do a few experiments.

aaronc Jan 21, 2021
Maintainer Author

It would be hard to handle this stuff fully off chain and ensure that data on chain is valid. So I think it's about reducing computational complexity to the minimum...

blushi Jan 21, 2021
Maintainer

Maybe @blushi you could post some hypothetical data related to registry projects?

I've been starting to think about this, will post something soon here.

It would be hard to handle this stuff fully off chain and ensure that data on chain is valid. So I think it's about reducing computational complexity to the minimum...

Makes sense.

aaronc · 2021-01-21T15:09:28Z

aaronc
Jan 21, 2021
Maintainer Author

Actually I think it's possible to process a subset of SHACL with reasonable time complexity if everything is indexed in maps the right way...
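As a rough illustration of what "indexed in maps" could mean, the sketch below indexes triples as subject -> predicate -> objects and checks only two SHACL constraints, `sh:minCount` and `sh:maxCount`, for nodes selected in the style of `sh:targetClass`. All type and function names (`Triple`, `NodeShape`, `PropertyShape`, `BuildIndex`, `Validate`) are assumptions made up for this example; it is not the code from any spike and covers only a tiny fraction of SHACL.

```go
package main

import (
	"fmt"
	"sort"
)

const rdfType = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

// Triple is a hypothetical simplified triple with all terms as strings.
type Triple struct{ Subject, Predicate, Object string }

// PropertyShape mirrors a small part of a SHACL property shape:
// sh:path, sh:minCount and sh:maxCount.
type PropertyShape struct {
	Path     string
	MinCount int
	MaxCount int // a negative value means no upper bound
}

// NodeShape applies to every node whose rdf:type is TargetClass.
type NodeShape struct {
	TargetClass string
	Properties  []PropertyShape
}

// Index maps subject -> predicate -> objects, so each constraint check
// is a map lookup instead of a scan over all triples.
type Index map[string]map[string][]string

// BuildIndex makes one pass over the triples.
func BuildIndex(triples []Triple) Index {
	idx := make(Index)
	for _, t := range triples {
		if idx[t.Subject] == nil {
			idx[t.Subject] = make(map[string][]string)
		}
		idx[t.Subject][t.Predicate] = append(idx[t.Subject][t.Predicate], t.Object)
	}
	return idx
}

// Validate returns one message per violated cardinality constraint.
func Validate(idx Index, shape NodeShape) []string {
	var violations []string
	for subject, props := range idx {
		if !contains(props[rdfType], shape.TargetClass) {
			continue
		}
		for _, ps := range shape.Properties {
			n := len(props[ps.Path])
			if n < ps.MinCount {
				violations = append(violations,
					fmt.Sprintf("%s: %s has %d values, need at least %d", subject, ps.Path, n, ps.MinCount))
			}
			if ps.MaxCount >= 0 && n > ps.MaxCount {
				violations = append(violations,
					fmt.Sprintf("%s: %s has %d values, allowed at most %d", subject, ps.Path, n, ps.MaxCount))
			}
		}
	}
	// Go randomizes map iteration order; sorting keeps the result deterministic.
	sort.Strings(violations)
	return violations
}

func contains(values []string, want string) bool {
	for _, v := range values {
		if v == want {
			return true
		}
	}
	return false
}
```

Building the index is a single pass over the triples, and each property constraint then costs one lookup per focus node. The explicit sort matters in a state machine, where nondeterministic output order would be a problem.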

0 replies

aaronc · 2021-01-21T19:29:56Z

aaronc
Jan 21, 2021
Maintainer Author

So one thing I didn't mention here is that I would like to add another (optional) step after canonicalization which is storing the canonicalized quads in a merkle tree (see https://www.notion.so/regennetwork/RDF-Graphs-and-Datasets-0313390483fd41d1aa0c02cb6caacec9#c4ce32ed0a8d451d9e7d1677f21c6b28).

This will allow revealing parts of graphs provably without revealing the full graph (as a privacy use case).

I did some benchmarks comparing the cost of URDNA2015 canonicalization vs building an SMT merkle tree (which we are evaluating as a replacement for IAVL). (Benchmark code here: https://github.com/regen-network/regen-ledger/blob/aaronc/rdf-c14n-benchmarks/x/data/rdf_test.go)

This is a rough test and I think SMT could be optimized more, but basically the cost of building the SMT tree is about 2x the cost of running URDNA2015 in my benchmarks...

What I'm taking from this is that maybe URDNA2015 is fine as is, considering it's cheaper than the other thing I wanted to do when verifying hashes... Also, the first place for optimizing the canonicalization might be a faster hash function rather than an algorithm which restricts graphs for a speed up...

So maybe we more or less try to stick with the standards route (URDNA2015 + a subset of SHACL) and see how that looks in terms of code complexity for reviewing?? I am thinking it would be a custom implementation of both (as I'm not comfortable just including json-gold in the state machine as is since it's not designed with blockchains in mind).

0 replies

aaronc · 2021-01-21T22:49:05Z

aaronc
Jan 21, 2021
Maintainer Author

Did a quick spike on what a SHACL (or partial SHACL) implementation might look like and I think this model is probably feasible and can be made highly concurrent... so I'm less concerned about performance. (Code here if you're curious: https://github.com/regen-network/regen-ledger/compare/aarond/rdf-schema-spike).

0 replies

clevinson · 2021-01-23T00:32:06Z

clevinson
Jan 23, 2021
Maintainer

@aaronc and I just spoke about this today in a bit more detail. Seems like through his findings there aren't that many performance concerns with using the standard algorithms (URDNA2015 + subset of SHACL).

It sounds like it will be a lot easier for us to get started as well by just leveraging these standards, and seeing that the original concern was mostly about computational complexity (which now seems less of an issue), I'm fine with us just proceeding with the standards described here and not having to worry about deviating with our own specific requirements.

@blushi How does that sound?

1 reply

blushi Jan 25, 2021
Maintainer

Starting by working with these standards sounds good to me.

aaronc · 2021-01-25T13:57:30Z

aaronc
Jan 25, 2021
Maintainer Author

So I realize for verifying URDNA2015, it might be reasonable to not actually verify that quads are fully normalized on-chain, just that they are sorted and use proper _:c14n blank node prefixes. Because our hash functions are both pre-image and second pre-image resistant, it would be basically impossible to 1) anchor a hash using a properly normalized URDNA2015 dataset and then 2) construct a different dataset that is sorted, with _:c14n blank nodes, but isn't actually normalized and matches the same hash. Basically it can't happen...

This would mean that someone could store a dataset on-chain that just randomly assigned _:c14n prefixes to blank nodes, but I don't see there being either a big threat or high incentive in doing that.

Does that seem reasonable as a starting point? Then I don't even need to have an audited URDNA2015 implementation, just the SHACL subset code.
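A minimal sketch of the cheap check described above, assuming the dataset arrives as N-Quads text with one quad per line: it only confirms that lines are strictly sorted and that every blank node term is labeled `_:c14n` followed by digits. It deliberately does not run URDNA2015, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// c14nLabel matches the blank node labels URDNA2015 issues (_:c14n0, _:c14n1, ...).
var c14nLabel = regexp.MustCompile(`^_:c14n[0-9]+$`)

// splitTerms splits one N-Quads line on whitespace while keeping quoted
// literals intact, so text such as "_:b0" inside a literal is not
// mistaken for a blank node term.
func splitTerms(line string) []string {
	var terms []string
	var cur strings.Builder
	inQuote, escaped := false, false
	for _, r := range line {
		switch {
		case escaped:
			escaped = false
		case inQuote && r == '\\':
			escaped = true
		case r == '"':
			inQuote = !inQuote
		case !inQuote && (r == ' ' || r == '\t'):
			if cur.Len() > 0 {
				terms = append(terms, cur.String())
				cur.Reset()
			}
			continue
		}
		cur.WriteRune(r)
	}
	if cur.Len() > 0 {
		terms = append(terms, cur.String())
	}
	return terms
}

// CheckLooksCanonical verifies only the two cheap properties: lines are in
// strictly increasing order and blank nodes carry _:c14n labels.
func CheckLooksCanonical(nquads string) error {
	lines := strings.Split(strings.TrimSuffix(nquads, "\n"), "\n")
	for i, line := range lines {
		// Go compares strings bytewise, which for UTF-8 is code point order.
		if i > 0 && lines[i-1] >= line {
			return fmt.Errorf("line %d is out of order or duplicated", i+1)
		}
		for _, term := range splitTerms(line) {
			if strings.HasPrefix(term, "_:") && !c14nLabel.MatchString(term) {
				return fmt.Errorf("line %d: blank node %q lacks a _:c14n label", i+1, term)
			}
		}
	}
	return nil
}
```

As noted above, a dataset with arbitrarily assigned `_:c14n` labels would still pass this check; it only rules out unsorted input and non-canonical label prefixes.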

1 reply

clevinson Jan 27, 2021
Maintainer

Makes sense to me.
