Ground truth for proto specification docs #111

westonpace · 2021-12-08T19:52:43Z

westonpace
Dec 8, 2021
Maintainer

This discussion was inspired by this comment: #109 (comment)

The question, if I understand it correctly, is whether documentation should go in the markdown files or in the proto files themselves. We currently seem to be split on this. In many places the .proto files are not documented at all, or their documentation is much sparser than what's in the markdown. On the other hand, plan.proto is more thoroughly documented and there is no corresponding markdown document.

What should the guideline be going forward? Is there a way we can reuse what we put in the proto files so we don't have to type everything twice?

cpcloud · 2021-12-08T23:54:40Z

cpcloud
Dec 8, 2021
Maintainer

I am in favor of keeping as much documentation as close to the code as possible; otherwise the documentation goes stale very fast. I would like to be able to understand the proto files in a self-contained way (referencing other proto files is fine). I don't want to hunt around a bunch of markdown looking for the relevant prose. It's just too hard to search without enforcing some specific structure on the prose, and if we're going to do that we should just put it in the protobufs.

0 replies

jacques-n · 2021-12-10T01:12:38Z

jacques-n
Dec 10, 2021
Maintainer

The intention of the project was to have two representations of data: human readable and proto. As part of that, the goal was to have a spec that the proto implements.

Implementation requires additional definition that the spec can only define abstractly. For example, the spec doesn't define how to represent date literals. It only defines the range of allowed values and their meaning, not a specific representation. In proto, this may be an integer of days since epoch. In a human-readable representation, this may be an ISO date string.
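To make the date example concrete, here is a minimal Python sketch of one abstract value with two representations. The function names are hypothetical, and the 1970-01-01 epoch is an assumption (the comment above only says "days since epoch"); neither is taken from the Substrait spec.

```python
from datetime import date, timedelta

# Assumed epoch; the discussion only says "days since epoch".
EPOCH = date(1970, 1, 1)


def iso_to_days(iso: str) -> int:
    """Convert an ISO date string (text form) to days since the epoch (binary form)."""
    return (date.fromisoformat(iso) - EPOCH).days


def days_to_iso(days: int) -> str:
    """Convert days since the epoch (binary form) back to an ISO date string."""
    return (EPOCH + timedelta(days=days)).isoformat()
```

Both functions describe the same abstract date; only the serialization differs, which is the distinction between spec and representation being drawn here.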

I think that generic things should be defined in the spec. Serialization-specific details should be defined in the serialization-specific area of the spec or in the serialization IDL (e.g. .proto files). In some cases, it is a good idea to embed high-level concepts in the spec along with embedded proto such as here. I also think that a way to resolve some of this is to have more proto embedded in the spec in other places like here. The vision was to have all of the spec contain the corresponding representations embedded. By doing so, we can probably be less verbose in the spec and leverage content in the embedded IDL boxes.

I appreciate the desire to make the proto "self-contained" but I think we should also avoid inverting the relationship between the spec and a specific representation. To me, the spec (markdown) comes first, the representation details come second. That being said, we clearly need to do a better job of keeping both up to date or it will be very hard for new users to adopt Substrait. We should review each patch with this in mind.

I also think that most people right now are thinking of the proto as the spec, as opposed to an implementation of the spec. That wasn't the initial intention. We can, of course, also reevaluate the initial intention.

4 replies

cpcloud Dec 10, 2021
Maintainer

I think the responses are misinterpreting what I am trying to convey. If we were primarily developing against a textual representation, I would say the same thing: whatever is being actively used needs to be mapped back to the spec in some way. Moreover, each concrete object needs a description that elucidates what its purpose is and how it's intended to be used (to the extent possible).

cpcloud Dec 10, 2021
Maintainer

In practice, keeping repeated information content up to date without automation is a fruitless endeavor, which is where my desire to keep things in one place comes from.

westonpace Dec 11, 2021
Maintainer Author

> In some cases, it is a good idea to embed high-level concepts in the spec along with embedded proto such as here.

What's the guideline on this? Right now all embedded protobufs are in the "binary serialization" section which makes some sense. Can we embed protos in the more spec-like markdowns? If we have multiple serialization formats we can always figure out how to make a drop down or something the user can use to select which format they see.

jacques-n Dec 14, 2021
Maintainer

It was my intention to have them embedded throughout. Just hadn't gotten to it. We actually modeled it for three tabs: binary, text and example. (note the tabs here: https://substrait.io/expressions/embedded_functions/)

rdblue · 2021-12-10T23:15:52Z

rdblue
Dec 10, 2021
Collaborator

I agree with @jacques-n. I think that there isn't really an option other than to have the markdown spec be the source of truth. There are a few good reasons for that:

If other serialization formats are supported, we want the spec to be independent
Specs embedded within code are disjointed and awkward to read because they're organized for the code, not for understanding
Specs in code are limited. For example, how do you use tables? I'm guessing it's a bit more difficult than you'd want

Going a bit further, I also don't think that it is a good idea to couple protobuf and substrait too closely. It's really important to have a reference implementation, prove out ideas with real code, and make sure the spec can be implemented cleanly. But the idea for this project goes beyond a single serialization format. In other words, this is not a set of protobuf objects, it is a way to exchange plans. There are probably things that require extra specification that isn't enforced by protobuf and it's good to have times when you're not thinking in terms of the protobuf, but in terms of the spec. (Hopefully that made sense.)

4 replies

cpcloud Dec 10, 2021
Maintainer

I think this is a nice ideal, but in practice there needs to be a way to develop against the actual code without needing to do a complex mapping from prose to code, especially since that mapping is the code.

> Going a bit further, I also don't think that it is a good idea to couple protobuf and substrait too closely. It's really important to have a reference implementation, prove out ideas with real code, and make sure the spec can be implemented cleanly. But the idea for this project goes beyond a single serialization format. In other words, this is not a set of protobuf objects, it is a way to exchange plans. There are probably things that require extra specification that isn't enforced by protobuf and it's good to have times when you're not thinking in terms of the protobuf, but in terms of the spec. (Hopefully that made sense.)

I think this makes sense, but it doesn't really reflect the day to day of building against the spec. The fact is, it's not reasonable to expect people to have to do the mapping from spec to proto, precisely because that's what the protos are: a mapping from the abstract spec to a real thing.

I personally would be happy with a requirement that all new (outer-most) protobufs must be documented and linked to the part of the spec they are related to.
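A requirement like this could be checked mechanically. The following is a rough Python sketch, not an existing Substrait tool: the function name is hypothetical, and it uses a line-based heuristic (a top-level `message` counts as documented if the nearest non-blank line above it is a `//` comment) rather than a real protobuf parser.

```python
import re


def undocumented_top_level_messages(proto_source: str) -> list[str]:
    """Return names of top-level messages not preceded by a // comment line."""
    missing = []
    depth = 0   # current brace nesting; 0 means top level
    prev = ""   # nearest non-blank line seen so far
    for raw in proto_source.splitlines():
        line = raw.strip()
        match = re.match(r"message\s+(\w+)", line)
        if depth == 0 and match and not prev.startswith("//"):
            missing.append(match.group(1))
        # Ignore braces that appear inside trailing // comments.
        code = line.split("//", 1)[0]
        depth += code.count("{") - code.count("}")
        if line:
            prev = line
    return missing
```

The check only looks for the presence of a comment. Verifying that the comment also links to the spec would depend on whichever link format is agreed on.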

westonpace Dec 11, 2021
Maintainer Author

The names of the protobuf elements should make it fairly obvious which part of the spec we are talking about, so I'm not sure what you mean by "and linked to the part of the spec". Where that isn't the case, I agree a link to some portion of the spec would be nice. If we do want links everywhere, then we should agree on some format:

// See $1.2.4
message SetRel {

or...

// See relations::logical_relations::set-operation

or...

// See https://substrait.io/relations/logical_relations/#set-operation
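As an illustration of how the second style relates to the third, here is a small Python sketch that expands a path-style reference into a full URL. The function name and `BASE_URL` constant are hypothetical, and the mapping rule (page segments joined by `/`, last segment used as the anchor) is inferred only from the two example comments above.

```python
import re

# Assumed site root, taken from the URL-style example above.
BASE_URL = "https://substrait.io"


def spec_ref_to_url(comment: str):
    """Map '// See a::b::anchor' to a spec URL; return None for any other form."""
    m = re.fullmatch(r"//\s*See\s+((?:[\w-]+::)+)([\w-]+)", comment.strip())
    if m is None:
        return None
    # Everything before the last '::' names the page; the last segment is the anchor.
    pages = m.group(1).rstrip(":").split("::")
    return BASE_URL + "/" + "/".join(pages) + "/#" + m.group(2)
```

With a rule like this, the shorter path-style comment could be kept in the proto files while still resolving to a clickable link in generated documentation.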

cpcloud Dec 11, 2021
Maintainer

I would like to be able to work with the serialization format without having to consult two different locations (the spec and the serialization format).

Links would be needed for places where there isn't an exact linguistic match between spec and format. I don't know how often this happens but as it stands I think it's too difficult to understand the mapping because it's not explicitly written anywhere that X message maps to Y spec thing (if such a mapping exists).

The meaning of the messages, especially those that model more complex things like advanced extensions, absolutely needs documentation alongside the code (in the code itself), or else it will be extremely difficult for newcomers to work with the project.

If we were using, say, JSON as a text serialization format, we'd put this information in the appropriate JSON schema file.

rdblue Dec 12, 2021
Collaborator

The extra work required is kind of the point, and the project supplies the protobuf code so that it is done correctly and is kept in sync. I don't think anyone is advocating for having people implement the spec with their own proto definitions. What about a JSON representation of the spec? If there were only a JSON schema, would you think it is reasonable for that to be the canonical spec for the format? Wouldn't you have concerns about translating to other serialization schemes?

I get that it is something that has to be kept in sync, but it doesn't seem like producing protos and producing a spec are the same thing and it's best not to conflate them.
