Schema representation - 'importing' types #64

Open
chupaty opened this issue Dec 2, 2024 · 2 comments

chupaty commented Dec 2, 2024

The specification states the following:

A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used (“before” in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come “before” the messages attribute.)

This means that the following schema is valid (the A type is defined in b_field_one and is only referenced by name, not redefined, in b_field_two):

{
  "name":"B",
  "type":"record",
  "fields":[
    {
      "name":"b_field_one",
      "type":{"name":"A","type":"record","fields":[]}
    },
    {
      "name":"b_field_two",
      "type":"A"
    }
  ]
}
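The two rules quoted from the specification can be sketched as a single depth-first, left-to-right walk that tracks which names have been defined so far. This is only an illustration: `Node`, `check`, `record` and `reference` are hypothetical names, not part of this crate, and the model ignores namespaces, so a plain name stands in for a fullname.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a schema tree (not the crate's `Schema` type).
enum Node {
    /// A record definition: a name plus (field name, field type) pairs.
    Record { name: String, fields: Vec<(String, Node)> },
    /// A reference to a named type, e.g. `"type": "A"`.
    Ref(String),
}

/// Walks the tree depth-first, left to right, enforcing "defined once"
/// and "defined before use".
fn check(node: &Node, defined: &mut HashSet<String>) -> Result<(), String> {
    match node {
        Node::Ref(name) => {
            if defined.contains(name) {
                Ok(())
            } else {
                Err(format!("`{}` is used before it is defined", name))
            }
        }
        Node::Record { name, fields } => {
            // `insert` returns false if the name was already present.
            if !defined.insert(name.clone()) {
                return Err(format!("`{}` is defined more than once", name));
            }
            for (_, ty) in fields {
                check(ty, defined)?;
            }
            Ok(())
        }
    }
}

fn record(name: &str, fields: Vec<(&str, Node)>) -> Node {
    Node::Record {
        name: name.to_string(),
        fields: fields.into_iter().map(|(n, t)| (n.to_string(), t)).collect(),
    }
}

fn reference(name: &str) -> Node {
    Node::Ref(name.to_string())
}

fn main() {
    // B defines A inline in its first field and refers to it by name in the second.
    let valid = record("B", vec![
        ("b_field_one", record("A", vec![])),
        ("b_field_two", reference("A")),
    ]);
    // The same fields in the opposite order: A is used before it is defined.
    let reversed = record("B", vec![
        ("b_field_one", reference("A")),
        ("b_field_two", record("A", vec![])),
    ]);
    println!("{:?}", check(&valid, &mut HashSet::new()));
    println!("{:?}", check(&reversed, &mut HashSet::new()));
}
```

In this model the schema above passes, while swapping the two fields or defining A twice is rejected.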

This crate makes it easy to load schemata as a sequence of schemas (Schema::parse_list), where:

  • The ordering rule is relaxed (no need to load the schemata in a particular order)
  • The defined-once rule is applied to the whole sequence, instead of per schema

However, I can't find a way to output an individual schema from the schemata in a 'complete' form (i.e. one that does not depend on other schemas and otherwise complies with the rules above). For example:

    let schema_str_1 = r#"{
        "name": "A",
        "doc": "A's schema",
        "type": "record",
        "fields": [
        ]
    }"#;
    let schema_str_2 = r#"{
        "name": "B",
        "doc": "B's schema",
        "type": "record",
        "fields": [
            {"name": "b_field_one", "type": "A"},
            {"name": "b_field_two", "type": "A"}
        ]
    }"#;
    let schema_strs = [schema_str_1, schema_str_2];
    let schemata = Schema::parse_list(&schema_strs)?;

    for schema in schemata {
        println!("{}", schema.canonical_form());
    }

This gives the following canonical forms. The one for A looks fine to me, but the one for B is problematic because it contains no definition of A:

{"name":"A","type":"record","fields":[]}
{"name":"B","type":"record","fields":[{"name":"b_field_one","type":"A"},{"name":"b_field_two","type":"A"}]}

I believe the canonical form for B should actually be:

{"name":"B","type":"record","fields":[{"name":"b_field_one","type":{"name":"A","type":"record","fields":[]}},{"name":"b_field_two","type":"A"}]}
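A minimal sketch of how such a self-contained form could be produced: write a referenced record out in full the first time it is reached and by name afterwards. The `Record` struct and the `render` and `schemata` functions are hypothetical and deliberately simplified (records only, no namespaces, docs or aliases); they are not this crate's API and not necessarily the approach a fix would take.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified record model (not the crate's `Schema` type).
/// Each field's type is a name: either another record or a primitive.
struct Record {
    fields: Vec<(String, String)>, // (field name, type name)
}

/// Renders `name` in canonical-form style. A known record is written out in
/// full the first time it is reached and by name afterwards; names that are
/// not in `all` (primitives) are always written by name.
fn render(name: &str, all: &HashMap<String, Record>, defined: &mut HashSet<String>) -> String {
    if let Some(rec) = all.get(name) {
        // `insert` returns true only the first time a name is seen.
        if defined.insert(name.to_string()) {
            let mut fields = Vec::new();
            for (field, ty) in &rec.fields {
                let rendered = render(ty, all, defined);
                fields.push(format!("{{\"name\":\"{}\",\"type\":{}}}", field, rendered));
            }
            return format!(
                "{{\"name\":\"{}\",\"type\":\"record\",\"fields\":[{}]}}",
                name,
                fields.join(",")
            );
        }
    }
    format!("\"{}\"", name)
}

/// Builds the name -> record map from (record name, [(field name, type name)]) pairs.
fn schemata(defs: Vec<(&str, Vec<(&str, &str)>)>) -> HashMap<String, Record> {
    defs.into_iter()
        .map(|(name, fields)| {
            let fields = fields
                .into_iter()
                .map(|(f, t)| (f.to_string(), t.to_string()))
                .collect();
            (name.to_string(), Record { fields })
        })
        .collect()
}

fn main() {
    let all = schemata(vec![
        ("A", vec![]),
        ("B", vec![("b_field_one", "A"), ("b_field_two", "A")]),
    ]);
    // Each top-level schema starts from an empty "defined" set, so its output
    // does not rely on definitions emitted for any other schema.
    println!("{}", render("B", &all, &mut HashSet::new()));
}
```

For the A and B schemas from the example, this prints the form proposed above for B. Because `render` recurses into referenced records, a chain of dependencies is inlined the same way.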

Do we have a way of producing this 'correct' form? This is necessary for fingerprint calculation and some schema registry interactions.
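To illustrate why this matters for fingerprints: a fingerprint is computed over the bytes of the canonical form, so the two forms of B hash differently, and only the self-contained one changes when A's definition changes. Below is a sketch of the 64-bit Rabin fingerprint (CRC-64-AVRO) as described in the Avro specification; `fingerprint64` is a name chosen for this example rather than an API of this crate.

```rust
/// Fingerprint of the empty input, which also serves as the CRC polynomial
/// constant in the Avro specification's CRC-64-AVRO algorithm.
const EMPTY: u64 = 0xc15d213aa4d7a795;

/// Computes the CRC-64-AVRO fingerprint of `data`.
/// (The lookup table is rebuilt on every call to keep the sketch short.)
fn fingerprint64(data: &[u8]) -> u64 {
    let mut table = [0u64; 256];
    for i in 0..256u64 {
        let mut fp = i;
        for _ in 0..8 {
            // XOR in EMPTY only when the lowest bit is set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        table[i as usize] = fp;
    }
    let mut fp = EMPTY;
    for &byte in data {
        fp = (fp >> 8) ^ table[((fp ^ byte as u64) & 0xff) as usize];
    }
    fp
}

fn main() {
    // Canonical form of B as currently produced (A only referenced by name).
    let by_reference = r#"{"name":"B","type":"record","fields":[{"name":"b_field_one","type":"A"},{"name":"b_field_two","type":"A"}]}"#;
    // Self-contained canonical form of B (A defined at its first use).
    let self_contained = r#"{"name":"B","type":"record","fields":[{"name":"b_field_one","type":{"name":"A","type":"record","fields":[]}},{"name":"b_field_two","type":"A"}]}"#;
    println!("{:016x}", fingerprint64(by_reference.as_bytes()));
    println!("{:016x}", fingerprint64(self_contained.as_bytes()));
}
```

A registry that stores B on its own and identifies it by fingerprint would therefore see a different identity depending on which form is used.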

chupaty commented Dec 3, 2024

I've raised a PR to provide this here: #66

chupaty commented Dec 4, 2024

Sorry about the noise. I've added a fix for nested Refs (where a schema depends on another schema that in turn depends on a third) and re-opened the PR. I've also added a test for this.

The goal here is ultimately to support interop with schema registries where each schema is stored independently of other schemata in the registry.

I haven't changed the current functionality (Schema::canonical_form()), except that you'll notice a change in two test cases:

  • test_avro_3370_record_schema_with_currently_parsing_schema_named_enum()
  • test_avro_3370_record_schema_with_currently_parsing_schema_named_fixed()

In both cases, the 'expected' canonical form was (I believe) incorrect, as it included two (duplicate) definitions of the same type.
