Updated check: version number #124

Open
kathryn-ods opened this issue Aug 14, 2024 · 5 comments
@kathryn-ods

kathryn-ods commented Aug 14, 2024

This check currently exists in cove as inconsistent_schema_version_used. It needs to be rewritten to allow for inconsistent minor versions.

Check: all Statements MUST have the same major version number.

On fail:

Error message: Statements have different major version numbers.
Info message: Version number (bodsVersion): [VALUE], Version number (bodsVersion): [VALUE2]
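
A minimal sketch of how the check and its messages could look. This is an illustration, not the cove implementation: the function name `check_major_versions` is hypothetical, the major version is assumed to be the text before the first dot in `bodsVersion`, and statements without a string `bodsVersion` are simply ignored here.

```python
def check_major_versions(statements):
    """Return None on pass, or a dict with the error and info messages on fail.

    Hypothetical helper: assumes each statement is a dict and that the major
    version is the part of bodsVersion before the first ".".
    """
    versions = []  # distinct bodsVersion values, in order of first appearance
    for statement in statements:
        details = statement.get("publicationDetails")
        version = details.get("bodsVersion") if isinstance(details, dict) else None
        if isinstance(version, str) and version not in versions:
            versions.append(version)

    majors = {version.split(".")[0] for version in versions}
    if len(majors) <= 1:
        return None  # differing minor versions are allowed

    info = ", ".join(f"Version number (bodsVersion): {v}" for v in versions)
    return {
        "error": "Statements have different major version numbers.",
        "info": info,
    }
```

With this sketch, a dataset mixing "0.3" and "0.4" passes, while one mixing "0.4" and "1.0" fails and lists both values in the info message.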

@kathryn-ods
Author

@radix0000 does it make sense to implement this test at this point in time? Because data is only invalid if the major versions don't match, one of the invalid test files would need to include statements with e.g. "1.0" and "0.4". As 1.0 doesn't exist yet, would that be flagged for not being a valid BODS version as well as for having inconsistent values?

@kathryn-ods
Author

@kd-ods you might be able to advise on the above now you're back

@kathryn-ods kathryn-ods assigned kathryn-ods and kd-ods and unassigned kathryn-ods Aug 21, 2024
@kd-ods
Collaborator

kd-ods commented Nov 5, 2024

This is a special kind of check, since the outcome relates to how the whole dataset is processed. I think we should hold off implementing this. Pre-v1, things are having to be handled a little differently.

For future reference this is where I think we are and where we are going:

At this point (following the BODS 0.4 release)

When the DRT 'chooses' which version of the schema to validate a dataset against, it looks at the first statement in the dataset and:

  • if it has no publicationDetails.bodsVersion field, validates against BODS 0.1
  • if it has a publicationDetails.bodsVersion field with a valid BODS version, validates against it
  • if it has a publicationDetails.bodsVersion field with an invalid BODS version, validates against the latest version of BODS

@radix0000 - is that right? (We should document exactly what the process is.)
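
For reference, the three rules listed above could be sketched like this. It is only an illustration of the description in this comment, not the DRT code: `choose_schema_version` and `KNOWN_VERSIONS` are hypothetical names, the version list is an assumption, and an empty dataset is treated like a missing field.

```python
# Assumed list of released BODS versions; the last entry is treated as the latest.
KNOWN_VERSIONS = ["0.1", "0.2", "0.3", "0.4"]


def choose_schema_version(statements):
    """Pick the schema version from the first statement only (hypothetical sketch)."""
    first = statements[0] if statements else {}
    details = first.get("publicationDetails") if isinstance(first, dict) else None

    # No publicationDetails.bodsVersion field: validate against BODS 0.1.
    if not isinstance(details, dict) or "bodsVersion" not in details:
        return "0.1"

    version = details["bodsVersion"]
    # Field present with a valid BODS version: validate against it.
    if version in KNOWN_VERSIONS:
        return version

    # Field present but invalid: validate against the latest version.
    return KNOWN_VERSIONS[-1]
```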

After BODS v1

This check, that 'all Statements MUST have the same major version number', is done as part of the initial parsing of the data.

  • It passes if either (a) no statement has a publicationDetails.bodsVersion field or (b) all statements have a publicationDetails.bodsVersion field and all Statements have the same major version number
  • It fails if (c) some statements have a publicationDetails.bodsVersion field and some don't or (d) all statements have a publicationDetails.bodsVersion field but not the same major version number.

On fail: the dataset is not validated and the user gets an informative error message

On pass (case (a)): the dataset is validated against BODS 0.1

On pass (case (b)): the dataset is validated against the latest MINOR.PATCH version release for the given MAJOR version number.
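
A rough sketch of cases (a) to (d) and their outcomes, to make the proposal concrete. Everything here is an assumption for illustration: `plan_validation`, `get_bods_version` and the `latest_by_major` mapping (major version number to latest release for that major version) are hypothetical names, and the handling of an unknown major version is not something decided above.

```python
_MISSING = object()  # sentinel: the statement has no bodsVersion field


def get_bods_version(statement):
    """Return publicationDetails.bodsVersion, or _MISSING if the field is absent."""
    details = statement.get("publicationDetails")
    if isinstance(details, dict) and "bodsVersion" in details:
        return details["bodsVersion"]
    return _MISSING


def plan_validation(statements, latest_by_major):
    """Return ("validate", version) or ("fail", message) for a whole dataset."""
    versions = [get_bods_version(s) for s in statements]
    present = [v for v in versions if v is not _MISSING]

    if not present:
        return ("validate", "0.1")  # case (a): no statement has the field
    if len(present) < len(versions):
        # case (c): some statements have the field and some don't
        return ("fail", "Some statements have a bodsVersion and some do not.")

    # Assumption: the major version is the text before the first ".".
    majors = {str(v).split(".")[0] for v in present}
    if len(majors) > 1:
        # case (d): every statement has the field, but major versions differ
        return ("fail", "Statements have different major version numbers.")

    major = majors.pop()
    if major not in latest_by_major:
        # Not covered by the proposal; failing here is an assumption.
        return ("fail", f"Unknown major version number: {major}")
    return ("validate", latest_by_major[major])  # case (b)
```

Passing `latest_by_major` in as an argument keeps the check itself independent of which BODS releases exist at any point in time.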

Reflections

Having worked through all that... maybe post BODS v1 we should actually do a complete overhaul of the DRT too. We could relegate work so far to a 'beta' version, then clean everything up for a v1 of the DRT. We would then direct pre-BODS v1 users to the beta version of the tool and BODS v1+ users to the new release. That way we don't need to maintain any overly complicated BODS version-handling.

@radix0000
Collaborator

@kd-ods Re the DRT choosing a schema version: it is slightly more complicated than that, because as well as bodsVersion not being present, the cases where bodsVersion isn't a string or isn't in the list of known versions need to be covered. The main tweak I have introduced is that it detects whether the data is record-based (i.e. whether the statement has "recordDetails", "recordId" or "recordType"). If so, it doesn't use BODS 0.1 as the default; instead it uses the latest version (currently 0.4). Having these two categories, record-based and non-record-based, with different defaults for each seems sensible to me given how different they are, but let me know what you think. There is also a question of what the best defaults are (e.g. out of 0.1, 0.2 and 0.3, which is the "most used" version, and should we be using that as the default for non-record-based data?).

@kd-ods
Collaborator

kd-ods commented Nov 8, 2024

Ah, thanks @radix0000. So is this a correct summary of what happens atm?

  1. The entire dataset is validated against a single schema version.

  2. The schema version is selected based on the contents of the first Statement in the array.

  3. If that first statement is 'record-based', the whole dataset is validated against bodsVersion (if it is present and valid). If that field is missing or invalid, validation is against BODS 0.4.

  4. If that first statement is not record-based, the whole dataset is validated against bodsVersion (if it is present and valid). If that field is missing or invalid, validation is against BODS 0.1.

(If so - that looks sensible to me.)
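
The four points above could be sketched roughly as follows. This is an illustration of the summary, not the actual DRT code: the function names and `KNOWN_VERSIONS` are hypothetical, the version list is an assumption, and the dataset is assumed to be a non-empty list of dicts.

```python
# Assumed list of released BODS versions.
KNOWN_VERSIONS = ("0.1", "0.2", "0.3", "0.4")
# Keys that mark a statement as record-based, as described earlier in the thread.
RECORD_KEYS = ("recordDetails", "recordId", "recordType")


def is_record_based(statement):
    """True if the statement has any of the record-related keys."""
    return any(key in statement for key in RECORD_KEYS)


def select_schema_version(statements):
    """Choose one schema version for the whole dataset from its first statement."""
    first = statements[0]  # assumes a non-empty dataset
    details = first.get("publicationDetails")
    version = details.get("bodsVersion") if isinstance(details, dict) else None

    # bodsVersion present and valid (a string in the known list): use it.
    if isinstance(version, str) and version in KNOWN_VERSIONS:
        return version

    # Otherwise fall back to a default that depends on the data model.
    return "0.4" if is_record_based(first) else "0.1"
```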
