Storing Package schema / metadata and using it to validate new datasets in json #675

pchtsp · 2021-04-29T19:00:28Z


Hello everyone,

(this may be similar to #637).
I can't tell whether the Frictionless format is a good alternative for us, and I hope you can convince me it is. Our data problem is the following:

We have "schemas" and "datasets". We currently use "jsonschema" for our schemas. A dataset is just a bunch of data (in JSON format) that matches one of our pre-defined schemas.
These schemas are really a list of tables with foreign key and primary key relationships, etc., just like a database. Hence our interest in migrating to Frictionless Data structures.

I share a small example in example.zip

In the example I have a jsonschema file (see: instance.json)
and a dataset that complies with it (see: input.json).
I just migrated the jsonschema to Frictionless (see: instance_frictionless.json).

I see in all the examples and reference docs that you always store a reference to the "data file" (.csv, .json) inside the schema/metadata.

We do not have a real coupling between them: we store the schemas (which could be PackageSchemas in frictionless-like terminology) in our database and we want to use the schemas to validate datasets that arrive from many places (an api endpoint in our python backend, in a browser through our javascript front-end, etc.).

I'm not sure how to take a JSON data object (dict-like in Python) and validate it against my Frictionless PackageSchema without having to write it to a file.

I'm looking for something like what we currently do with jsonschema:

from jsonschema import Draft7Validator

def validate_data(data):
    # get the schema from a file or a database ....
    schema = get_schema()
    # load validator with schema:
    validator = Draft7Validator(schema)

    # feed the validator some data and check for errors
    return validator.iter_errors(data)

Is this possible? Or something that's equivalent? Thanks!


lwinfree · 2021-04-30T14:51:27Z


tagging @roll to take a look :-)

@pchtsp I was going to suggest that you look at #637 but you've already seen it! They are definitely related, so we'll try to answer you here, but I might eventually link/deduplicate these into one discussion. Thanks for bringing this up and for the clear explanation!

1 reply

rufuspollock May 3, 2021
Maintainer

@pchtsp - @roll is the expert here and can answer the best. However, my 2c are that:

Yes, you can absolutely have the schema separate from the data and use the schema to validate the data using e.g. the frictionless-py library.

roll · 2021-05-03T16:47:57Z

Maintainer

@pchtsp
Hi, if I got you right, the migration from your JSONSchema approach to a Frictionless approach would be in creating a list of individual Table Schemas (not resources).

An excerpt from https://framework.frictionlessdata.io/docs/guides/describing-data/#describing-a-resource

Using programming terminology we could say that:

Table Schema descriptor is abstract (for a class of files)

Data Resource descriptor is concrete (for an individual file)

So consider you store somewhere:

durations.schema.json
jobs.schema.json
etc

You can then use them like this:

from frictionless import validate

report = validate(input['durations'], schema='durations.schema.json')
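To make the "abstract schema, concrete data" split tangible, here is a toy stand-in written in plain Python. It is not the Frictionless implementation and does not use the library. The helper `check_rows`, the `TYPE_CHECKS` table, and the two sample datasets are all hypothetical, and it assumes each table arrives as a list of dicts. It only shows that one schema descriptor, stored on its own, can be reused against any number of in-memory datasets.

```python
from numbers import Number

# Toy stand-in for a Table Schema check; NOT the Frictionless implementation.
# Only the two field types used in this thread are covered.
TYPE_CHECKS = {
    "number": lambda v: isinstance(v, Number) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def check_rows(rows, schema):
    """Return a list of (row_index, field_name, message) problems."""
    problems = []
    for index, row in enumerate(rows):
        for field in schema["fields"]:
            name = field["name"]
            value = row.get(name)
            required = field.get("constraints", {}).get("required", False)
            if value is None:
                # A missing value is only a problem for required fields.
                if required:
                    problems.append((index, name, "missing required value"))
                continue
            if not TYPE_CHECKS[field["type"]](value):
                problems.append((index, name, f"expected {field['type']}"))
    return problems


# The schema lives on its own, with no reference to any data file.
durations_schema = {
    "fields": [
        {"name": "duration", "type": "number", "constraints": {"required": True}},
        {"name": "job", "type": "number", "constraints": {"required": True}},
        {"name": "mode", "type": "number", "constraints": {"required": True}},
    ]
}

# Two hypothetical datasets checked against the same schema.
good = [{"duration": 3, "job": 1, "mode": 1}]
bad = [{"duration": "long", "job": 1}]

print(check_rows(good, durations_schema))
print(check_rows(bad, durations_schema))
```

In real use the library's validate call above replaces all of this; the sketch is only meant to show why the schema does not need to know where the data comes from.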

Another approach would be to create a data package template (it can be exactly the same as the one you have created, just without path/data):

package.template.json

{
  "name": "DataSchema",
  "resources": [
    {
      "name": "DurationsSchema",
      "schema": {
        "fields": [
          {
            "name": "duration",
            "type": "number",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "job",
            "type": "number",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "mode",
            "type": "number",
            "constraints": {
              "required": true
            }
          }
        ]
      }
    },
    {
      "name": "JobsSchema",
      "schema": {
        "fields": [
          {
            "name": "id",
            "type": "number",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "successors",
            "type": "number",
            "constraints": {
              "required": true
            }
          }
        ]
      }
    },
    {
      "name": "NeedsSchema",
      "schema": {
        "fields": [
          {
            "name": "job",
            "type": "number",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "mode",
            "type": "number",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "need",
            "type": "number",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "resource",
            "type": "string",
            "constraints": {
              "required": true
            }
          }
        ]
      }
    },
    {
      "name": "ResourcesSchema",
      "schema": {
        "fields": [
          {
            "name": "available",
            "type": "number",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "id",
            "type": "string",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "type",
            "type": "string"
          }
        ]
      }
    }
  ]
}

Then you need some code on top of Frictionless to make it work:

from frictionless import Package, validate

def validate_package_using_template(input):
    package = Package('package.template.json')
    for name, data in input.items():
        # Here we link our template's resource with the actual data.
        # Note: each key in `input` must match a resource name in the template.
        package.get_resource(name).data = data
    return validate(package)

validate_package_using_template(input)

PS.
To check whether a metadata object is valid, you can use package/resource/schema.metadata_valid

2 replies

pchtsp May 4, 2021
Author

Thanks @roll! I think the second case is what I was looking for. I'm guessing the first one would not be able to check foreign keys (because I would be checking each table separately). I'll go try it out and come back if I see anything weird. Thanks again!

roll May 4, 2021
Maintainer

Yes, the second one will handle foreign keys 👍
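For readers wondering why the per-table approach cannot see foreign keys, here is a small plain-Python sketch of the idea. It is an illustration only, not how Frictionless implements it. The function `check_foreign_keys`, the dict layout used to describe a key, and the sample rows are all hypothetical (Table Schema has its own `foreignKeys` property for declaring these). The point is that the check needs both tables at once, which is what the package-level validation provides.

```python
def check_foreign_keys(tables, foreign_keys):
    """Return (table, row_index, field, value) for every dangling reference."""
    dangling = []
    for fk in foreign_keys:
        # Collect the values that exist in the referenced table's key field.
        known = {row[fk["ref_field"]] for row in tables[fk["ref_table"]]}
        for index, row in enumerate(tables[fk["table"]]):
            value = row[fk["field"]]
            if value not in known:
                dangling.append((fk["table"], index, fk["field"], value))
    return dangling


# Hypothetical in-memory dataset: table name -> list of rows.
tables = {
    "jobs": [{"id": 1, "successors": 2}, {"id": 2, "successors": 1}],
    "durations": [
        {"duration": 3, "job": 1, "mode": 1},
        {"duration": 5, "job": 7, "mode": 1},  # job 7 does not exist in "jobs"
    ],
}

# Hypothetical key description (not the Table Schema syntax).
foreign_keys = [
    {"table": "durations", "field": "job", "ref_table": "jobs", "ref_field": "id"},
]

print(check_foreign_keys(tables, foreign_keys))
```

Each row of "durations" is valid on its own; the problem only shows up once "jobs" is available to compare against.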
