
File System

Pascal Heus edited this page Aug 25, 2023 · 5 revisions

The core repository of the knowledge base is file-system based: essentially a collection of thousands of files gathered from various sources (see harvesting). Its purpose is to enable the harvesting, curation, storage, and validation of these resources before they are loaded into analytical engines.

To achieve this, we have designed a file system with strict naming and content management conventions, essentially creating an on-disk database for managing resources. The design principle is that every resource, regardless of its type, is packaged in the same way, reflecting many of the principles advocated by FAIR and underlying FAIR Digital Objects.

The knowledge base repository is organized as follows, using OAS as an illustrative example.

kb/ --> the knowledge base file system root
├─ _ingestion/ --> area where harvested files are dropped
|  ├─ kin/ --> the collection of files harvested by Kin
|  |  ├─ foo.json --> harvested foo.json API specification
|  |  ├─ foo.meta.json --> ingestion metadata about foo.json
|  |  ├─ ...
├─ oas/ --> the OpenAPI Specification knowledge base
|  ├─ repository/ --> knowledge base content
|  |  ├─ kin/ --> the valid and curated files from Kin, packaged as resources
|  |  |  ├─ foo.json/ --> The directory holds the knowledge on the foo.json specification
|  |  |  |  ├─ data.json --> the resource
|  |  |  |  ├─ meta.json --> the resource metadata
|  |  |  |  ├─ oas.validation.json --> output from OAS validation (empty if no error)
|  |  |  |  ├─ OpenAPI3.class --> empty file that informs on the object class
├─ foo/ --> Other knowledge base (e.g. GraphQL, spectral, AsyncAPI, ...)
├─ ...

Directory Encapsulation

Each object in the repository lives in a dedicated directory holding all the knowledge about the resource:

  • The original data files are stored in the _files directory and documented in the object metadata
  • A clean serialization of the data for analysis and indexing is stored in data.json
  • The object repository metadata properties are stored in a meta.json file
  • Additional metadata files can exist, with names ending in a .meta.json extension
    • ingest.meta.json: Metadata captured by the harvesters or during ingestion
  • Sub-directories can be created by generic and specialized tools for various purposes.
    • Directories starting with an underscore _ are reserved for the management of the repository
    • Adding a directory must be discussed with and approved by the repository administrator
  • For JSON/YAML files, Spectral reports can be written to files named as follows:
    • spectral.<rule-set-identifier-and-version>.json: Output from the Spectral CLI tool for a specific ruleset

These files can then be analyzed directly, or loaded and indexed in various databases and search engines. The list above is expected to grow as tools create new content around these resources.
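As an illustration of how these conventions can be consumed, the following is a minimal sketch that summarizes object directories under a collection. The function names `describe_object` and `iter_objects` are hypothetical and not part of the repository tooling; the sketch only assumes the file names listed above (data.json, meta.json, *.meta.json, spectral.*.json) and the empty `<Class>.class` marker file shown in the directory tree.

```python
import json
from pathlib import Path


def describe_object(object_dir: Path) -> dict:
    """Summarize one object directory using the naming conventions above."""
    # An empty "<ClassName>.class" marker file names the object class.
    markers = sorted(object_dir.glob("*.class"))
    meta_path = object_dir / "meta.json"
    meta = {}
    if meta_path.is_file():
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return {
        "name": object_dir.name,
        # Prefer the marker file; fall back to the "class" property in meta.json.
        "class": markers[0].stem if markers else meta.get("class"),
        "has_data": (object_dir / "data.json").is_file(),
        # Additional metadata files, e.g. ingest.meta.json.
        "extra_metadata": sorted(p.name for p in object_dir.glob("*.meta.json")),
        # Spectral reports, one per ruleset.
        "reports": sorted(p.name for p in object_dir.glob("spectral.*.json")),
    }


def iter_objects(collection_dir: Path):
    """Yield a summary for each object directory in a collection."""
    for child in sorted(collection_dir.iterdir()):
        # Directories starting with "_" are reserved for repository management.
        if child.is_dir() and not child.name.startswith("_"):
            yield describe_object(child)
```

A caller might run `list(iter_objects(Path("kb/oas/repository/kin")))` to get one summary per resource before deciding what to load or index.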

If a new version of an object is ingested, the previous knowledge and files are saved in the _version directory (not yet implemented).

Example of meta.json file (2023-01-11)

{
    "class": "OpenAPI3",
    "id": "fd80d4f45a56ac096269b5982d4ff6a1",
    "uri": "postman:opentech:kb:oas:kin:OpenAPI3:fd80d4f45a56ac096269b5982d4ff6a1",
    "schemaVersion": "3.0.1",
    "files": [
        {
            "class": "File",
            "id": "S6HEaZBaGIXeHQoo",
            "uri": "postman:File:S6HEaZBaGIXeHQoo",
            "path": "_files",
            "name": "0lsen-openweathermapyml.json",
            "description": "original data",
            "size": 9780,
            "updated": "2022-11-23T17:23:32",
            "md5": "0c3a60f1095817a46c9d80187948f30c",
            "sha1": "2aa24771e393448414e6ff3bd59c1fd09137456c"
        }
    ]
}
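The file entry in the example above can be derived largely from the file itself. The following is a minimal sketch, assuming the `size`, `md5`, and `sha1` properties describe the raw bytes of the original file and that `updated` is the file modification time without a timezone (the source does not say whether it is local time or UTC). The function name `describe_file` is hypothetical, and the `id` and `uri` properties are left out because their generation scheme is not documented here.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def describe_file(path: Path, description: str = "original data") -> dict:
    """Build a partial "files" entry for meta.json from a file on disk."""
    raw = path.read_bytes()
    # Assumption: "updated" is the modification time, truncated to seconds.
    updated = datetime.fromtimestamp(path.stat().st_mtime).replace(microsecond=0)
    return {
        "class": "File",
        # Name of the containing directory, e.g. "_files".
        "path": path.parent.name,
        "name": path.name,
        "description": description,
        "size": len(raw),
        "updated": updated.isoformat(),
        "md5": hashlib.md5(raw).hexdigest(),
        "sha1": hashlib.sha1(raw).hexdigest(),
    }
```

Storing both checksums alongside the size lets later steps detect whether an original file has changed without re-parsing it.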

Ingestion and Indexing

Python-based ingestion and processing scripts are orchestrated to pull/push content from S3 as needed, parse the incoming files to determine whether they are valid, transfer them to a dedicated directory in the repository, validate them, and index them.

For OAS, the following is currently being performed:

  • ingestion script: parses all .json and .yaml files in the ingestion collection directory, inspects the content to determine whether it is a Swagger or OpenAPI specification, and transfers it to a dedicated directory in the repository. The script only processes files that have changed.
  • validation script: opens all updated data.json files in the repository, detects the OAS version, and runs schema validation. Creates an oas.validation.json file holding the errors if the file is invalid. The isValid property is set in the meta.json file accordingly. The script only processes files that have changed.
  • SQL indexing: loads the content of all resources in the repository into our PostgreSQL database tables. The data and meta files are stored in JSON fields, so they can be queried in the database and through the front-end APIs.
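The content inspection performed by the ingestion script can be sketched as follows. This is not the actual script, only an illustration based on the specifications themselves: OpenAPI 3.x documents carry a top-level `openapi` version string, and Swagger 2.0 documents carry a top-level `swagger` field. The function name `detect_oas` and the `Swagger2` class label are hypothetical; `OpenAPI3` is the class name used in the meta.json example.

```python
from typing import Optional, Tuple


def detect_oas(document: object) -> Optional[Tuple[str, str]]:
    """Return (class, schema version) for a parsed JSON/YAML document, or None."""
    if not isinstance(document, dict):
        return None
    openapi = document.get("openapi")
    if isinstance(openapi, str) and openapi.startswith("3."):
        return ("OpenAPI3", openapi)
    # YAML parsers may read an unquoted 2.0 as a float, so normalize to text.
    swagger = str(document.get("swagger", ""))
    if swagger.startswith("2."):
        return ("Swagger2", swagger)  # hypothetical class label
    return None
```

Files for which the function returns `None` would simply not be transferred to the repository.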

Ingestion Metadata

  • Files harvested for the repository can be accompanied by a sister file providing metadata about the resource (provenance, timestamps, and other characteristics)
  • The metadata file must have the same name as the base file with a .meta.json or .meta.yaml extension. For example, if you have a file called foo.json, you would store additional information about the file in foo.meta.json
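The naming rule above can be sketched as a small lookup. This assumes the sister file sits in the same directory as the base file and that its name is the base file name without its last extension, followed by .meta.json or .meta.yaml (as in the foo.json example). Both function names are hypothetical.

```python
from pathlib import Path
from typing import Optional

METADATA_SUFFIXES = (".meta.json", ".meta.yaml")


def is_metadata_file(path: Path) -> bool:
    """True if the file is itself an ingestion metadata file."""
    return path.name.endswith(METADATA_SUFFIXES)


def find_sister_metadata(base_file: Path) -> Optional[Path]:
    """Return the metadata file accompanying base_file, if one exists."""
    if is_metadata_file(base_file):
        return None
    for suffix in METADATA_SUFFIXES:
        # foo.json -> foo.meta.json, then foo.meta.yaml
        candidate = base_file.with_name(base_file.stem + suffix)
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_sister_metadata(Path("kb/_ingestion/kin/foo.json"))` would return the path to foo.meta.json when that file is present, and `None` otherwise.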