File System
The core repository of the knowledge base is file system based: essentially a collection of thousands of files gathered from various sources (see harvesting). Its purpose is to enable the harvesting, curation, storage, and validation of these resources before they are loaded into analytical engines.
To achieve this, we have designed a file system with strict naming and content management conventions, essentially creating an on-disk database for managing resources. The design principle is that every resource, independently of its type, is packaged in the same way, reflecting many of the principles advocated by FAIR and underlying FAIR Digital Objects.
The knowledge base repository is organized as follows, using OAS as an illustrative example.
kb/ --> the knowledge base file system root
├─ oas/ --> the OpenAPI Specification knowledge base
| ├─ ingestion/ --> area where harvested files are dropped
| | ├─ kin/ --> the collection of files harvested by Kin
| | | ├─ foo.json --> harvested foo.json API specification
| | | ├─ foo.meta.json --> ingestion metadata about foo.json
| | | ├─ ...
| ├─ repository/ --> knowledge base content
| | ├─ kin/ --> the valid and curated files from Kin, each packaged as a resource
| | | ├─ foo.json/ --> the directory holding the knowledge on the foo.json specification
| | | | ├─ data.json --> the resource
| | | | ├─ meta.json --> the resource metadata
| | | | ├─ oas.validation.json --> output from OAS validation (empty if no error)
| | | | ├─ OpenAPI3.class --> empty file that informs on the object class
├─ foo/ --> other knowledge bases (e.g. GraphQL, Spectral, AsyncAPI, ...)
├─ ...
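As a rough illustration of the layout above, the following sketch builds the expected locations of an ingestion collection and of a resource directory. The helper names (ingestion_dir, resource_dir) are hypothetical and are not part of the project's scripts; only the directory names come from the tree.

```python
from pathlib import Path


def ingestion_dir(root: Path, kb: str, collection: str) -> Path:
    """Directory where harvested files for a collection are dropped."""
    return root / kb / "ingestion" / collection


def resource_dir(root: Path, kb: str, collection: str, name: str) -> Path:
    """Dedicated directory holding everything known about one resource."""
    return root / kb / "repository" / collection / name


if __name__ == "__main__":
    root = Path("kb")
    # Location of the clean serialization of the foo.json resource
    print((resource_dir(root, "oas", "kin", "foo.json") / "data.json").as_posix())
    # prints: kb/oas/repository/kin/foo.json/data.json
```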
Each object in the repository lives in a dedicated directory holding all the knowledge about the resource:
- The original data files are stored in the _files directory and documented in the object metadata
- A clean serialization of the data for analysis and indexing is stored in data.json
- The object repository metadata properties are stored in a meta.json file
- Additional metadata files can exist, ending with a .meta.json extension
  - ingest.meta.json: Metadata captured by the harvesters or during ingestion
- Subdirectories can be created by generic and specialized tools for various purposes
  - Directories starting with an underscore _ are reserved for the management of the repository
  - Adding a directory must be discussed and approved by the repository administrator
- For JSON/YAML files, outputs of Spectral reports can be generated in files named as:
  - spectral.<rule-set-identifier-and-version>.json: Output from the Spectral CLI tool for a specific ruleset
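The naming conventions above make it possible to tell what each entry in an object directory is from its name alone. The sketch below is one possible way to do that; the function names and the category labels are invented for illustration, and the rules for .class and .validation.json files are inferred from the directory tree shown earlier rather than from a documented specification.

```python
from pathlib import Path


def classify_entry(entry: Path) -> str:
    """Guess the role of a file or directory from the naming conventions."""
    name = entry.name
    if entry.is_dir():
        # Underscore-prefixed directories are reserved for repository management
        return "reserved-directory" if name.startswith("_") else "tool-directory"
    if name == "data.json":
        return "data"
    if name == "meta.json":
        return "metadata"
    if name.endswith(".meta.json"):
        return "additional-metadata"
    if name.startswith("spectral.") and name.endswith(".json"):
        return "spectral-report"
    if name.endswith(".validation.json"):
        return "validation-report"
    if name.endswith(".class"):
        return "class-marker"
    return "other"


def describe_object(object_dir: Path) -> dict:
    """Map every entry name in an object directory to its guessed role."""
    return {e.name: classify_entry(e) for e in sorted(object_dir.iterdir())}
```

Calling describe_object on a resource directory returns a dictionary such as {"data.json": "data", "meta.json": "metadata", ...}, which can help spot files that do not follow the conventions.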
These files can then be directly analyzed, or loaded/indexed in various databases and search engines. The above is expected to grow as tools create new content around these resources.
If a new version of an object is ingested, the previous knowledge and files are saved in the _version directory (not yet implemented).
Example of meta.json file (2023-01-11):
{
  "class": "OpenAPI3",
  "id": "fd80d4f45a56ac096269b5982d4ff6a1",
  "uri": "postman:opentech:kb:oas:kin:OpenAPI3:fd80d4f45a56ac096269b5982d4ff6a1",
  "schemaVersion": "3.0.1",
  "files": [
    {
      "class": "File",
      "id": "S6HEaZBaGIXeHQoo",
      "uri": "postman:File:S6HEaZBaGIXeHQoo",
      "path": "_files",
      "name": "0lsen-openweathermapyml.json",
      "description": "original data",
      "size": 9780,
      "updated": "2022-11-23T17:23:32",
      "md5": "0c3a60f1095817a46c9d80187948f30c",
      "sha1": "2aa24771e393448414e6ff3bd59c1fd09137456c"
    }
  ]
}
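Because each entry under files records a size and two checksums, the original data can be checked against its metadata. The following is a minimal sketch of such a check, assuming that path is relative to the object directory and that size is a byte count; both are assumptions, and the function names are hypothetical rather than taken from the project's scripts.

```python
import hashlib
import json
from pathlib import Path


def file_digests(path: Path, chunk_size: int = 65536) -> dict:
    """Compute size, MD5, and SHA-1 of a file in a single streaming pass."""
    md5, sha1, size = hashlib.md5(), hashlib.sha1(), 0
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            size += len(chunk)
    return {"size": size, "md5": md5.hexdigest(), "sha1": sha1.hexdigest()}


def verify_files(object_dir: Path) -> list:
    """Return a list of mismatch messages (empty when everything agrees)."""
    meta = json.loads((object_dir / "meta.json").read_text(encoding="utf-8"))
    mismatches = []
    for entry in meta.get("files", []):
        # Assumption: "path" is relative to the object directory
        actual = file_digests(object_dir / entry["path"] / entry["name"])
        for key in ("size", "md5", "sha1"):
            if key in entry and entry[key] != actual[key]:
                mismatches.append(f"{entry['name']}: {key} mismatch")
    return mismatches
```

An empty list from verify_files(object_dir) indicates that every recorded file matches its metadata.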
Python-based ingestion and processing scripts are orchestrated to pull/push content from S3 as needed, parse the incoming files to determine whether they are valid, transfer them to a dedicated directory in the repository, validate, and index.
For OAS, the following is currently being performed:
- ingestion script: parses all .json and .yaml files in the ingestion collection directory, inspects the content to determine whether it is a swagger or openapi specification, and transfers it to a dedicated directory in the repository. The script only processes files that have changed.
- validation script: opens all updated data.json files in the repository, detects the OAS version, and runs schema validation. Creates an oas.validation.json file holding errors if the file is invalid. The isValid property is set in the meta.json file accordingly. The script only processes files that have changed.
- SQL indexing: loads the content of all resources in the repository into our PostgreSQL database tables. The data and meta files are stored in JSON fields so they can be queried accordingly in the database and front-ending APIs.
- Files harvested for the repository can be accompanied by a sister file providing metadata about the resource (provenance, timestamps, and other characteristics)
- The metadata file must have the same name as the base file with a .meta.json or .meta.yaml extension. For example, if you have a file called foo.json, you would store additional information about the file in foo.meta.json
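The ingestion conventions above can be sketched in a few lines of Python. This is an illustration only, not the project's ingestion script: the function names are hypothetical, only .json files are parsed (YAML would need a third-party parser), and the detection rule assumes that a specification is recognized by its top-level openapi or swagger field, which is how OpenAPI 3.x and Swagger 2.0 documents declare their version.

```python
import json
from pathlib import Path
from typing import Optional

META_SUFFIXES = (".meta.json", ".meta.yaml")


def find_sidecar(data_file: Path) -> Optional[Path]:
    """Return the sister metadata file for a data file, if one exists."""
    for suffix in META_SUFFIXES:
        # foo.json -> foo.meta.json or foo.meta.yaml
        candidate = data_file.with_name(data_file.stem + suffix)
        if candidate.exists():
            return candidate
    return None


def detect_spec_kind(content) -> Optional[str]:
    """Return "openapi" or "swagger" based on the top-level version field."""
    if not isinstance(content, dict):
        return None
    for key in ("openapi", "swagger"):
        if key in content:
            return key
    return None


def scan_ingestion(collection_dir: Path) -> list:
    """List (file name, detected kind, sidecar path) for each JSON data file."""
    results = []
    for path in sorted(collection_dir.iterdir()):
        # Skip directories, metadata sidecars, and non-JSON files
        if not path.is_file() or path.name.endswith(META_SUFFIXES):
            continue
        if path.suffix != ".json":
            continue
        try:
            kind = detect_spec_kind(json.loads(path.read_text(encoding="utf-8")))
        except json.JSONDecodeError:
            kind = None  # unparsable files are reported without a kind
        results.append((path.name, kind, find_sidecar(path)))
    return results
```

Files for which the detected kind is None would not be transferred to the repository under this sketch; how the actual script handles them is not described here.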