Releases: odissei-data/ingestion-workflow-orchestrator
V2.3.1 release
Updates
First .1 release for v2
- Included Postgres as DB
- Moved settings to .env
- Agent now runs nicely inside the container as intended.
Updating dependencies and checking build
V2.0.11
Update docker-image.yml
2.0.0 (May 15, 2023)
Dynaconf configuration
The settings for the different data providers and target Dataverse instances have been moved to settings TOML files in the configuration directory. An entry ingestion workflow can now take a parameter that specifies which settings dictionary it will use. With this setup, the separate workflows for specific dataverses or subverses have been removed.
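A minimal sketch of how such an entry workflow might pick its settings dictionary, assuming Dynaconf loads the TOML files from the configuration directory; the file paths, block name, and setting keys below are illustrative, not the repository's actual ones.

```python
# Hypothetical sketch: selecting a provider/target settings block by name.
from dynaconf import Dynaconf
from prefect import flow

settings = Dynaconf(
    settings_files=[
        "configuration/settings.toml",
        "configuration/.secrets.toml",
    ]
)

@flow
def entry_ingestion_workflow(settings_dict_name: str):
    # e.g. "dataversenl" selects the [dataversenl] table from the TOML files
    provider = settings[settings_dict_name]
    # Placeholder keys; the actual setting names may differ.
    print(provider.SOURCE_OAI_ENDPOINT, provider.DESTINATION_DATAVERSE_URL)
```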
MinIO
The data that is ingested by the workflows in the orchestrator is now expected to be in a bucket in a MinIO object store. Local data ingestion is no longer possible. The setup for the object store needs to be added to the .secrets.toml in the configuration directory.
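A sketch of what fetching harvested metadata from such a bucket could look like with the MinIO Python client; the endpoint, credentials, bucket, and object names are placeholders that would come from .secrets.toml in practice.

```python
# Hypothetical sketch: reading harvested metadata from a MinIO bucket.
from minio import Minio

client = Minio(
    "minio.example.org:9000",   # placeholder endpoint from .secrets.toml
    access_key="ACCESS_KEY",    # placeholder credentials
    secret_key="SECRET_KEY",
    secure=True,
)

response = client.get_object("metadata-bucket", "dataversenl/record-001.xml")
try:
    xml_metadata = response.read()
finally:
    response.close()
    response.release_conn()
```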
Universal Dataverse2Dataverse ingestion workflow
All dataverse to dataverse ingestion is now done using the same workflow. Any source-specific refinements that the metadata needs are handled in the metadata-refiner service.
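A rough sketch of what this single flow could look like, with per-source refinement delegated to the metadata-refiner service; the endpoint parameter and payload shape are assumptions, not the service's documented API.

```python
# Hypothetical sketch: one flow for every source, refinement delegated to a service.
import requests
from prefect import flow, task

@task
def refine_metadata(metadata: dict, refiner_endpoint: str) -> dict:
    # Source-specific fixes live in the metadata-refiner service, not in the flow.
    response = requests.post(refiner_endpoint, json=metadata, timeout=60)
    response.raise_for_status()
    return response.json()

@flow
def dataverse_to_dataverse_ingestion(metadata: dict, refiner_endpoint: str) -> dict:
    # The same flow is used for every source dataverse; only the refinement differs.
    return refine_metadata(metadata, refiner_endpoint)
```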
Minor changes
- Dockerfile update to allow for the easy addition of new poetry packages.
- Added jmespath for querying fields from JSON metadata (see the sketch after this list).
- Mapping file has been updated for use with the new dataverse-mapper.
- CBS ingestion workflow now includes an email sanitization task.
- Added metadata refinement task that is used in d2d ingestion workflow.
- Refactored xml2json task to work with metadata fetched from minio.
- Metadata fetcher service no longer needs a Dataverse source API key.
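For illustration, this is roughly how jmespath can pull a field out of Dataverse-style JSON metadata; the query path and the metadata snippet are examples only, not the repository's actual mapping.

```python
import jmespath

# Example metadata shaped like a Dataverse dataset JSON export (illustrative only).
metadata = {
    "datasetVersion": {
        "metadataBlocks": {
            "citation": {
                "fields": [{"typeName": "title", "value": "Example dataset"}]
            }
        }
    }
}

# Query the title value from the citation block.
title = jmespath.search(
    "datasetVersion.metadataBlocks.citation.fields[?typeName=='title'].value | [0]",
    metadata,
)
print(title)  # Example dataset
```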
v1.0.0-beta
Beta v1.0.0
DataverseNL workflow
The DataverseNL workflow can be used to ingest metadata from the DataverseNL Dataverse instance into another Dataverse instance. The XML metadata is harvested using OAI-PMH as oai_dc (Dublin Core). The harvested metadata is ingested using a Prefect workflow. Every subverse in DataverseNL that contains social science data has its own entry point workflow. All subverses use the same DataverseNL workflow for the actual ingestion of the metadata of the datasets.
The data is transformed to JSON, the ID is extracted and used to fetch the Dataverse JSON metadata. This metadata is then cleaned and imported into Dataverse. Finally, the publication date is updated and the dataset is published. All the tasks use external services except for the cleaning step.
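As an illustration of the harvesting step, this is what fetching oai_dc records over OAI-PMH could look like with the Sickle client; the release notes do not say which OAI-PMH client the workflow actually uses, and the endpoint and set name below are placeholders.

```python
# Hypothetical sketch of the OAI-PMH harvesting step.
from sickle import Sickle

harvester = Sickle("https://dataverse.nl/oai")  # placeholder OAI-PMH endpoint
records = harvester.ListRecords(metadataPrefix="oai_dc", set="example_subverse")

for record in records:
    identifier = record.header.identifier  # ID later used to fetch the Dataverse JSON
    oai_dc_xml = record.raw                # Dublin Core XML to be transformed to JSON
```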
File management
The entry workflows that the data providers use to start the ingestion process have been put into their own directory. The dataset ingestion workflows have also been given a dedicated folder. Both live under the flows directory in scripts.
Workflow versioning
A URL pointing to the workflow version dictionary of a specific workflow is added to the ingested metadata. The URL is placed in a field of the provenance metadata block. The function that creates the dictionary is called in the entry point workflow. You can specify which services are used by a workflow. For every service you get a dictionary that contains the latest GitHub release, the latest Docker image tag, the service version, and its endpoint.
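A rough sketch of the shape such a version dictionary could take; the key names, the GitHub organisation used for the release lookup, and the placeholder values are assumptions rather than the function's actual implementation.

```python
# Hypothetical sketch of building a per-service version dictionary.
import requests

def github_latest_release(owner: str, repo: str) -> str | None:
    # The GitHub REST API exposes the latest release tag for a repository.
    url = f"https://api.github.com/repos/{owner}/{repo}/releases/latest"
    response = requests.get(url, timeout=30)
    return response.json().get("tag_name") if response.ok else None

def create_version_dict(services: dict[str, str]) -> dict:
    # `services` maps a service name to its endpoint; the names here are placeholders.
    return {
        name: {
            "github_release": github_latest_release("odissei-data", name),
            "docker_image_tag": "latest",   # placeholder for the real image tag lookup
            "service_version": None,        # placeholder for the service's own version
            "endpoint": endpoint,
        }
        for name, endpoint in services.items()
    }

version_dict = create_version_dict({"dataverse-mapper": "https://example.org/mapper"})
```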
Alpha v0.2.0
Alpha v0.2.0
Features
New features for the orchestrator:
- Possible to deploy multiple workflows from different data providers at the same time
- Uses different env variables for each data provider (check new dot_env_example)
- Completed first version of the DataverseNL workflow
- Smaller fixes and changes to tasks and flows
Alpha v0.1.0
Alpha v0.1.0
Description
This first version of the orchestrator works with local files and a .env for a specific data provider. It needs to be redeployed to switch the workflow to a different data provider. The next version will make it possible to deploy flows for all providers at the same time. Included data providers are EASY, CBS, LISS and DataverseNL. All workflows are still experimental and are not the finished product.