As of 2024-07-17, this repo contains both the production data model used by the ELITE portal to submit and validate metadata through the Data Curator App and the data dictionary website, which is based on the data model and provides definitions for all metadata templates and terms used in it.
There is a separate data-dictionary repo that contains the same source code. It can later be used to deploy the website, once we are able to set up automation in that repository that successfully monitors this repository for changes. To simplify the process, for now we will use this data-models repo to manage both the data model and the dictionary.
- EL Data Model
- EL Metadata Dictionary Site
- Other things you can do in this repository
EL.data.model.* (csv | jsonld): this is the current, "live" version of the EL Portal data model. It is being used by both the staging and production versions of the multitenant Data Curator App.
The main branch of this repo is protected, so you cannot push changes to main. To edit the data model, create a new branch of this repository and make changes to the attribute csv files in the `modules/` subdirectory. Once you have made your changes, open a pull request. This will trigger a Github Action that automatically joins the attributes from the module csv, converts the csv data model to the json-ld format, and commits the changes to your PR. Please do not make changes to `EL.data.model.csv` or `EL.data.model.jsonld` by hand!
The full `EL.data.model.csv` file has over 200 attributes and is unwieldy to edit and hard to review changes for. For ease of editing, the full data model is divided into "module" subfolders, like so:
data-models/
├── EL.data.model.csv (do not edit!)
├── EL.data.model.jsonld (do not edit!)
└── modules/
    ├── biospecimen/
    │   ├── specimenID.csv
    │   ├── organ.csv
    │   └── tissue.csv
    └── sequencing/
        ├── readLength.csv
        └── platform.csv
Within each module, every attribute in the data model has its own csv, named after that attribute (example: `organ.csv`).
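Because each attribute csv is named after its attribute, the file to edit can be located by name. The following is a minimal sketch, assuming the `modules/<module>/<attribute>.csv` layout shown above; the helper name `find_attribute_csv` is hypothetical and not part of this repository.

```python
from pathlib import Path


def find_attribute_csv(modules_dir, attribute):
    """Return every modules/<module>/<attribute>.csv path, sorted.

    Hypothetical helper: assumes one level of module subfolders,
    as in the directory tree above.
    """
    return sorted(Path(modules_dir).glob(f"*/{attribute}.csv"))


if __name__ == "__main__":
    # Run from the top-level directory of the repository.
    for path in find_attribute_csv("modules", "organ"):
        print(path)
```

More than one match would suggest that the same attribute name exists in two modules, which is probably worth checking before editing.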
Some common data model editing scenarios are:
- If you wanted to add a new valid value "eyeball" to our existing column attribute "organ", after making a new branch and opening the repo either locally or within a codespace, you would go to `modules/biospecimen/organ.csv`.
  - Next, find the row for the attribute "organ" (should be the first row), and within the valid values column, add "eyeball" to the comma-separated list of valid values.
  - Save your changes and write an informative commit. Please try to add valid values alphabetically!
- If you wanted to add the column "furColor" to the "model-ad_individual_animal_metadata" template, first decide which module the new column should belong to. In this case, "MODEL-AD" makes the most sense.
  - Within the `MODEL-AD` subfolder, create a new csv called `furColor.csv` with the required schematic column headers. Describe the attribute "furColor" as necessary and make sure `Parent` = `ManifestColumn`. Add any valid values for "furColor" as new rows to this csv as described in the previous scenario.
  - Find the manifest template attributes in `modules/template/templates.csv`. In the "model-ad_individual_animal_metadata" row, add your new column "furColor" to the comma-separated list of attributes in the `DependsOn` column.
  - Save your changes and write an informative commit.
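The "add a valid value alphabetically" step from the first scenario can be sketched as a small string operation on the contents of the valid values cell. This is only an illustration: the function name `add_valid_value` is hypothetical, the `", "` separator is an assumption about how the cell is written, and values that themselves contain commas are not handled.

```python
def add_valid_value(cell: str, new_value: str) -> str:
    """Add new_value to a comma-separated cell and return it sorted.

    Hypothetical helper: assumes values are separated by commas and
    contain no commas themselves.
    """
    values = [value.strip() for value in cell.split(",") if value.strip()]
    if new_value not in values:
        values.append(new_value)
    # Case-insensitive sort keeps the list alphabetical.
    return ", ".join(sorted(values, key=str.casefold))


if __name__ == "__main__":
    print(add_valid_value("brain, heart, liver", "eyeball"))
```

Note that sorting also reorders any existing values that were not already alphabetical, which makes the diff larger than a single inserted word.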
For more advanced data modeling scenarios like adding conditional logic, creating validation rules, or creating new manifests, please consult the #ad-dcc-team slack channel.
A persistent issue is that manually editing csvs is challenging. Some columns in our modules are very short, and others are veeeeery long (Description, Valid Values). Some options for working on csvs, and their pros and cons:
- Editing in the Github UI : convenient, but challenging to keep track of columns in plain text format.
- Cloning the repo, making a branch, and opening csvs locally in Excel or another spreadsheet program 🖥️ : probably the best UI experience, but involves a few extra steps with git.
- Using a Github codespace to launch VSCode in the browser, and editing with the pre-installed RainbowCSV extension 🌈 : Still difficult to edit csvs as plain text, but the color formatting and ability to use a soft word wrap makes it much easier to distinguish columns. RainbowCSV lets you designate "sticky" rows and columns for easier scrolling, and also has a nice "CSVLint" function that will check for formatting errors after you make changes.
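Whichever editor you use, one common formatting error is a row that ends up with more or fewer fields than the header, for example after an unquoted comma is added to a long cell. The following is a rough sketch of that single check using only the Python standard library; it is not a replacement for RainbowCSV's CSVLint, and the function name `find_ragged_rows` is hypothetical.

```python
import csv
import io


def find_ragged_rows(csv_text: str) -> list[tuple[int, int]]:
    """Return (record_number, field_count) for rows that do not match the header.

    Record numbers start at 1 for the header, so the first data row is 2.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []
    expected = len(header)
    return [
        (record_number, len(row))
        for record_number, row in enumerate(reader, start=2)
        if len(row) != expected
    ]


if __name__ == "__main__":
    sample = (
        "Attribute,Description,Valid Values\n"
        'organ,"An organ, e.g. heart","brain, heart"\n'
        "tissue,Missing a column\n"
    )
    print(find_ragged_rows(sample))
```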
If you add a new template manifest (e.g. for a new assay type), remove an existing manifest, or rename a manifest, you need to update the `dca-template-config.yml` file that DCA uses to populate the menu contributors will use to select their template. To do this, you must manually trigger the Github Action `create-template-config.yml`. This will re-create the DCA template config file and open a new PR with the changes. Review and merge the PR to complete the template config update. You can use the default input values provided when you manually trigger this workflow.
The Metadata Dictionary site is at: https://eliteportal.github.io/data-models/.
The EL Metadata Dictionary is a Jekyll site that uses the Just the Docs theme and is published on GitHub Pages.
- `index.md` is the home page
- `_config.yml` can be used to tweak Jekyll settings, such as theme, title
- `_layout/` contains html templates we use to generate the web pages for each data model term
- `_data/` folder stores data for Jekyll to use when generating the site
- files in `docs/` will be accessed by GitHub Actions workflow to build the site
- two scripts in `processes/` can be run to generate updated files in `_data/` and `docs/` to publish changes in the data model to the dictionary site
- `.env` contains the link to the data model that the dictionary site is based on
- `Gemfile` lists the package dependencies for building the website
- `pyproject.toml` and `poetry.lock` list the python and package dependencies for the scripts that update both the data model and the data dictionary site
- You can add additional descriptions to the home page or a specific page by directly editing `index.md` or markdown files in `docs/`.
- The dictionary site materials should be updated after you make changes to the data model. Once a PR with changes is reviewed and merged into main, the Github Action in `update_metadata_dictionary.yml` should automatically start. This action will update the files in `_data/` and `docs/` that are used to populate the dictionary website.
- Once any changes are detected in the `_data/` or `docs/` folders on the main branch, another Github action called `pages.yml` will run to update the deployment to the Github pages website. Verify that the dictionary site looks as expected at https://eliteportal.github.io/data-models/.
1. Start your codespace or build a new one. The codespace should build with a container image that includes the package manager `poetry`. You don't need to install poetry. It should also run the command `poetry install` after you launch it, which will tell poetry to install all the python libraries that are specified by this project (this will include schematic).
2. Make a new branch. On that branch, make and commit any changes. Please write informative commit messages in case we need to track down data model inconsistencies or introduced bugs.
3. Still in the top-level directory, run `poetry run python data_model_creation/join_data_model.py` from the terminal. This will run a python script that joins all the module csvs, does a few data frame quality checks, and uses `schematic schema convert` to create the updated json-ld data model.
4. If the script succeeds, double check the version control history of your json-ld data model and make sure the changes you expected have been made! Save and commit all changes, then push your local branch to the remote.
5. Open a pull request and request review from someone else on the EL DCC team. The Github Action that runs when you open a PR will currently fail -- you can ignore this. EL DCC team will perform manual checks before merging changes.
6. After the PR is merged, delete your branch.
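To give a feel for what "joining the module csvs" and a basic quality check involve, here is a minimal standard-library sketch. It is not the code in `data_model_creation/join_data_model.py`; the function names are hypothetical, and the `Attribute` column name is an assumption based on the schematic csv headers.

```python
import csv
from pathlib import Path


def join_module_csvs(modules_dir):
    """Read every modules/<module>/*.csv and return (fieldnames, rows)."""
    fieldnames, rows = [], []
    for csv_path in sorted(Path(modules_dir).glob("*/*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Keep the union of headers, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    return fieldnames, rows


def find_duplicate_attributes(rows, key="Attribute"):
    """Return attribute names that appear in more than one row."""
    seen, duplicates = set(), []
    for row in rows:
        name = row.get(key)
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


if __name__ == "__main__":
    # Run from the top-level directory of the repository.
    _, joined_rows = join_module_csvs("modules")
    print(len(joined_rows), "rows;", "duplicates:", find_duplicate_attributes(joined_rows))
```

A duplicate check like this is only one example of a quality check; the real script may check different things.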
1. Start your codespace or build a new one. The codespace should build with a container image that includes the package manager `poetry`. You don't need to install poetry. It should also run the command `poetry install` after you launch it, which will tell poetry to install all the python libraries that are specified by this project (this will include schematic).
2. Follow steps 2-4 above.
3. [Optional]: to generate a test manifest, run `poetry run schematic manifest -c path/to/config.yml get -dt RelevantDataType -s` from the terminal. This will generate a json schema, a manifest csv, and a link to a google sheet version of the manifest. DO NOT put any real data in the google sheet manifest! This is just an integration test to see if the manifest columns and drop downs look as expected. Don't commit the json schema and the manifest csv generated during this step to your branch -- these are ephemeral and should be deleted.
4. Open a pull request and request review from someone else on the EL DCC team. The Github Action that runs when you open a PR will currently fail -- you can ignore this. EL DCC team will perform manual checks before merging changes.
5. After the PR is merged, delete your branch.
1. Start your codespace or build a new one. The codespace should build with a container image that includes the package manager `poetry`. You don't need to install poetry. It should also run the command `poetry install` after you launch it, which will tell poetry to install all the python libraries that are specified by this project (this will include schematic).
2. Make a new branch.
3. From the top-level data-models directory, run `poetry run python processes/data_manager.py`. This should update some files within `_data/`.
4. Then run `poetry run python processes/page_manager.py`. This should update files within `docs/`.
5. Optional: you can run `poetry run python processes/create_network_graph.py` to create the schema visualization network graph. This is out of date and relatively unused, but it will be good to update and make more robust later.
6. Commit changes to your branch and open a PR. After review is passed and the changes are merged to main, a Github action will run via the `pages.yml` workflow to build and deploy the site to https://eliteportal.github.io/data-models/
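The two required commands in the steps above have to run in order, and the second should not run if the first fails. One way to sketch that is a small wrapper script; this is an illustration only, and `run_update_steps` is a hypothetical name, not a script that exists in this repository.

```python
import subprocess

# The two required commands from the steps above, in order.
UPDATE_STEPS = [
    ["poetry", "run", "python", "processes/data_manager.py"],
    ["poetry", "run", "python", "processes/page_manager.py"],
]


def run_update_steps(steps=UPDATE_STEPS, runner=subprocess.run):
    """Run each command in order, stopping at the first failure.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit code.
    """
    for command in steps:
        print("Running:", " ".join(command))
        runner(command, check=True)


if __name__ == "__main__":
    # Run from the top-level data-models directory.
    run_update_steps()
```

The `runner` parameter exists only so the ordering can be exercised without actually invoking poetry.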
1. Make sure you have the `poetry` dependency manager installed in your workspace.
2. Follow steps 2-5 from the section above.
3. Optional: Preview the website locally by running `bundle exec jekyll serve`.
4. Commit changes to your branch and open a PR. After review is passed and the changes are merged to main, a Github action will run via the `pages.yml` workflow to build and deploy the site to https://eliteportal.github.io/data-models/
- Install Jekyll and Bundler: `gem install bundler jekyll`
- Install the site's gem dependencies with Bundler: `bundle install`
- Run `bundle exec jekyll serve` to build your site and preview it at http://localhost:4000. The built site is stored in the directory `_site`.
❓status unknown
Use `scraping_valid_values.py` to pull in values from EBI OLS sources.
❓status unknown
`dcc_config_repo_dispatch.yml` -- Not sure what this is for, still investigating its use. Authorization is failing.
Schematic API Visualization Repository
- Creates a network graph of the data model. Aim is to help see connections between components.
Software packages installed
- Poetry - See installation guide here
- `EL.data.model.csv`: The CSV representation of the example data model. This file is created by the collective effort of data curators and annotators from a community (e.g. ELITE), and will be used to create a JSON-LD representation of the data model.
- `EL.data.model.jsonld`: The JSON-LD representation of the example data model, which is automatically created from the CSV data model using the schematic CLI. More details on how to convert the CSV data model to the JSON-LD data model can be found here. This is the central schema (data model) which will be used to power the generation of metadata manifest templates for various data types (e.g., `scRNA-seq Level 1`) from the schema.
- `config.yml`: The schematic-compatible configuration file, which allows users to specify values for application-specific keys (e.g., path to Synapse configuration file) and project-specific keys (e.g., Synapse fileview for community project). A description of what the various keys in this file represent can be found in the Fill in Configuration File(s) section of the schematic docs.
After cloning the repository, run the following command:
poetry install
./change-log.md