You can either build off of this repository template or use it as a reference to build your scripts from scratch. Provided here is a sample evaluation template in Python; R support is TBD.
- Python 3.10+
- Docker (if containerizing manually)
- Determine the format of the predictions file, as this will help create the list of validation checks. Things to consider include:

  - file type (CSV? TSV?)
  - number of columns
  - column header names
  - column types
  - if a column is numeric (integer or float), is there a minimum value? A maximum?

  In addition to format, also consider:

  - can there be more than one prediction per ID/sample/patient?
  - does every ID/sample/patient need a prediction, or can some be null/NA?
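
  For example (illustrative only; the IDs and values below are made up), a two-column CSV of predictions matching the template's default `id` and `probability` columns described in the next step might look like:

  ```
  id,probability
  patient_01,0.87
  patient_02,0.13
  patient_03,0.55
  ```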
- Update `validate.py` so that it fits your needs. The template currently implements the following checks:

  - two columns named `id` and `probability` (extraneous columns will be ignored)
  - `id` values are strings
  - `probability` values are floats between 0.0 and 1.0, and cannot be null/None
  - there is one prediction per patient (so, no missing or duplicate patient IDs)
  - there are no extra predictions (so, no unknown patient IDs)
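
  As a rough sketch only (not the template's actual implementation; the function name and error messages are illustrative, pandas is assumed, and the gold-standard file is assumed to have an `id` column), the checks above could be expressed along these lines:

  ```python
  import pandas as pd


  def check_predictions(pred_file: str, gold_file: str) -> list[str]:
      """Return a list of validation errors (an empty list means valid)."""
      errors = []
      pred = pd.read_csv(pred_file)
      gold = pd.read_csv(gold_file)

      # Required columns; extraneous columns are simply ignored.
      if not {"id", "probability"}.issubset(pred.columns):
          return ["Predictions file must contain `id` and `probability` columns"]

      # `probability` must be a float between 0.0 and 1.0 and never null.
      probs = pd.to_numeric(pred["probability"], errors="coerce")
      if probs.isna().any():
          errors.append("`probability` contains null or non-numeric values")
      elif not probs.between(0.0, 1.0).all():
          errors.append("`probability` values must be between 0.0 and 1.0")

      # One prediction per patient: no duplicate, missing, or unknown IDs.
      pred_ids = set(pred["id"].astype(str))
      gold_ids = set(gold["id"].astype(str))
      if pred["id"].duplicated().any():
          errors.append("Duplicate patient IDs found")
      if gold_ids - pred_ids:
          errors.append("Predictions missing for some patient IDs")
      if pred_ids - gold_ids:
          errors.append("Unknown patient IDs found")
      return errors
  ```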
- Update `requirements.txt` with any additional libraries/packages used by the script.
- (optional) Locally run `validate.py` to ensure it can run successfully:

  ```
  python validate.py \
    -p PATH/TO/PREDICTIONS_FILE.CSV \
    -g PATH/TO/GOLDSTANDARD_FILE.CSV
  ```

  STDOUT will either be `VALIDATED` or `INVALID`, and full details of the validation check will be printed to `results.json`.
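
  Purely as an illustration (the exact structure depends on how you implement the script), `results.json` might report a status along with any error messages:

  ```json
  {
    "validation_status": "INVALID",
    "validation_errors": "`probability` values must be between 0.0 and 1.0"
  }
  ```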
- Determine the evaluation metrics and how they can be computed. We recommend evaluating at least two metrics: one for primary ranking and the other for breaking ties. You can also include additional metrics to give the participants more information about their performance, such as sensitivity, specificity, precision, etc.
- Update `score.py` so that it fits your needs. The template currently evaluates:

  - Area under the receiver operating characteristic curve (AUROC)
  - Area under the precision-recall curve (AUPRC)
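
  As a sketch of how such metrics could be computed with scikit-learn (the gold-standard label column name `disease_status` is a placeholder; use whatever column your gold-standard file provides):

  ```python
  import pandas as pd
  from sklearn.metrics import average_precision_score, roc_auc_score


  def compute_scores(pred_file: str, gold_file: str) -> dict:
      """Compute AUROC and AUPRC for a predictions file (sketch only)."""
      pred = pd.read_csv(pred_file)
      gold = pd.read_csv(gold_file)

      # Align predictions to the gold standard by patient ID.
      merged = gold.merge(pred, on="id")
      y_true = merged["disease_status"]  # placeholder column name
      y_score = merged["probability"]

      return {
          "auroc": roc_auc_score(y_true, y_score),
          "auprc": average_precision_score(y_true, y_score),
      }
  ```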
- Update `requirements.txt` with any additional libraries/packages used by the script.
- (optional) Locally run `score.py` to ensure it runs successfully and returns the expected scores:

  ```
  python score.py \
    -p PATH/TO/PREDICTIONS_FILE.CSV \
    -g PATH/TO/GOLDSTANDARD_FILE.CSV
  ```

  STDOUT will either be `SCORED` or `INVALID`, and scores will be appended to the existing `results.json`.
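
  Since scores are appended to the `results.json` produced during validation, the final step of `score.py` might look roughly like this (key names and values are placeholders):

  ```python
  import json

  scores = {"auroc": 0.91, "auprc": 0.84}  # placeholder values from your metric computation

  # Load the existing results.json written during validation, add the scores,
  # and write the updated file back out.
  with open("results.json") as f:
      results = json.load(f)

  results.update(scores)

  with open("results.json", "w") as f:
      json.dump(results, f, indent=2)
  ```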
This template repository comes with a workflow that will containerize the scripts for you. To trigger the workflow, you will need to create a new release. For tag versioning, we recommend following the SemVer versioning scheme (e.g., `v1.0.0`).
This workflow will create a new image within your repository, accessible under Packages. Here is an example of the deployed image for this template.
You can also use a Docker registry other than GHCR, for example, Docker Hub. The only requirement is that the image must be publicly accessible so that the ORCA workflow can access it.
To containerize your scripts:
- Open a terminal and switch directories to your local copy of the repository.
- Run the command:

  ```
  docker build -t IMAGE_NAME:TAG_VERSION FILEPATH/TO/DOCKERFILE
  ```

  where:

  - `IMAGE_NAME`: name of your image
  - `TAG_VERSION`: version of the image; if not supplied, `latest` will be used
  - `FILEPATH/TO/DOCKERFILE`: filepath to the Dockerfile; in this case, it will be the current directory (`.`)
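
  For example, building from the repository root (the image name and tag below are placeholders):

  ```sh
  # Build the image from the Dockerfile in the current directory.
  docker build -t ghcr.io/my-org/my-challenge-evaluation:v1.0.0 .
  ```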
- If needed, log into your registry of choice.
- Push the image:

  ```
  docker push IMAGE_NAME:TAG_VERSION
  ```
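
  For example, when pushing to the GitHub Container Registry (the username and image name below are placeholders):

  ```sh
  # Log in with a personal access token when prompted, then push.
  docker login ghcr.io -u YOUR_GITHUB_USERNAME
  docker push ghcr.io/my-org/my-challenge-evaluation:v1.0.0
  ```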