Evaluation, Reproducibility, Benchmarks Meeting 29

Minutes of Meeting 29

Date: 23rd October, 2024

Broadly:

Specific Ideas:

Implement ranking analysis in MONAI (the current implementation is in R)
Confidence intervals for hierarchical datasets
A shared codebase that is not mature and not meant for deployment, but intended for our own experiments
Gathering the data needed to inform the consensus recommendations that we'd like to present
- Specifically, inference results for many algorithms over many tasks
What recommendations would we like to make about inference and comparing algorithms?
- Proper use of bootstrapping
- Aggregation in the presence of hierarchical data (e.g. IPD meta-analysis or random-effects analysis)
- Graphs
- Recommendations w.r.t. standard deviation
  - Foundations for computing it
  - Across data points vs. across folds
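To make the bootstrapping and standard deviation items concrete, here is a minimal sketch, not an agreed method. It computes a percentile bootstrap confidence interval for a mean metric and contrasts the standard deviation across data points with the standard deviation across fold means. The Dice values, the fold layout, and the helper name `bootstrap_ci` are all hypothetical and only serve as illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the cases with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-case Dice scores, grouped by cross-validation fold.
folds = [
    [0.81, 0.85, 0.79, 0.90],
    [0.76, 0.88, 0.83, 0.80],
    [0.92, 0.87, 0.84, 0.78],
]
all_cases = [score for fold in folds for score in fold]

mean_dice = statistics.fmean(all_cases)
ci_low, ci_high = bootstrap_ci(all_cases)

# Spread across data points: variability between individual cases.
sd_cases = statistics.stdev(all_cases)
# Spread across folds: variability between per-fold means.
sd_folds = statistics.stdev([statistics.fmean(fold) for fold in folds])

print(f"mean={mean_dice:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
print(f"SD across cases={sd_cases:.3f}, SD across fold means={sd_folds:.3f}")
```

The two standard deviations answer different questions (case-to-case variability vs. fold-to-fold variability of the average), which is why a recommendation on which one to report seems worthwhile.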
More
- Do we need to move to parametric approaches?
  - If the assumptions are fulfilled, parametric approaches are usually more accurate and give more statistical power
  - Allows for Bayesian approaches (credible intervals, etc.)
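As one possible illustration of a parametric approach, the sketch below computes a normal-approximation confidence interval for a mean metric. It assumes the sample mean is approximately normally distributed; for small samples a Student's t quantile would give a wider and more appropriate interval. The scores and the helper name `normal_ci` are hypothetical.

```python
import math
import statistics


def normal_ci(values, alpha=0.05):
    """Parametric CI for the mean, assuming an approximately normal sample mean."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    # Two-sided standard normal quantile (about 1.96 for alpha = 0.05).
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return mean - z * sem, mean + z * sem


# Hypothetical per-case metric values for one algorithm on one task.
scores = [0.81, 0.85, 0.79, 0.90, 0.76, 0.88, 0.83, 0.80, 0.92, 0.87, 0.84, 0.78]
low, high = normal_ci(scores)
print(f"95% normal-approximation CI: ({low:.3f}, {high:.3f})")
```

Whether such an interval is trustworthy depends on the normality assumption holding for the metric in question, which bounded metrics such as Dice may violate.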

Short term

It would be great to get something out soon that may not be complete, but covers many common use cases
- Would be very nice to include hierarchical data, but might not be feasible
  - Non-independence of the datasets needs to be addressed (video frames, for example)
- Re: Parametric vs nonparametric
  - We should let the empirical data guide us
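The non-independence concern (video frames, for example) can be illustrated with a cluster bootstrap, which resamples whole videos instead of individual frames. This is a minimal sketch under assumed data, not a recommendation the group has agreed on. The video names, the scores, and the helper name `cluster_bootstrap_ci` are hypothetical.

```python
import random
import statistics


def cluster_bootstrap_ci(clusters, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling whole clusters.

    `clusters` maps a cluster id (e.g. a video) to its per-frame scores.
    """
    rng = random.Random(seed)
    ids = list(clusters)
    means = []
    for _ in range(n_resamples):
        # Draw clusters with replacement and keep all frames of each draw.
        sampled = rng.choices(ids, k=len(ids))
        pooled = [s for cid in sampled for s in clusters[cid]]
        means.append(statistics.fmean(pooled))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-frame scores; frames within a video are highly correlated.
videos = {
    "video_a": [0.90, 0.91, 0.89, 0.90],
    "video_b": [0.70, 0.72, 0.71, 0.69],
    "video_c": [0.80, 0.81, 0.79, 0.80],
    "video_d": [0.60, 0.62, 0.61, 0.59],
    "video_e": [0.85, 0.86, 0.84, 0.85],
}

# Cluster-aware interval: the video is the resampling unit.
c_low, c_high = cluster_bootstrap_ci(videos)

# Naive interval: every frame is treated as its own independent cluster.
frames = [s for scores in videos.values() for s in scores]
n_low, n_high = cluster_bootstrap_ci({i: [s] for i, s in enumerate(frames)})

print(f"cluster bootstrap CI: ({c_low:.3f}, {c_high:.3f})")
print(f"naive frame-level CI: ({n_low:.3f}, {n_high:.3f})")
```

With correlated frames, the naive interval is narrower than the cluster-aware one, which is the overconfidence that ignoring the hierarchy would introduce.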

Some preliminary data to start with

The Decathlon might be a good fit
- We've already worked with this data
- Has many tasks
- Segmentation is very common
- Can this be shared?
  - Metrics themselves: almost certainly, since they are public in most cases
  - Predictions: sure, but they won't be useful without the ground truth, which can't be shared
  - Michela can compute new metrics if needed
MICCAI 2015 challenges
- Have lots of metrics for these, but they're somewhat outdated

Copyright (c) MONAI Consortium