Evaluation, Reproducibility, Benchmarks Meeting 5

AReinke edited this page Oct 30, 2020 · 3 revisions

Minutes of meeting 5

Date: 28th October 2020

Present: Lena, Carole, Bennett, David, Annika, Jorge, Michela, Paul


TOP 1: New team member

  • Bennett Landman joined the working group (see TOP 5)

TOP 2: Feedback from MONAI Bootcamp


TOP 3: Task forces reports

Easy data access task force (Michela)

  • Survey is ready, email draft to organizers ready
  • Send the email + survey to organizers from a personal email address (Michela will prepare the email; Lena will send it to the organizers)

Metrics task force (Carole):

  • Call with MONAI development team
  • Waiting for the evaluation implementation to be ready
  • Working on characteristics/definitions of metrics (e.g. correlation between metrics)
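As a concrete illustration of the relation between metrics discussed above: Dice and IoU (Jaccard) are deterministically linked via Dice = 2·IoU / (1 + IoU), so they are perfectly rank-correlated. A minimal pure-Python sketch (an illustration only, not the MONAI implementation):

```python
def dice_and_iou(pred, gt):
    """Compute Dice and IoU for two binary masks given as sets of pixel indices."""
    inter = len(pred & gt)
    dice = 2 * inter / (len(pred) + len(gt))
    iou = inter / len(pred | gt)
    return dice, iou

# Example masks (sets of flattened pixel indices) -- hypothetical toy data
pred = {1, 2, 3, 4}
gt = {3, 4, 5, 6}
dice, iou = dice_and_iou(pred, gt)
# Dice is a monotone function of IoU: Dice = 2*IoU / (1 + IoU)
assert abs(dice - 2 * iou / (1 + iou)) < 1e-12
```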

Benchmarking task force (Annika):

  • Working on visualization toolkit for benchmark results
  • Goal: MONAI outputs the evaluation metrics in the toolkit’s CSV input format
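A sketch of what such a metric-export step could look like: per-case metric values from an evaluation run written as a long-format CSV. The column names and the `results` structure are assumptions for illustration only, not the toolkit's actual input format:

```python
import csv
import io

# Hypothetical per-case metric results, e.g. collected from a MONAI evaluation run.
results = [
    {"case_id": "case_001", "algorithm": "alg_A", "metric": "dice", "value": 0.91},
    {"case_id": "case_001", "algorithm": "alg_A", "metric": "hausdorff95", "value": 4.2},
    {"case_id": "case_002", "algorithm": "alg_A", "metric": "dice", "value": 0.87},
]

def write_metrics_csv(rows, fh):
    """Write one metric value per row -- a long format that ranking/plotting tools can ingest."""
    writer = csv.DictWriter(fh, fieldnames=["case_id", "algorithm", "metric", "value"])
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_metrics_csv(results, buf)
```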

TOP 4: DELPHI workshop

  • Milestone: the BIAS guideline was integrated into the EQUATOR Network

  • Goal: best practices with respect to metrics in biomedical imaging

  • Decision on task categories: Instance/semantic segmentation, classification, detection

  • Date: 30th November

  • Potential speakers:

    • Michal Kozubek, Carole Sudre: Segmentation
    • Henning Müller: Classification
    • (Pierre Jannin: Registration)
    • Paul Jäger: Detection
    • Bram van Ginneken: Comparing AI with human expert performance
    • Julio Saez-Rodriguez: maybe experience from other initiatives (DREAM challenges)
    • Jorge Cardoso

Questionnaire (for workshop preparation)

  • From your perception: What are the most critical problems with respect to metrics in the field of biomedical image analysis?
  • What are topics that you would like to discuss during the workshop?
  • What would you like to see as the main output of the workshop (which would then serve as basis for the questionnaire-based DELPHI process)?
  • Given a new (clinical) problem: Based on which characteristics do you phrase the problem as a specific task (e.g. segmentation vs detection)?
  • For each task XY (segmentation, classification, detection)
    • Which (potentially complementary) properties should be assessed by an XY metric?
    • Available metrics
      • This is a list of metrics that have been used in XY tasks. Are you aware of any other?
      • Given a new (clinical) problem phrased as XY task: Based on which characteristics is an ideal metric typically chosen?
      • Can you point to papers on best practices for XY tasks inside or outside the field of medicine?
      • Are you aware of the misuse of specific metrics?
        • What are theoretical failure cases?
        • What are potential practical pitfalls (potentially resulting from specific implementations)?
        • Can you point to papers that highlight problems related to the choice of metrics (inside or outside the field of medicine)?
    • Relation between metrics
      • Are you aware of papers that highlight the relation between different metrics?
  • Aggregation of metrics
    • What are pitfalls regarding the aggregation of metric values?
  • Comments (anything)

TOP 5: Bennett's initiative

  • Model zoo with standard API
  • Inference only
  • Collaboration with MONAI
  • 9 months budget
  • Coordinate the model zoo (Research WG) with Bennett’s initiative