Evaluation, Reproducibility, Benchmarks Meeting 5

AReinke edited this page Oct 30, 2020 · 3 revisions

Minutes of meeting 5

Date: 28th October 2020

Present: Lena, Carole, Bennett, David, Annika, Jorge, Michela, Paul


TOP 1: New team member

  • Bennett Landman joined the working group (see TOP 5)

TOP 2: Feedback from MONAI Bootcamp


TOP 3: Task forces reports

Easy data access task force (Michela)

  • Survey is ready, email draft to organizers ready
  • Send the email + survey to organizers from a personal email address (Michela will prepare the email; Lena will send it to the organizers)

Metrics task force (Carole):

  • Call with MONAI development team
  • Waiting for the evaluation implementation to be ready
  • Working on characteristics/definitions of metrics (e.g. correlation between metrics)
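As a concrete illustration of the relation between metrics discussed above: Dice and IoU (Jaccard) are deterministically linked via Dice = 2·IoU / (1 + IoU), so they are perfectly rank-correlated. A minimal pure-Python sketch (an illustration only, not the MONAI implementation):

```python
def dice_and_iou(pred, gt):
    """Compute Dice and IoU for two binary masks given as sets of pixel indices."""
    inter = len(pred & gt)
    dice = 2 * inter / (len(pred) + len(gt))
    iou = inter / len(pred | gt)
    return dice, iou

# Example masks (sets of flattened pixel indices) -- hypothetical toy data
pred = {1, 2, 3, 4}
gt = {3, 4, 5, 6}
dice, iou = dice_and_iou(pred, gt)
# Dice is a monotone function of IoU: Dice = 2*IoU / (1 + IoU)
assert abs(dice - 2 * iou / (1 + iou)) < 1e-12
```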

Benchmarking task force (Annika):

  • Working on visualization toolkit for benchmark results
  • Goal: MONAI outputs the evaluation metrics in the toolkit’s CSV input format
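A sketch of what such a metric-export step could look like: per-case metric values from an evaluation run written as a long-format CSV. The column names and the `results` structure are assumptions for illustration only, not the toolkit's actual input format:

```python
import csv
import io

# Hypothetical per-case metric results, e.g. collected from a MONAI evaluation run.
results = [
    {"case_id": "case_001", "algorithm": "alg_A", "metric": "dice", "value": 0.91},
    {"case_id": "case_001", "algorithm": "alg_A", "metric": "hausdorff95", "value": 4.2},
    {"case_id": "case_002", "algorithm": "alg_A", "metric": "dice", "value": 0.87},
]

def write_metrics_csv(rows, fh):
    """Write one metric value per row -- a long format that ranking/plotting tools can ingest."""
    writer = csv.DictWriter(fh, fieldnames=["case_id", "algorithm", "metric", "value"])
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_metrics_csv(results, buf)
```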

TOP 4: DELPHI workshop

  • Milestone: the BIAS guideline was integrated into the EQUATOR Network

  • Goal: best practices with respect to metrics in biomedical imaging

  • Decision on task categories: Instance/semantic segmentation, classification, detection

  • Date: 30th November

  • Potential speakers:

    • Michal Kozubek, Carole Sudre: Segmentation
    • Henning Müller: Classification
    • (Pierre Jannin: Registration)
    • Paul Jäger: Detection
    • Bram van Ginneken: Comparing AI with human expert performance
    • Julio Saez-Rodriguez: maybe experience from other initiatives (DREAM challenges)
    • Jorge Cardoso

Questionnaire (for workshop preparation)

  • From your perception: What are the most critical problems with respect to metrics in the field of biomedical image analysis?
  • What are topics that you would like to discuss during the workshop?
  • What would you like to see as the main output of the workshop (which would then serve as basis for the questionnaire-based DELPHI process)?
  • Given a new (clinical) problem: Based on which characteristics do you phrase the problem as a specific task (e.g. segmentation vs detection)?
  • For each task XY (segmentation, classification, detection)
    • Which (potentially complementary) properties should be assessed by an XY metric?
    • Available metrics
      • This is a list of metrics that have been used in XY tasks. Are you aware of any other?
      • Given a new (clinical) problem phrased as XY task: Based on which characteristics is an ideal metric typically chosen?
      • Can you point to papers on best practices for XY tasks inside or outside the field of medicine?
      • Are you aware of the misuse of specific metrics?
        • What are theoretical failure cases?
        • What are potential practical pitfalls (potentially resulting from specific implementations)?
        • Can you point to papers that highlight problems related to the choice of metrics (inside or outside the field of medicine)?
    • Relation between metrics
      • Are you aware of papers that highlight the relation between different metrics?
  • Aggregation of metrics
    • What are pitfalls regarding the aggregation of metric values?
  • Comments (anything)

TOP 5: Bennett's initiative

  • Model zoo with standard API
  • Inference only
  • Collaboration with MONAI
  • 9 months budget
  • Coordinate the model zoo (Research WG) with Bennett’s initiative