Evaluation, Reproducibility, Benchmarks Meeting 5
Date: 28th October 2020
Present: Lena, Carole, Bennett, David, Annika, Jorge, Michela, Paul
- Bennett Landman joined the working group (see agenda item 5)
- Only a few participants => post a poll on Twitter
- Slides with Results: https://docs.google.com/presentation/d/1SrL7Rp7NbRcJwCgRLVCZ3gqNlb_Xh8d7oo7qVx8XAZU/edit?usp=sharing
- Consider different types of MONAI Bootcamps (e.g. for validation)
- Survey is ready; the email draft to the organizers is ready
- Send the email and survey from a personal email address (Michela will prepare the email; Lena will send it to the organizers)
- Call with MONAI development team
- Waiting for the evaluation implementation to be ready
- Working on characteristics/definitions of metrics (e.g. correlations between metrics)
- Working on visualization toolkit for benchmark results
- Goal: MONAI outputs the evaluation metrics in the toolkit’s CSV input format (see the sketch below)
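As a concrete illustration of that goal, here is a minimal sketch of per-case metric values being written to such a CSV. The Dice implementation and the column names (`case_id`, `metric`, `value`) are assumptions for the example; the actual schema would be defined by the visualization toolkit.

```python
import csv

import torch


def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> float:
    """Sorensen-Dice coefficient for two binary masks of equal shape."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().item()
    denominator = pred.sum().item() + target.sum().item()
    return (2.0 * intersection + eps) / (denominator + eps)


# Hypothetical per-case predictions/references (toy 4x4 masks).
cases = {
    "case_001": (torch.ones(4, 4), torch.ones(4, 4)),  # perfect overlap -> 1.0
    "case_002": (torch.eye(4), torch.ones(4, 4)),      # partial overlap -> 0.4
}

# "case_id", "metric", "value" is an assumed schema, not the toolkit's actual format.
with open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["case_id", "metric", "value"])
    for case_id, (pred, ref) in cases.items():
        writer.writerow([case_id, "dice", f"{dice_score(pred, ref):.4f}"])
```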
- Milestone: BIAS guideline was integrated into the EQUATOR Network
- Goal: best practices with respect to metrics in biomedical imaging
- Decision on task categories: instance/semantic segmentation, classification, detection
- Date: 30th November
- Potential speakers:
- Michal Kozubek, Carole Sudre: Segmentation
- Henning Müller: Classification
- (Pierre Jannin: Registration)
- Paul Jäger: Detection
- Bram van Ginneken: Comparing AI with human expert performance
- Julio Saez-Rodriguez: maybe experience from other initiatives (DREAM challenges)
- Jorge Cardoso
- From your perspective: what are the most critical problems with respect to metrics in the field of biomedical image analysis?
- What are topics that you would like to discuss during the workshop?
- What would you like to see as the main output of the workshop (which would then serve as the basis for the questionnaire-based Delphi process)?
- Given a new (clinical) problem: Based on which characteristics do you phrase the problem as a specific task (e.g. segmentation vs detection)?
- For each task XY (segmentation, classification, detection):
- Which (potentially complementary) properties should be assessed by an XY metric?
- Available metrics
- This is a list of metrics that have been used in XY tasks. Are you aware of any others?
- Given a new (clinical) problem phrased as XY task: Based on which characteristics is an ideal metric typically chosen?
- Can you point to papers on best practices for XY tasks inside or outside the field of medicine?
- Are you aware of the misuse of specific metrics?
- What are theoretical failure cases?
- What are potential practical pitfalls (potentially resulting from specific implementations)?
- Can you point to papers that highlight problems related to the choice of metrics (inside or outside the field of medicine)?
- Relation between metrics
- Are you aware of papers that highlight the relation between different metrics?
- Aggregation of metrics
- What are pitfalls regarding the aggregation of metric values? (see the sketch after this list)
- Comments (anything)
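To make the aggregation question concrete, below is a small sketch of two well-known pitfalls: a single failed case dominating the mean, and the choice of how to handle undefined metric values (e.g. Dice with an empty reference and an empty prediction) changing the aggregate. All numbers are invented for illustration, not real benchmark results.

```python
import math
import statistics

# Three good cases, one total failure, and one case where the metric is
# undefined (e.g. Dice with an empty reference AND an empty prediction -> NaN).
per_case_dice = [0.92, 0.90, 0.91, 0.0, math.nan]

valid = [v for v in per_case_dice if not math.isnan(v)]
print("mean (NaN dropped):  ", statistics.mean(valid))    # ~0.68, dragged down by the failure
print("median (NaN dropped):", statistics.median(valid))  # 0.905, hides the failure entirely

# Treating the undefined case as a perfect score instead shifts the aggregate:
as_one = [1.0 if math.isnan(v) else v for v in per_case_dice]
print("mean (NaN -> 1.0):   ", statistics.mean(as_one))   # ~0.75
```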
- Model zoo with standard API (a hypothetical interface sketch follows this block)
- Inference only
- Collaboration with MONAI
- 9-month budget
- Model zoo (Research WG) with Bennett’s initiative
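As a purely hypothetical sketch of what an "inference only" standard API could look like, the snippet below defines a minimal contract plus a dummy model satisfying it. None of the names (`ZooModel`, `preprocess`, `predict`) come from the meeting or from MONAI; they are placeholders.

```python
from typing import Protocol

import torch


class ZooModel(Protocol):
    """Hypothetical minimal contract every model-zoo entry would implement."""

    name: str

    def preprocess(self, image: torch.Tensor) -> torch.Tensor: ...
    def predict(self, batch: torch.Tensor) -> torch.Tensor: ...


class ToyThresholdSegmenter:
    """Dummy entry satisfying the contract: thresholds intensities at 0.5."""

    name = "toy-threshold-segmenter"

    def preprocess(self, image: torch.Tensor) -> torch.Tensor:
        lo, hi = image.min(), image.max()
        return (image - lo) / (hi - lo + 1e-8)  # rescale to [0, 1]

    def predict(self, batch: torch.Tensor) -> torch.Tensor:
        return (batch > 0.5).float()  # binary mask


# Inference-only usage: preprocess, then predict; no training entry points.
model: ZooModel = ToyThresholdSegmenter()
mask = model.predict(model.preprocess(torch.rand(1, 1, 16, 16)))
print(model.name, mask.shape)
```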