
Questions About Pretraining and Baselines #18

guillaumejaume opened this issue Sep 10, 2024 · 13 comments

@guillaumejaume commented Sep 10, 2024

Hi,

Congrats on your accepted work! I have a few questions to better understand the model architecture and performance.

  • What patch encoder did you use in the CLAM baseline? Is it based on ResNet50 pretrained on ImageNet or based on CHIEF features? What about the ABMIL and DSMIL baselines? I couldn't find this information unless I missed it (which is likely).

  • The supplemental material includes some information about UNI (Chen et al., Nat. Medicine, 2024) e.g., in Supp Table 26. Could you provide information about the evaluation scenario? The caption reports that UNI used carefully curated image patches from the TCGA-COADREAD dataset through supervised learning techniques. As UNI is a patch encoder trained with SSL using DINOv2, I'm not sure I understand what this means.

  • Are you using slide-level diagnostic labels (e.g., TCGA oncotree code) during CHIEF pretraining? You report that "The second stage requires only WSI-level labels, enabling CHIEF to construct a holistic understanding of pathology images from global features." If so, what is the objective used? The section "CHIEF pretraining details" mentions both weakly-supervised learning and WSI-level contrastive learning. If diagnostic labels were used during pretraining, this potentially presents an unfair comparison with weakly-supervised methods like CLAM/DSMIL/ABMIL that presumably only use labels in the downstream task; e.g., breast subtyping becomes a lot easier if the model is explicitly trained for this task during pretraining.

  • Have you included all TCGA/PANDA slides in the 60K slides for CHIEF pretraining?

I realize this is a lot of questions, but your input would be very helpful in guiding me through the world of slide SSL.

Thanks!
Guillaume

@HHHedo commented Sep 11, 2024

Hi Xiyue,
Thanks for your great work!

  • I have the same question about the weakly-supervised pre-training.
  • Could you please provide more details about the downstream tasks? From my perspective, there are many ways to use CHIEF, but I don't know which is most reasonable.
    1. Directly use the pre-trained CHIEF for inference. As mentioned in the caption of Fig. 1, "We then used the pathology imaging features extracted by CHIEF to infer cancer types directly." Could you further explain it?
    2. Use the fixed pre-trained image or image+text features for downstream tasks by training a task-specific head.
    3. Use the pre-trained weights for initialization and then fine-tune the aggregator, the projection head from the text encoder, and the task-specific heads.

Thanks,
Tiancheng


@Xiyue-Wang (Collaborator) commented Sep 12, 2024

@guillaumejaume

(Quoting Guillaume's questions above.)

Thanks

  1. Yes, the baseline methods used in the comparisons were run with their official code.
  2. We copied the results from the published manuscript and compared performance on the MSI mutation task to ensure fidelity.
  3. No, we applied only anatomical information. Regarding "If so, what is the objective used?": it is used to train negative versus positive.
    Regarding "an unfair comparison with weakly-supervised methods like CLAM/DSMIL/ABMIL": note that several of our weakly supervised baselines (aggregation networks) were pretrained together using the same WSIs as CHIEF.
  4. Only some of the TCGA slides (the FFPE slides), and all of PANDA.

@Xiyue-Wang (Collaborator)

(Quoting Tiancheng's questions above.)

  1. "We then used the pathology imaging features extracted by CHIEF to infer cancer types directly." Could you further explain it?
    Because the weights of our slide-level aggregator are trained for this cancer classification task
  2. Get Use the fixed pre-trained image. And Use the pre-trained weights for initialization and then fine-tune the aggregator is better
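For readers weighing these two options, here is a minimal, self-contained PyTorch-style sketch of both regimes. This is not the released CHIEF API: the `GatedAttentionAggregator` below is a generic ABMIL-style stand-in for a pretrained slide-level aggregator, and all names, dimensions, and learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedAttentionAggregator(nn.Module):
    """Stand-in for a pretrained slide-level aggregator (ABMIL-style).
    In practice you would load the released pretrained weights instead."""
    def __init__(self, dim=768):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(dim, 256), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(dim, 256), nn.Sigmoid())
        self.attn_w = nn.Linear(256, 1)

    def forward(self, patch_features):                  # (N, dim) bag of patch features
        a = self.attn_w(self.attn_v(patch_features) * self.attn_u(patch_features))
        a = torch.softmax(a, dim=0)                     # (N, 1) attention over patches
        return (a * patch_features).sum(dim=0)          # (dim,) slide embedding

aggregator = GatedAttentionAggregator()                 # pretend these are pretrained weights
task_head = nn.Linear(768, 2)                           # task-specific head (e.g., subtyping)

# Option 2: keep the pretrained aggregator frozen, train only the head.
for p in aggregator.parameters():
    p.requires_grad = False
opt_frozen = torch.optim.Adam(task_head.parameters(), lr=1e-4)

# Option 3 (reported as better): initialize from the pretrained weights and
# fine-tune the aggregator together with the head, with a lower LR on the
# pretrained part.
for p in aggregator.parameters():
    p.requires_grad = True
opt_finetune = torch.optim.Adam(
    [{"params": aggregator.parameters(), "lr": 1e-5},
     {"params": task_head.parameters(), "lr": 1e-4}]
)

# One illustrative training step under option 3.
criterion = nn.CrossEntropyLoss()
patch_features = torch.randn(1200, 768)                 # one WSI as a bag of patch features
label = torch.tensor([1])

logits = task_head(aggregator(patch_features)).unsqueeze(0)
loss = criterion(logits, label)
loss.backward()
opt_finetune.step()
```

The practical difference is simply which parameters receive gradients: option 2 freezes the aggregator and trains only the small task head, while option 3 updates both.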

@HHHedo commented Sep 12, 2024

@Xiyue-Wang
Thanks for your help! I have a deeper understanding of your work now. However, I still have some follow-up questions.

  • According to your answers,

"No, We applied only anatomical information. ‘If so what is the objective used?’:Used to train negative and positive."

and

"Because the weights of our slide-level aggregator are trained for this cancer classification task".

I am a little bit confused about them. If only anatomical information is used, the weakly supervised task is to predict which anatomical site the WSI comes from, which is already encoded in the text encoder. If the slide-level diagnostic labels are used for weakly supervised pre-training, then the task is cancer classification, which makes sense to me that the slide-level aggregator can be directly used to infer cancer types.

@Xiyue-Wang
Copy link
Collaborator

@Xiyue-Wang Thanks for your help! I have a deeper understanding of your work now. However, I still have some follow-up questions.

  • According to your answers,

"No, We applied only anatomical information. ‘If so what is the objective used?’:Used to train negative and positive."

and

"Because the weights of our slide-level aggregator are trained for this cancer classification task".

I am a little bit confused about them. If only anatomical information is used, the weakly supervised task is to predict which anatomical site the WSI comes from, which is already encoded in the text encoder. If the slide-level diagnostic labels are used for weakly supervised pre-training, then the task is cancer classification, which makes sense to me that the slide-level aggregator can be directly used to infer cancer types.

Yes, you are right! If the slide-level diagnostic labels are used for weakly supervised pre-training, then the task is cancer classification, so the slide-level aggregator can be directly used to infer cancer types.

@Dadatata-JZ (Collaborator) commented Sep 13, 2024

Hi @guillaumejaume .

Your questions and discussions are always welcome! And congrats to you as well. Your contributions to UNI and other studies have been insightful for advancing computational pathology.

In addition to Xiyue's responses, I wanted to add a few points to help facilitate the discussion on two of your questions. Kindly let us know what you think!


Re: the evaluation scenario for UNI (Supp. Table 26):

To make sure we didn't misinterpret UNI, we extracted the numbers from the publication for comparison. At that time the paper was on arXiv, but this part should be identical to the later accepted version. We summarized this ("UNI used carefully curated image patches from the TCGA-COADREAD dataset through supervised learning techniques") based on our understanding of the UNI paper on arXiv (pages 34–45). We did a quick calculation of the total number of regions of interest (ROIs), which does not cover all available WSIs (nothing wrong with that, since artifacts and white background should be excluded). Therefore, we used "carefully curated" to make this clear. As highlighted in bold below, the labels were used for both training and evaluation, so we stated that the linear-probe tuning was under supervision.

CRC MSI prediction based on TCGA CRC-MSI:
"The CRC MSI prediction task is based on the 'TCGA CRC-MSI' dataset, which consists of 51,918 512 × 512 ROIs at approximately 0.5 mpp, extracted from H&E FFPE diagnostic histopathology WSIs of colorectal adenocarcinoma samples annotated and extracted from TCGA. ROIs were labeled with the following 2 classes according to the patient-level label of the sample: microsatellite instable (15,002 ROIs) and microsatellite stable (36,916 ROIs). For training and evaluation, we **used the official train-test folds (19,557:32,361 ROIs), which was used in linear probe, KNN, and SimpleShot evaluation**. We evaluate this dataset on resized ROIs of 448 × 448 image resolution at 0.57 mpp. To mitigate potential biases from site-specific H&E staining variability in TCGA, we used Macenko normalization to normalize all ROIs."
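For readers unfamiliar with the protocol, "linear probe" here means fitting a supervised linear classifier on frozen ROI embeddings; the SSL encoder itself is not updated. Below is a minimal, generic sketch of that evaluation loop (not UNI's actual evaluation code; the file names, shapes, and hyperparameters are assumptions).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Assumed precomputed ROI embeddings from a frozen patch encoder,
# split by the official train/test folds (paths are placeholders).
train_emb = np.load("train_embeddings.npy")   # (n_train, d)
train_lbl = np.load("train_labels.npy")       # 0 = MSS, 1 = MSI
test_emb = np.load("test_embeddings.npy")     # (n_test, d)
test_lbl = np.load("test_labels.npy")

# Linear probe: only this classifier sees the labels; the encoder stays
# frozen, which is why the tuning step is described as supervised.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_emb, train_lbl)

pred = probe.predict(test_emb)
print("balanced accuracy:", balanced_accuracy_score(test_lbl, pred))
```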


Re: whether slide-level diagnostic labels (e.g., TCGA oncotree codes) were used during CHIEF pretraining:

We used anatomical sites (e.g., BREAST, LUNG) rather than TCGA diagnostic labels (e.g., LUAD, LUSC). Also, we understand that BRCA subtyping was a significant part of UNI's evaluation, but it is not in CHIEF's report or claimed goals.

@Dadatata-JZ (Collaborator) commented Sep 13, 2024

(Quoting Tiancheng's follow-up above.)

Hi @HHHedo,

Great question! We never directly predicted the cancer type (e.g., BRCA, CRC, LUAD, etc., as Guillaume asked); the labels were positive or negative. That is, we performed a binary malignancy-detection classification for task 1 and predicted the tumor origin for task 2. In reality, we should always know the anatomical site from which the tissue was obtained; however, that site is not necessarily the tumor origin.

I hope this clears up your confusion. Pls lmk!

@HHHedo commented Sep 13, 2024

Hi @Dadatata-JZ,

Thanks for your kind help!

The following is my understanding. In the weakly-supervised pre-training, the text encoder encodes the anatomical site information, and the image aggregator produces the WSI-level features. After concatenating the text and image features, the pre-training task is to predict the WSI-level labels. The key concern here is what those labels are, and I think they are likely to be the cancer type. This is supported by Xiyue's answer:

  1. "We then used the pathology imaging features extracted by CHIEF to infer cancer types directly." Could you further explain it?
    Because the weights of our slide-level aggregator are trained for this cancer classification task

, but this seems to contradict your answer:

We never directly predicted the cancer type

, which confuses me again. Hopefully the pre-training code will be released. Many thanks!

Regarding the downstream tasks: since the downstream code has already been released, I noticed that the biomarker and cancer cell detection tasks use the text encoder, while the survival and tumor origin tasks use only the image encoder.
Could you please provide a more intuitive explanation of how to use CHIEF in the different downstream tasks?

thanks,
Tiancheng

@Dadatata-JZ (Collaborator) commented Sep 13, 2024

(Quoting Tiancheng's comment above.)

@HHHedo

Tiancheng, no worries at all. All good questions.

Sorry for confusing you. The two statements are not contradictory, because my response was referring to the fact that cancer type inference (e.g., BRCA, CRC, LUAD, etc., as Guillaume asked) was never one of the four major downstream tasks (i.e., cancer detection, tumor origin, molecular classification, survival prediction), while the Figure 1 caption, Xiyue's answer, and your understanding all relate to its use in pre-training, where the inference is positive versus negative (see Methods).

I may have misunderstood the context of your post: "If the slide-level diagnostic labels are used for weakly supervised pre-training, then the task is cancer classification, which makes sense to me that the slide-level aggregator can be directly used to infer cancer types."

For fine-tuning a specific downstream task, such as genetic profiling or prognostic predictions, where both internal and external validations focus on the same cancer type, incorporating text embeddings is unnecessary.

@HHHedo commented Sep 19, 2024

Hi @Dadatata-JZ @Xiyue-Wang ,
Thanks for your help!
After reading the other issues and SCL-WC, I still do not fully understand CHIEF.
The following is my understanding. Please help correct my misunderstandings.

Weakly-supervised pre-training:
The text encoder encodes the anatomical site information, and the image aggregator produces the WSI-level features. After concatenating the text and image features, the pre-training task is to use class-specific attentions and classifiers to predict the WSI-level labels (i.e., positive or negative). I think 19 class-specific attention modules and classifiers are used for pre-training, since the 60k+ training WSIs come from 19 anatomic sites.
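Since the pre-training code has not been released, the exact head structure is unknown; the following is only a rough PyTorch sketch of the objective as described in this thread (a slide embedding concatenated with an anatomical-site embedding, then a positive/negative classifier). Whether there are 19 site-specific attention branches is precisely the open question, so this sketch uses a single shared head with a site-conditioned input; all names and dimensions are assumptions, not the CHIEF implementation.

```python
import torch
import torch.nn as nn

NUM_SITES, TEXT_DIM, IMG_DIM = 19, 768, 768

class WeakSupervisionHead(nn.Module):
    """Rough sketch of the objective described in this thread, NOT the
    released CHIEF code: a slide embedding is concatenated with the
    embedding of its anatomical site and classified as positive
    (cancerous) vs negative (non-cancerous)."""
    def __init__(self):
        super().__init__()
        # Stand-in for text-encoder outputs: one learned vector per site.
        # In CHIEF this would come from a text encoder fed the site name.
        self.site_embeddings = nn.Embedding(NUM_SITES, TEXT_DIM)
        self.classifier = nn.Sequential(
            nn.Linear(IMG_DIM + TEXT_DIM, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, slide_embedding, site_id):
        # slide_embedding: (B, IMG_DIM) from the slide-level aggregator
        # site_id: (B,) integer index of the anatomical site
        text = self.site_embeddings(site_id)                  # (B, TEXT_DIM)
        fused = torch.cat([slide_embedding, text], dim=-1)    # (B, IMG_DIM + TEXT_DIM)
        return self.classifier(fused).squeeze(-1)             # (B,) logits

head = WeakSupervisionHead()
criterion = nn.BCEWithLogitsLoss()

slide_embedding = torch.randn(4, IMG_DIM)      # batch of 4 aggregated WSIs
site_id = torch.tensor([0, 3, 3, 7])           # e.g., BREAST, LUNG, LUNG, COLON
label = torch.tensor([1., 0., 1., 0.])         # positive / negative slide labels

loss = criterion(head(slide_embedding, site_id), label)
loss.backward()
```

A site-specific variant of the kind Tiancheng describes would instead hold an `nn.ModuleList` of 19 attention modules and classifiers and index it by `site_id`.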

Downstream tasks:
Based on Xiyue's answer in #23 and the released code:

For the downstream tasks (e.g., genomic profile or prognostic prediction), we employed fine-tuning methods. However, for the cancer detection task, we used CHIEF to directly infer from raw features without fine-tuning, applying CHIEF to 15 unseen datasets.

  1. Cancer detection: directly applying both the image and text encoders of CHIEF to 15 unseen datasets. I think these 15 datasets also belong to the 19 anatomic sites used in pre-training.
  2. Biomarker, survival, and tumor origin: fine-tuning only the image encoder, where these downstream datasets should also belong to the 19 pre-training anatomic sites.

So, whether directly inferring or fine-tuning, the key assumption is that the downstream datasets are limited to the pre-training anatomic sites. Am I right? Is there any way to make CHIEF extensible to cancer categories beyond the 19 types?
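To make the direct-inference regime in point 1 concrete, here is a tiny sketch continuing the assumptions of the earlier sketches: the pretrained aggregator and weak-supervision head are assumed to already hold pretrained weights and are applied to an unseen slide without any further training. None of these names come from the CHIEF repository.

```python
import torch

# Continuing the earlier sketches: assume `aggregator` (slide-level
# aggregator) and `head` (weak-supervision detection head) already hold
# pretrained weights. Direct inference = a forward pass only, no training.
aggregator.eval()
head.eval()

@torch.no_grad()
def detect_cancer(patch_features: torch.Tensor, site_id: torch.Tensor) -> float:
    """Apply the frozen pretrained model to one WSI from an unseen dataset."""
    slide_embedding = aggregator(patch_features).unsqueeze(0)   # (1, dim)
    prob = torch.sigmoid(head(slide_embedding, site_id))        # (1,)
    return prob.item()

# One unseen slide whose anatomical site (e.g., index 3) is among the
# 19 sites covered during pre-training -- the limitation raised above.
probability = detect_cancer(torch.randn(900, 768), torch.tensor([3]))
print(f"P(malignant) = {probability:.3f}")
```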

@Dadatata-JZ (Collaborator)

@HHHedo
Hi Tiancheng,

Happy to elaborate further on this. Opinions are my own.

In CHIEF, we’re working with 19 anatomic sites (just to clarify, histologically they should cover more than 19 distinct cancer types). For example, both lung adenocarcinoma and lung squamous cell carcinoma fall under the "Lung" category. Many other cancer types (e.g., leukemia) may originate from organs that CHIEF doesn't currently cover.

Histologically, cancers across different sites may display similar patterns of abnormal cell growth, invasion, and differentiation regardless of their anatomical origin, which hopefully can help CHIEF or other foundation models expand to uncovered sites and cancer types. However, this is still an open question. Further investigation is encouraged as more real-world data becomes available. We've already been running some trials; stay tuned!

Please feel free to reach out via email to brainstorm about designing experiments and models to answer these research questions, such as how foundation models can generalize to unseen sites. It will be interesting!

@amy-galon

@HHHedo Hi Tiancheng, it's unlikely that we will get a clear answer about the architecture and pretraining. It took the authors days, and people calling them out (#20 (comment), #23), to acknowledge that their method is largely similar to SCL-WC with text-embedding contrasting, and recent evidence suggests that such complex pre-training is rather useless (#24) even on the anatomic sites used for training. We will extend this analysis to every tissue type included in CHIEF and all of their tasks; others are welcome to do the same, and we are certain they will reach the same conclusion. The only real way to fully introspect how they pre-trained their model would be to look at the training code and investigate its reproducibility, which the authors have not released even though they indicated they would in the Nature article: "The source codes for CHIEF are available at https://github.com/hms-dbmi/CHIEF." (https://www.nature.com/articles/s41586-024-07894-z#code-availability)
