diff --git a/README.md b/README.md
index adf9bc4..1b76ca2 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,12 @@
 # HEST-Library: Bringing Spatial Transcriptomics and Histopathology together
 ## Designed for querying and assembling HEST-1k dataset
-\[ [arXiv](https://arxiv.org/abs/2406.16192) | [Download](https://huggingface.co/datasets/MahmoodLab/hest) | [Documentation](https://hest.readthedocs.io/en/latest/) | [Tutorials](https://github.com/mahmoodlab/HEST/tree/main/tutorials) \]
+\[ [arXiv](https://arxiv.org/abs/2406.16192) | [Data](https://huggingface.co/datasets/MahmoodLab/hest) | [Documentation](https://hest.readthedocs.io/en/latest/) | [Tutorials](https://github.com/mahmoodlab/HEST/tree/main/tutorials) | [Cite](https://github.com/mahmoodlab/hest?tab=readme-ov-file#citation) \]
-
-
 Welcome to the official GitHub repository of the HEST-Library introduced in *"HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis", NeurIPS Spotlight, 2024*. This project was developed by the [Mahmood Lab](https://faisal.ai/) at Harvard Medical School and Brigham and Women's Hospital.
-HEST-1k, HEST-Library, and HEST-Benchmark are released under the Attribution-NonCommercial-ShareAlike 4.0 International license.
+
@@ -17,6 +15,8 @@ HEST-1k, HEST-Library, and HEST-Benchmark are released under the Attribution-Non
 - **HEST-Library:** A series of helpers to assemble new ST samples (ST, Visium, Visium HD, Xenium) and work with HEST-1k (ST analysis, batch effect viz and correction, etc.)
 - **HEST-Benchmark:** A new benchmark to assess the predictive performance of foundation models for histology in predicting gene expression from morphology
+HEST-1k, HEST-Library, and HEST-Benchmark are released under the Attribution-NonCommercial-ShareAlike 4.0 International license.
+
 ## Updates
@@ -85,27 +85,28 @@ In addition, we provide complete [documentation](https://hest.readthedocs.io/en/
 ## HEST-Benchmark
-The HEST-Benchmark was designed to assess foundation models for pathology under a new, diverse, and challenging benchmark. HEST-Benchmark includes 10 tasks for gene expression prediction (50 highly variable genes) from morphology (112 x 112 um regions at 0.5 um/px) in 10 different organs and 9 cancer types. We provide a step-by-step tutorial to run HEST-Benchmark and reproduce our results in [4-Running-HEST-Benchmark.ipynb](https://github.com/mahmoodlab/HEST/tree/main/tutorials/4-Running-HEST-Benchmark.ipynb).
+The HEST-Benchmark was designed to assess 11 foundation models for pathology under a new, diverse, and challenging benchmark. HEST-Benchmark includes nine tasks for gene expression prediction (50 highly variable genes) from morphology (112 x 112 um regions at 0.5 um/px) in nine different organs and eight cancer types. We provide a step-by-step tutorial to run HEST-Benchmark and reproduce our results in [4-Running-HEST-Benchmark.ipynb](https://github.com/mahmoodlab/HEST/tree/main/tutorials/4-Running-HEST-Benchmark.ipynb).
 ### HEST-Benchmark results (08.30.24)
-HEST-Benchmark was used to assess 10 publicly available models.
+HEST-Benchmark was used to assess 11 publicly available models.
 Reported results are based on a Ridge Regression with PCA (256 factors). Ridge regression unfairly penalizes models with larger embedding dimensions. To ensure fair and objective comparison between models, we opted for PCA-reduction. Model performance measured with Pearson correlation. Best is **bold**, second best is _underlined_. Additional results based on Random Forest and XGBoost regression are provided in the paper.
-| **Dataset** | **[Hoptimus0](https://github.com/bioptimus/releases/blob/main/models/h-optimus/v0/LICENSE.md)** | **[Virchow2](https://huggingface.co/paige-ai/Virchow2)** | **[Virchow](https://huggingface.co/paige-ai/Virchow)** | **[UNI](https://huggingface.co/MahmoodLab/UNI)** | **[Gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)** | **[CONCH](https://huggingface.co/MahmoodLab/CONCH)** | **[Phikon](https://huggingface.co/owkin/phikon)** | **[Remedis](https://arxiv.org/abs/2205.09723)** | **[CTransPath](https://www.sciencedirect.com/science/article/abs/pii/S1361841522002043)** | **[Resnet50](https://arxiv.org/abs/1512.03385)** | **[Plip](https://www.nature.com/articles/s41591-023-02504-3)** |
-|:--------------|----------------:|---------------:|--------------:|-------------:|---------------:|---------------:|-------------:|--------------:|-----------------:|---------------:|-----------:|
-| **IDC** | **0.5988** | 0.5903 | 0.5725 | 0.5718 | 0.5505 | 0.5363 | 0.5327 | 0.5304 | 0.511 | 0.4732 | 0.4717 |
-| **PRAD** | 0.3768 | 0.3478 | 0.3341 | 0.3095 | **0.3776** | 0.3548 | 0.342 | 0.3531 | 0.3427 | 0.306 | 0.2819 |
-| **PAAD** | **0.4936** | 0.4716 | 0.4926 | 0.478 | 0.476 | 0.4475 | 0.4441 | 0.4647 | 0.4378 | 0.386 | 0.4099 |
-| **SKCM** | **0.6521** | 0.613 | 0.6056 | 0.6344 | 0.5607 | 0.5784 | 0.5334 | 0.5816 | 0.5103 | 0.4825 | 0.5117 |
-| **COAD** | 0.3054 | 0.252 | **0.3115** | 0.2876 | 0.2595 | 0.2579 | 0.2573 | 0.2528 | 0.249 | 0.231 | 0.0518 |
-| **READ** | **0.2209** | 0.2109 | 0.1999 | 0.1822 | 0.1888 | 0.1617 | 0.1631 | 0.1216 | 0.1131 | 0.0842 | 0.0927 |
-| **CCRCC** | 0.2717 | **0.275** | 0.2638 | 0.2402 | 0.2436 | 0.2179 | 0.2423 | 0.2643 | 0.2279 | 0.218 | 0.1902 |
-| **LUNG** | **0.5605** | 0.5554 | 0.5433 | 0.5499 | 0.5412 | 0.5317 | 0.5522 | 0.538 | 0.5049 | 0.4919 | 0.4838 |
-| **LYMPH_IDC** | 0.2578 | **0.2598** | 0.2582 | 0.2537 | 0.2491 | 0.2507 | 0.2373 | 0.2465 | 0.2354 | 0.2284 | 0.2382 |
-| **AVG** | **0.4153** | 0.3973 | 0.3979 | 0.3897 | 0.383 | 0.3708 | 0.3672 | 0.3726 | 0.348 | 0.3224 | 0.3035 |
+| Model | IDC | PRAD | PAAD | SKCM | COAD | READ | ccRCC | LUAD | LYMPH IDC | Average |
+|------------------------|--------|--------|--------|--------|--------|--------|--------|--------|-----------|---------|
+| **[Resnet50](https://arxiv.org/abs/1512.03385)** | 0.4741 | 0.3075 | 0.3889 | 0.4822 | 0.2528 | 0.0812 | 0.2231 | 0.4917 | 0.2322 | 0.326 |
+| **[CTransPath](https://www.sciencedirect.com/science/article/abs/pii/S1361841522002043)** | 0.511 | 0.3427 | 0.4378 | 0.5106 | 0.2285 | 0.11 | 0.2279 | 0.4985 | 0.2353 | 0.3447 |
+| **[Phikon](https://huggingface.co/owkin/phikon)** | 0.5327 | 0.342 | 0.4432 | 0.5355 | 0.2585 | 0.1517 | 0.2423 | 0.5468 | 0.2373 | 0.3656 |
+| **[CONCH](https://huggingface.co/MahmoodLab/CONCH)** | 0.5363 | 0.3548 | 0.4475 | 0.5791 | 0.2533 | 0.1674 | 0.2179 | 0.5312 | 0.2507 | 0.3709 |
+| **[Remedis](https://arxiv.org/abs/2205.09723)** | 0.529 | 0.3471 | 0.4644 | 0.5818 | 0.2856 | 0.1145 | 0.2647 | 0.5336 | 0.2473 | 0.3742 |
+| **[Gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)** | 0.5508 | _0.3708_ | 0.4768 | 0.5538 | _0.301_ | 0.186 | 0.2391 | 0.5399 | 0.2493 | 0.3853 |
+| **[UNI](https://huggingface.co/MahmoodLab/UNI)** | 0.5702 | 0.314 | 0.4764 | 0.6254 | 0.263 | 0.1762 | 0.2427 | 0.5511 | 0.2565 | 0.3862 |
+| **[Virchow](https://huggingface.co/paige-ai/Virchow)** | 0.5702 | 0.3309 | 0.4875 | 0.6088 | **0.311** | 0.2019 | 0.2637 | 0.5459 | 0.2594 | 0.3977 |
+| **[Virchow2](https://huggingface.co/paige-ai/Virchow2)** | 0.5922 | 0.3465 | 0.4661 | 0.6174 | 0.2578 | 0.2084 | **0.2788** | **0.5605** | 0.2582 | 0.3984 |
+| **UNIv1.5** | **0.5989** | 0.3645 | _0.4902_ | _0.6401_ | 0.2925 | _0.2240_ | 0.2522 | _0.5586_ | **0.2597** | _0.4090_ |
+| **[Hoptimus0](https://github.com/bioptimus/releases/blob/main/models/h-optimus/v0/LICENSE.md)** | _0.5982_ | **0.385** | **0.4932** | **0.6432** | 0.2991 | **0.2292** | _0.2654_ | 0.5582 | _0.2595_ | **0.4146** |
 ### Benchmarking your own model
@@ -122,6 +123,9 @@ Our tutorial in [4-Running-HEST-Benchmark.ipynb](https://github.com/mahmoodlab/H
 ## Citation
 If you find our work useful in your research, please consider citing:
+
+Jaume, G., Doucet, P., Song, A. H., Lu, M. Y., Almagro-Perez, C., Wagner, S. J., Vaidya, A. J., Chen, R. J., Williamson, D. F. K., Kim, A., & Mahmood, F. HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis. _Advances in Neural Information Processing Systems_, December 2024.
+
 ```
 @inproceedings{jaume2024hest,
     author = {Guillaume Jaume and Paul Doucet and Andrew H. Song and Ming Y. Lu and Cristina Almagro-Perez and Sophia J. Wagner and Anurag J. Vaidya and Richard J. Chen and Drew F. K. Williamson and Ahrong Kim and Faisal Mahmood},
diff --git a/figures/fig1a.jpeg b/figures/fig1a.jpeg
index bd5c6e1..aeecfab 100644
Binary files a/figures/fig1a.jpeg and b/figures/fig1a.jpeg differ
diff --git a/tutorials/README.md b/tutorials/README.md
new file mode 100644
index 0000000..871a3c4
--- /dev/null
+++ b/tutorials/README.md
@@ -0,0 +1,28 @@
+# HEST-1k Tutorials
+
+Welcome to the HEST-1k tutorial repository! This set of tutorials provides a step-by-step guide to working with HEST-1k and the HEST-Library.
+
+## Tutorials
+
+### 1. Downloading HEST-1k.ipynb
+This notebook guides you through downloading the HEST-1k dataset using HuggingFace. It includes details on dataset structure and requirements.
+
+### 2. Interacting with HEST-1k.ipynb
+Learn how to load and explore HEST-1k data. This notebook introduces tools for inspecting data contents, exploring sample images, and performing initial analyses to understand dataset attributes.
+
+### 3. Assembling HEST Data.ipynb
+This tutorial provides instructions on assembling HEST data from raw files into a structured format ready for analysis.
+
+### 4. Running HEST Benchmark.ipynb
+Run benchmarks on the HEST-1k dataset.
+
+### 5. Batch-effect visualization.ipynb
+This notebook is dedicated to visualizing batch effects within the HEST-1k dataset. It covers methods to identify, understand, and mitigate batch effects.
+
+---
+
+## Contributions
+
+External contributions are welcome! If you have ideas for improving these tutorials or would like to contribute, please feel free to reach out to [gjaume@bwh.harvard.edu](mailto:gjaume@bwh.harvard.edu).
+
+If you encounter any issues, please check the GitHub Issues section, as other users might have already faced similar challenges.
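
Review note: the README text touched by this diff states that model performance is measured with Pearson correlation and adds an `Average` column to the results table. The sketch below shows one way such a score could be computed and how the `Average` value for Hoptimus0 follows from its nine per-task scores. It is an illustration, not the HEST-Library implementation: the helper names `pearson` and `mean_gene_pearson` are hypothetical, and averaging the correlation over genes is an assumption that the diff does not spell out.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mean_gene_pearson(y_true, y_pred):
    """Average the per-gene Pearson correlation.

    Rows are patches and columns are genes. Averaging over genes is an
    assumption made for this sketch.
    """
    n_genes = len(y_true[0])
    scores = [
        pearson([row[g] for row in y_true], [row[g] for row in y_pred])
        for g in range(n_genes)
    ]
    return sum(scores) / n_genes


# Per-task scores for Hoptimus0, copied from the updated table in the diff
# (IDC, PRAD, PAAD, SKCM, COAD, READ, ccRCC, LUAD, LYMPH IDC).
hoptimus0 = [0.5982, 0.385, 0.4932, 0.6432, 0.2991, 0.2292, 0.2654, 0.5582, 0.2595]
average = round(sum(hoptimus0) / len(hoptimus0), 4)
print(average)  # 0.4146, matching the Average column for Hoptimus0
```

The same arithmetic applied to the UNIv1.5 row gives 0.4090, which also agrees with the table, so the `Average` column appears to be a plain mean over the nine tasks.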