debugging docs

jkobject · Aug 7, 2024 · commit a04b4ad
1 parent 6929566
Showing 4 changed files with 150 additions and 43 deletions.
diff --git a/README.md b/README.md
@@ -10,7 +10,7 @@
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![DOI](https://zenodo.org/badge/391909874.svg)]()
 
-![logo](logo.png)
+![logo](docs/logo.png)
 
 scPRINT is a large transformer model built for the inference of gene networks (connections between genes explaining the cell's expression profile) from scRNAseq data.
 
@@ -25,7 +25,28 @@ scPRINT can be used to perform the following analyses:
 
 [Read the paper!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT.
 
-![figure1](figure1.png)
+![figure1](docs/figure1.png)
+
+## Table of Contents
+
+- [scPRINT: Large Cell Model for scRNAseq data](#scprint-large-cell-model-for-scrnaseq-data)
+  - [Table of Contents](#table-of-contents)
+  - [Install `scPRINT`](#install-scprint)
+    - [lamin.ai](#laminai)
+  - [Usage](#usage)
+    - [scPRINT's basic commands](#scprints-basic-commands)
+    - [Notes on GPU/CPU usage with triton](#notes-on-gpucpu-usage-with-triton)
+    - [I want to generate gene networks from scRNAseq data:](#i-want-to-generate-gene-networks-from-scrnaseq-data)
+    - [I want to generate cell embeddings and cell label predictions from scRNAseq data:](#i-want-to-generate-cell-embeddings-and-cell-label-predictions-from-scrnaseq-data)
+    - [I want to denoising my scRNAseq dataset:](#i-want-to-denoising-my-scrnaseq-dataset)
+    - [I want to generate an atlas-level embedding](#i-want-to-generate-an-atlas-level-embedding)
+    - [I need to generate gene tokens using pLLMs](#i-need-to-generate-gene-tokens-using-pllms)
+    - [I want to pre-train scPRINT from scratch on my own data](#i-want-to-pre-train-scprint-from-scratch-on-my-own-data)
+    - [Documentation](#documentation)
+    - [Model Weights](#model-weights)
+  - [Development](#development)
+  - [Work in progress:](#work-in-progress)
+
 
 ## Install `scPRINT`
 
@@ -109,7 +130,7 @@ We now explore the different usages of scPRINT:
 
 -> Refer to the section . gene network inference in [this notebook](./docs/notebooks/cancer_usecase.ipynb#).
 
--> More examples in this notebook [./notebooks/assessments/bench_omni.ipynb](./notebooks/assessments/bench_omni.ipynb).
+-> More examples in this notebook [./notebooks/assessments/bench_omni.ipynb](./notebooks/bench_omni.ipynb).
 
 ### I want to generate cell embeddings and cell label predictions from scRNAseq data:
 
@@ -119,7 +140,7 @@ We now explore the different usages of scPRINT:
 
 -> Refer to the Denoising of B-cell section in [this notebook](./docs/notebooks/cancer_usecase.ipynb).
 
--> More example in our benchmark notebook [./notebooks/assessments/bench_denoising.ipynb](./notebooks/assessments/bench_denoising.ipynb).
+-> More examples in our benchmark notebook [./notebooks/assessments/bench_denoising.ipynb](./notebooks/bench_denoising.ipynb).
 
 ### I want to generate an atlas-level embedding
 
@@ -129,7 +150,7 @@ We now explore the different usages of scPRINT:
 
 To run scPRINT, you can use the option to define the gene tokens using protein language model embeddings of genes. This is done by providing the path to a parquet file of the precomputed set of embeddings for each gene name to scPRINT via "precpt_gene_emb"
 
--> To generate this file please refer to the notebook [generate_gene_embeddings](docs/notebooks/generate_gene_embeddings.ipynb).
+-> To generate this file please refer to the notebook [generate_gene_embeddings](notebooks/generate_gene_embeddings.ipynb).
 
 ### I want to pre-train scPRINT from scratch on my own data
 

diff --git a/figure1.png → docs/figure1.png b/figure1.png → docs/figure1.png
diff --git a/docs/index.md b/docs/index.md
@@ -1,80 +1,166 @@
 
-# scprint
+# scPRINT: Large Cell Model for scRNAseq data
 
-[![codecov](https://codecov.io/gh/jkobject/scPRINT/branch/main/graph/badge.svg?token=scPRINT_token_here)](https://codecov.io/gh/jkobject/scPRINT)
-[![CI](https://github.com/jkobject/scPRINT/actions/workflows/main.yml/badge.svg)](https://github.com/jkobject/scPRINT/actions/workflows/main.yml)
+[![PyPI version](https://badge.fury.io/py/scprint.svg)](https://badge.fury.io/py/scprint)
+[![Documentation Status](https://readthedocs.org/projects/scprint/badge/?version=latest)](https://scprint.readthedocs.io/en/latest/?badge=latest)
+[![Downloads](https://pepy.tech/badge/scprint)](https://pepy.tech/project/scprint)
+[![Downloads](https://pepy.tech/badge/scprint/month)](https://pepy.tech/project/scprint)
+[![Downloads](https://pepy.tech/badge/scprint/week)](https://pepy.tech/project/scprint)
+[![GitHub issues](https://img.shields.io/github/issues/jkobject/scPRINT)](https://img.shields.io/github/issues/jkobject/scPRINT)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![DOI](https://zenodo.org/badge/391909874.svg)]()
 
-Awesome Large Transcriptional Model created by Jeremie Kalfon
+![logo](logo.png)
 
-scprint = single cell pretrained regulation inference neural network from transcripts
+scPRINT is a large transformer model built for the inference of gene networks (connections between genes explaining the cell's expression profile) from scRNAseq data.
 
-using: 
+It uses novel encoding and decoding of the cell expression profile and new pre-training methodologies to learn a cell model.
 
+scPRINT can be used to perform the following analyses:
 
-## Install it from PyPI
+- __expression denoising__: increase the resolution of your scRNAseq data
+- __cell embedding__: generate a low-dimensional representation of your dataset
+- __label prediction__: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
+- __gene network inference__: generate a gene network from any cell or cell cluster in your scRNAseq dataset
 
-first have a good version of pytorch installed
+[Read the paper!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT.
 
-you might need to make it match your cuda version etc..
+![figure1](figure1.png)
 
-We only support torch>=2.0.0
 
-then install laminDB
+## Install `scPRINT`
 
-```bash
-pip install 'lamindb[jupyter,bionty]'
+scPRINT has so far been tested on macOS and Linux (Ubuntu 20.04) with Python 3.10.
+
+If you want to use flashattention2, note that it only supports triton 2.0 (the MLIR version) and torch==2.0.0 for now.
+
+```bash
+conda create -n "[whatever]" python==3.10
+git clone https://github.com/jkobject/scPRINT
+# one of
+pip install scPRINT # OR
+pip install scPRINT[dev] # for the dev dependencies (building etc.) AND/OR [dev,flash]
+# to use flashattention2 you need triton 2.0.0.dev20221202 specifically (we are working on removing this dependency)
+# only if you have a compatible GPU (e.g. not available for Apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
+pip install scPRINT[flash] && pip install -e "git+https://github.com/triton-lang/triton.git@legacy-backend#egg=triton&subdirectory=python"
 ```
 
-then install scPrint
+We make use of some additional packages we developed alongside scPRINT.
 
-```bash
-pip install scprint
+Please refer to their documentation for more information:
 
-I had to install a specific version of pytorch, torchaudio, torchtext.. for my cuda version.
-My cuda compiler nvcc compiles cuda 11.7. my cuda-smi (api) is currently 12.1.
+- [scDataLoader](https://github.com/jkobject/scDataLoader): a dataloader for training large cell models.
+- [GRnnData](https://github.com/cantinilab/GRnnData): a package to work with gene networks from single cell data.
+- [benGRN](https://github.com/jkobject/benGRN): a package to benchmark gene network inference methods from single cell data.
 
-Please install all of it for your cuda version and it should still work.
+### lamin.ai
 
-for more information on this, please see [installation.md](installation.md).
-```
+⚠️ If you want to use scDataLoader's multi-dataset mode, preprocess datasets, or use some other functions of the model, you will need to use lamin.ai.
+
+In that case, sign in to [lamin.ai](https://lamin.ai/login) with Google or GitHub, then be sure to log in before running anything (or before starting a notebook): `lamin login <email> --key <API-key>`. Follow the instructions on [their website](https://docs.lamin.ai/guide).
 
 ## Usage
 
+### scPRINT's basic commands
+
+This is the most minimal example of how scPRINT works:
+
 ```py
 from lightning.pytorch import Trainer
 from scprint import scPrint
 from scdataloader import DataModule
 
-...
+datamodule = DataModule(...)
 model = scPrint(...)
+# to train / fit / test the model
 trainer = Trainer(...)
 trainer.fit(model, datamodule=datamodule)
+# to run predictions, use Denoiser, Embedder, or GNInfer
+denoiser = Denoiser(...)
+adata = sc.read_h5ad(...)
+denoiser(model, adata=adata)
+...
 ```
 
+Or, from a bash command line:
+
 ```bash
-$ python -m scPrint/__main__.py
-#or
-$ scprint fit/train/predict/test
+$ scprint fit/train/predict/test/denoise/embed/gninfer --config config/[medium|large|vlarge] ...
+```
+
+Find out more about the commands by running `scprint --help` or `scprint [command] --help`.
+
+More examples of using the command line are available in the [docs](./docs/usage.md).
+
+### Notes on GPU/CPU usage with triton
+
+If you do not have [triton](https://triton-lang.org/main/python-api/triton.html) installed you will not be able to take advantage of GPU acceleration, but you can still use the model on the CPU.
+
+In that case, if loading from a checkpoint that was trained with flashattention, you will need to specify `transformer="normal"` in the `load_from_checkpoint` function like so:
+
+```python
+model = scPrint.load_from_checkpoint(
+    '../data/temp/last.ckpt', precpt_gene_emb=None,
+    transformer="normal")
 ```
 
-for more information on usage please see the documentation in https://jkobject.com/scPrint
+We now explore the different usages of scPRINT:
+
+### I want to generate gene networks from scRNAseq data:
+
+-> Refer to the gene network inference section in [this notebook](./notebooks/cancer_usecase.ipynb#).
+
+-> More examples in this notebook [notebooks/assessments/bench_omni.ipynb](../notebooks/bench_omni.ipynb).
+
+### I want to generate cell embeddings and cell label predictions from scRNAseq data:
+
+-> Refer to the embeddings and cell annotations section in [this notebook](./notebooks/cancer_usecase.ipynb#).
+
+### I want to denoise my scRNAseq dataset:
+
+-> Refer to the Denoising of B-cell section in [this notebook](./notebooks/cancer_usecase.ipynb).
+
+-> More examples in our benchmark notebook [notebooks/assessments/bench_denoising.ipynb](../notebooks/bench_denoising.ipynb).
+
+### I want to generate an atlas-level embedding
+
+-> Refer to the notebook [figures/nice_umap.ipynb](../figures/nice_umap.ipynb).
+
+### I need to generate gene tokens using pLLMs
+
+When running scPRINT, you can optionally define the gene tokens using protein language model embeddings of genes. To do so, provide the path to a parquet file containing the precomputed embeddings for each gene name via the `precpt_gene_emb` argument.
+
+-> To generate this file please refer to the notebook [notebooks/generate_gene_embeddings.ipynb](../notebooks/generate_gene_embeddings.ipynb).
+
+### I want to pre-train scPRINT from scratch on my own data
+
+-> Refer to the documentation page [pretrain scprint](pretrain.md)
+
+### Documentation
+
+For more information on usage please see the documentation in [https://www.jkobject.com/scPrint/](https://www.jkobject.com/scPrint/)
+
+### Model Weights
+
+Model weights are available on [Hugging Face](https://huggingface.co/jkobject/scPRINT/).
 
 ## Development
 
 Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
 
-### What is included?
+Read the [training runs](https://wandb.ai/ml4ig/scprint_scale/reports/scPRINT-trainings--Vmlldzo4ODIxMjgx?accessToken=80metwx7b08hhourotpskdyaxiflq700xzmzymr6scvkp69agybt79l341tv68hp) document to learn more about how pre-training was performed and its behavior.
 
-- 📃 Documentation structure using [mkdocs](http://www.mkdocs.org)
-- 🧪 Testing structure using [pytest](https://docs.pytest.org/en/latest/)
-  If you want [codecov](https://about.codecov.io/sign-up/) Reports and Automatic Release to [PyPI](https://pypi.org)  
-  On the new repository `settings->secrets` add your `PYPI_API_TOKEN` and `CODECOV_TOKEN` (get the tokens on respective websites)
-- ✅ Code linting using [flake8](https://flake8.pycqa.org/en/latest/)
-- 📊 Code coverage reports using [codecov](https://about.codecov.io/sign-up/)
-- 🛳️ Automatic release to [PyPI](https://pypi.org) using [twine](https://twine.readthedocs.io/en/latest/) and github actions.
+Acknowledgement:
+- [python template](https://github.com/rochacbruno/python-project-template)
+- [laminDB](https://lamin.ai/)
+- [lightning](https://lightning.ai/)
 
+## Work in progress:
 
-acknowledgement:
-[python template](https://github.com/rochacbruno/python-project-template)
-[scGPT]()
-[laminDB]()
+1. remove the triton dependencies
+2. add version with additional labels (tissues, age) and organisms (mouse, zebrafish) and more datasets from cellxgene
+3. version with separate transformer blocks for the encoding part of the bottleneck learning and for the cell embeddings
+4. improve classifier to output uncertainties and topK predictions when unsure
+5. 
+
+Awesome Large Cell Model created by Jeremie Kalfon.
diff --git a/logo.png → docs/logo.png b/logo.png → docs/logo.png
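
Note on the relative links touched by this commit: several paths were rewritten after `logo.png` and `figure1.png` moved into `docs/`, and a few link texts no longer match their targets (for example `./notebooks/assessments/bench_omni.ipynb` now points at `./notebooks/bench_omni.ipynb`). The sketch below is not part of the commit. It is one possible way to check that relative Markdown links resolve to existing files. The names `find_broken_links` and `demo` are hypothetical, the check only covers inline `[text](target)` links, and it resolves targets against a directory on disk, which may differ from how the documentation site resolves them.

```python
import re
import tempfile
from pathlib import Path

# Matches inline Markdown links and images: [text](target) or ![alt](target).
LINK_RE = re.compile(r"!?\[[^\]]*\]\(([^)\s]*)\)")


def find_broken_links(markdown_text: str, base_dir: Path) -> list[str]:
    """Return relative link targets that do not exist under base_dir."""
    broken = []
    for target in LINK_RE.findall(markdown_text):
        # Skip external URLs, mail links, pure in-page anchors and empty targets.
        if not target or target.startswith(("http://", "https://", "mailto:", "#")):
            continue
        path_part = target.split("#", 1)[0]  # drop any trailing "#anchor"
        if not (base_dir / path_part).exists():
            broken.append(target)
    return broken


def demo() -> list[str]:
    # Build a throwaway tree that mimics the move of logo.png into docs/.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "docs").mkdir()
        (root / "docs" / "logo.png").write_bytes(b"")
        text = (
            "![logo](logo.png)\n"        # old path, no longer present at the root
            "![logo](docs/logo.png)\n"   # new path introduced by this commit
            "[paper](https://www.biorxiv.org/)\n"  # external link, ignored
        )
        return find_broken_links(text, root)


if __name__ == "__main__":
    print(demo())
```

To apply it to a real checkout, one could call `find_broken_links(Path("README.md").read_text(), Path("."))` for the README and pass `Path("docs")` as the base directory for `docs/index.md`, since links in that file are relative to `docs/`.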