feat(external): implement METHYLANVI for scBS-seq data #3066

Open
wants to merge 40 commits into
base: main
Changes from 39 commits
4887899
Retrieve methylation levels for specified context with MethylVI
Nov 21, 2024
4a40526
Add test
Nov 21, 2024
6142b93
Initial MethylANVI commit
Dec 3, 2024
445ecab
Merge branch 'main' into external/methylanvi
Dec 3, 2024
d038b67
Fix formatting
Dec 3, 2024
24c8f99
MethylANVI docs
Dec 4, 2024
3ab3f41
Doc fixes
Dec 4, 2024
5309fbc
Fix test, factor out reconstruction loss
Dec 4, 2024
d4a12e7
Fix test (again)
Dec 4, 2024
fa43d6b
Fix methylanVI test
Dec 4, 2024
b19fa17
Add MuData labels field
Dec 4, 2024
ceb880a
Fix test
Dec 4, 2024
83a7e60
Mixin for Semisupervised training
Dec 4, 2024
19415c4
Fix import
Dec 4, 2024
3f13ce4
Fix tests
Dec 4, 2024
52da5f0
Fix scANVI modality key handling
Dec 4, 2024
b7925cf
Refactor getting mod key
Dec 4, 2024
23af079
BSSeq Mixin
Dec 4, 2024
f5f49c1
Update test
Dec 4, 2024
75bb701
Merge branch 'main' into external/methylanvi
ori-kron-wis Dec 22, 2024
6a97164
Merge branch 'main' into external/methylanvi
ori-kron-wis Dec 22, 2024
f7ceb5d
Merge branch 'main' into external/methylanvi
ori-kron-wis Dec 24, 2024
921888a
Merge branch 'main' into external/methylanvi
ori-kron-wis Dec 24, 2024
3e25916
Merge branch 'main' into external/methylanvi
ori-kron-wis Dec 24, 2024
c069a62
Merge branch 'main' into external/methylanvi
ori-kron-wis Dec 31, 2024
a1c8ee4
Merge branch 'main' into external/methylanvi
ori-kron-wis Dec 31, 2024
49a96a4
Merge branch 'main' into external/methylanvi
ethanweinberger Jan 3, 2025
729cf77
Split MethylVI/MethylANVI models/modules into separate files
Jan 3, 2025
6373970
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 3, 2025
8298ec5
Remove classifier logits check
Jan 3, 2025
c1f0f43
Fix description of METHYLANVAE
Jan 3, 2025
3d0cef3
Revert SemisupervisedTrainingMixin
Jan 3, 2025
8716574
Revert changes to `_training_mixin.py` file
Jan 3, 2025
734fe93
Remove outdated import
Jan 3, 2025
df61c91
Revert scANVI changes
Jan 3, 2025
1ac0a6c
Remove erroneous comment
Jan 5, 2025
1057b6d
Add back in SemisupervisedMixin for MethylANVI
Jan 5, 2025
26a2d01
Classify function for mixin
Jan 6, 2025
3ef7032
Classify function
Jan 6, 2025
074370a
Merge branch 'main' into external/methylanvi
ori-kron-wis Jan 8, 2025
3 changes: 3 additions & 0 deletions docs/user_guide/index.md
@@ -82,6 +82,9 @@ scvi-tools is composed of models that can perform one or many analysis tasks. In
* - :doc:`/user_guide/models/methylvi`
- Dimensionality reduction, removal of unwanted variation, integration across replicates, donors, and technologies, differential methylation, imputation, normalization of other cell- and sample-level confounding factors
- :cite:p:`Weinberger2023a`
* - :doc:`/user_guide/models/methylanvi`
- MethylVI tasks along with cell type label transfer from reference, seed labeling
- :cite:p:`Weinberger2023a`
```

## Multimodal analysis
145 changes: 145 additions & 0 deletions docs/user_guide/models/methylanvi.md
@@ -0,0 +1,145 @@
# MethylANVI

**MethylANVI** [^ref1] (Python class {class}`~scvi.external.METHYLANVI`) is a semi-supervised generative model of scBS-seq data.
Similar to how scANVI extends scVI, MethylANVI can be treated as an extension of MethylVI that leverages cell type annotations
available for a subset of the cells in a dataset to infer the states of the remaining cells.

The advantages of MethylANVI are:

- Comprehensive in capabilities: performs the MethylVI tasks along with cell type label transfer and seed labeling.
- Scalable to very large datasets (>1 million cells).

The limitations of MethylANVI include:

- Effectively requires a GPU for fast inference.
- Latent space is not interpretable, unlike that of a linear method.
- May not scale to a very large number of cell types.

```{topic} Tutorials:

- Work in progress.
```

## Preliminaries

MethylANVI takes as input scBS-seq count matrices representing methylation measurements aggregated over pre-defined
regions of interest (e.g. gene bodies, known regulatory regions, etc.). Depending on the system being investigated,
such measurements may be separated based on methylation context (e.g. CpG methylation versus non-CpG methylation).
For each context, MethylANVI accepts two count matrices as input: $Y^{C}_{mc}$ and $Y^{C}_{cov}$. Here $C$ refers to
an arbitrary methylation context, and each of these matrices has data from $N$ cells and $M$ genomic regions.
Each entry in $Y^{C}_{cov}$ represents the _total_ number of cytosines profiled at a given region in a cell, while the
entries in $Y^{C}_{mc}$ denote the number of _methylated_ cytosines in a region for a cell.
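
As a rough illustration of this input format, the sketch below builds a toy pair of matrices for a single context and checks that no methylated count exceeds its coverage. The values and the helper `methylation_fractions` are made up for this example and are not part of scvi-tools.

```python
# Toy data for one methylation context: N = 3 cells, M = 4 genomic regions.
# All values are invented for illustration only.
y_cov = [
    [10, 0, 7, 3],
    [5, 12, 0, 8],
    [9, 4, 6, 0],
]  # total cytosines profiled per (cell, region)
y_mc = [
    [8, 0, 1, 3],
    [0, 11, 0, 2],
    [9, 1, 3, 0],
]  # methylated cytosines per (cell, region)


def methylation_fractions(mc, cov):
    """Return per-entry methylation fractions, or None where coverage is zero."""
    fractions = []
    for mc_row, cov_row in zip(mc, cov):
        row = []
        for m, c in zip(mc_row, cov_row):
            if m > c:
                raise ValueError("methylated count exceeds coverage")
            row.append(m / c if c > 0 else None)
        fractions.append(row)
    return fractions


fractions = methylation_fractions(y_mc, y_cov)
```

Regions with zero coverage carry no information about methylation in that cell, which is why the sketch returns `None` there rather than a fraction.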

In addition to methylation measurements, MethylANVI takes as input a vector of partially observed cell-type labels $\mathbf{l}$,
where each observed label is one of $L$ cell types. Additionally, a vector of categorical covariates $S$ (with per-cell
entries $s_i$), representing batch, donor, etc., is an optional input to the model.
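
A minimal sketch of what "partially observed" labels can look like in practice: cells without an annotation receive a placeholder category. The placeholder name `"unknown"` and the cell identifiers are assumptions chosen for this example.

```python
UNLABELED = "unknown"  # hypothetical placeholder for cells without an annotation

cell_ids = ["cell_0", "cell_1", "cell_2", "cell_3"]
known_labels = {"cell_0": "neuron", "cell_3": "astrocyte"}  # seed annotations

# One label per cell; unannotated cells fall back to the placeholder.
labels = [known_labels.get(cell, UNLABELED) for cell in cell_ids]

# L counts only the real cell types, not the placeholder.
cell_types = sorted(set(known_labels.values()))
n_cell_types = len(cell_types)
```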

## Generative process

MethylANVI posits that the observed number of methylated cytosines in context $C$ for cell $i$ in region $j$,
$y^{C}_{ij}$, is generated by the following process:

```{math}
:nowrap: true

\begin{align}
l_i &\sim \text{Categorical}(1/L, \ldots, 1/L) \\
u_i &\sim \mathcal{N}(0, I_d) \\
z_{i} &\sim \mathcal{N}(f_z^{\mu}(u_i, l_i), f_z^{\sigma}(u_i, l_i)) \\
\mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\
p^{C}_{ijk} &\sim \text{Beta}(\mu^{C}_{ij}, \gamma^{C}_j) \\
y^{C}_{ijk} &\sim \text{Ber}(p^{C}_{ijk}) \\
y^{C}_{ij} &= \sum_{k} y^{C}_{ijk}
\end{align}
```

Equivalently, we can express this process more compactly as

```{math}
:nowrap: true

\begin{align}
l_i &\sim \text{Categorical}(1/L, \ldots, 1/L) \\
u_i &\sim \mathcal{N}(0, I_d) \\
z_{i} &\sim \mathcal{N}(f_z^{\mu}(u_i, l_i), f_z^{\sigma}(u_i, l_i)) \\
\mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\
y^{C}_{ij} &\sim \text{BetaBinomial}(n^{C}_{ij}, \mu^{C}_{ij}, \gamma^{C}_{j})
\end{align}
```

We assume no prior knowledge on the distribution of cell types in the data (i.e., we place a uniform prior on the
distribution of cell type labels). Within-cell-type variations $u_i$ are assumed to follow a fixed standard normal distribution,
while the distribution over the cell-type-aware latent variables $z_i$ depends on the learnable neural networks $f_z^{\mu}$ and
$f_z^{\sigma}$. The variables $z_i$ summarize a cell's state as a low-dimensional vector, and have a similar interpretation
as with MethylVI. However, by incorporating cell type labels into the model, MethylANVI may learn a better structured
latent space compared to MethylVI.

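The remainder of the model closely follows MethylVI. In particular, observed methylated cytosine counts are assumed
to follow a beta-binomial distribution conditioned on a cell's underlying state $z_i$ as well as batch covariates $s_i$,
with the number of trials $n^{C}_{ij}$ given by the total number of cytosines profiled in that region (the corresponding entry of $Y^{C}_{cov}$).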
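
To make the final sampling step concrete, here is a small standard-library sketch of drawing methylated counts from a beta-binomial given coverage, a mean, and a dispersion. It assumes a mean/dispersion parameterization in which the Beta shape parameters are $\alpha = \mu(1-\gamma)/\gamma$ and $\beta = (1-\mu)(1-\gamma)/\gamma$; the `mu` and `gamma` values stand in for the decoder output and the learned region-specific dispersions, and are invented here.

```python
import random


def sample_methylated_count(n_cov, mu, gamma, rng):
    """Draw y ~ BetaBinomial(n_cov, mu, gamma) via a Beta draw followed by Bernoulli trials.

    Assumes a mean/dispersion parameterization (an assumption of this sketch):
    alpha = mu * (1 - gamma) / gamma and beta = (1 - mu) * (1 - gamma) / gamma,
    with 0 < mu < 1 and 0 < gamma < 1.
    """
    alpha = mu * (1.0 - gamma) / gamma
    beta = (1.0 - mu) * (1.0 - gamma) / gamma
    p = rng.betavariate(alpha, beta)  # methylation probability shared within the region
    return sum(1 for _ in range(n_cov) if rng.random() < p)


rng = random.Random(0)
coverage = [10, 0, 7, 3]      # n_ij for one cell across four regions
mu = [0.9, 0.5, 0.1, 0.6]     # stand-in for the decoder output
gamma = [0.2, 0.2, 0.2, 0.2]  # stand-in for region-specific dispersions

counts = [sample_methylated_count(n, m, g, rng) for n, m, g in zip(coverage, mu, gamma)]
```

Under this parameterization the Beta draw has mean $\mu$, and larger $\gamma$ yields more overdispersion relative to a plain binomial.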

In addition to the variables defined for {doc}`/user_guide/models/methylvi`, we have the following variables for MethylANVI:

```{eval-rst}
.. list-table::
   :widths: 20 90 15
   :header-rows: 1

   * - Latent variable
     - Description
     - Code variable (if different)
   * - :math:`l_i \in \Delta^{L-1}`
     - Cell type label
     - ``y``
   * - :math:`z_i \in \mathbb{R}^d`
     - Latent cell state
     - ``z_1``
   * - :math:`u_i \in \mathbb{R}^{d}`
     - Latent cell-type specific state
     - ``z_2``
```

## Inference

MethylANVI posits the following factorized variational distribution for posterior inference:

```{math}
:nowrap: true

\begin{align}
q_\phi(z_i, u_i, l_i \mid y_i, n_i, s_i)
=
q_\phi(z_i \mid y_i, n_i, s_i)
q_\phi(l_i \mid z_i)
q_\phi(u_i \mid l_i, z_i)
\end{align}
```

Each of the individual variational distributions in our factorized expression is parameterized by neural
networks. Here $q_\phi(z_i \mid y_i, n_i, s_i)$ and $q_\phi(u_i \mid l_i, z_i)$ follow Gaussian distributions, while
$q_\phi(l_i \mid z_i)$ represents a Categorical distribution over cell types. Notably, $q_\phi(l_i \mid z_i)$ can be
leveraged post-training to predict cell types for an unlabeled cell. For this classification procedure, under the hood
we use as input the mean of the variational distribution $q_\phi(z_i \mid y_i, n_i, s_i)$.
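
The classification step described above can be sketched as follows. This is only an illustration: a single linear layer with made-up weights stands in for the trained classifier network, and the two-dimensional latent mean is invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(z_mean, weights, biases, cell_types):
    """Return (predicted label, class probabilities) for one cell's latent mean."""
    logits = [sum(w * z for w, z in zip(row, z_mean)) + b for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return cell_types[best], probs


cell_types = ["astrocyte", "neuron"]
weights = [[1.0, -1.0], [-1.0, 1.0]]  # made-up classifier parameters
biases = [0.0, 0.0]

# The input is the mean of the Gaussian over z, not a random sample from it.
label, probs = classify([0.2, 1.5], weights, biases, cell_types)
```

Feeding in the posterior mean rather than a sample keeps predictions deterministic for a given cell.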

## Training details

MethylANVI optimizes two evidence lower bounds (ELBOs) on the log evidence, with the two bounds corresponding to labeled
and unlabeled cells. These bounds largely mirror those of scANVI, with appropriate substitutions made to account for scBS-seq
observations. We refer the reader to the {doc}`/user_guide/models/scanvi` documentation for further details.

## Tasks

MethylANVI can perform the same tasks as MethylVI (see {doc}`/user_guide/models/methylvi`). In addition, MethylANVI can
do the following:

### Cell type label prediction

For cell type label prediction, MethylANVI uses the distribution $q_{\phi}(l_i \mid z_i)$, which is exposed through the
following function:

```
>>> mdata.obs["methylanvi_prediction"] = model.predict()
```

[^ref1]:
Ethan Weinberger and Su-In Lee (2021),
_A deep generative model of single-cell methylomic data_,
[OpenReview](https://openreview.net/forum?id=Mg2DM0F3AY).
3 changes: 2 additions & 1 deletion src/scvi/data/fields/__init__.py
@@ -26,7 +26,7 @@
from ._layer_field import LayerField, MuDataLayerField
from ._mudata import BaseMuDataWrapperClass, MuDataWrapper
from ._protein import MuDataProteinLayerField, ProteinObsmField
-from ._scanvi import LabelsWithUnlabeledObsField
+from ._scanvi import LabelsWithUnlabeledObsField, MuDataLabelsWithUnlabeledObsField
from ._uns_field import StringUnsField

__all__ = [
@@ -59,5 +59,6 @@
"MuDataCategoricalJointVarField",
"ProteinObsmField",
"LabelsWithUnlabeledObsField",
"MuDataLabelsWithUnlabeledObsField",
"StringUnsField",
]
4 changes: 4 additions & 0 deletions src/scvi/data/fields/_scanvi.py
@@ -8,6 +8,7 @@
from scvi.data._utils import _make_column_categorical, _set_data_in_registry

from ._dataframe_field import CategoricalObsField
from ._mudata import MuDataWrapper


class LabelsWithUnlabeledObsField(CategoricalObsField):
@@ -107,3 +108,6 @@ def transfer_field(
)
mapping = transfer_state_registry[self.CATEGORICAL_MAPPING_KEY]
return self._remap_unlabeled_to_final_category(adata_target, mapping)


MuDataLabelsWithUnlabeledObsField = MuDataWrapper(LabelsWithUnlabeledObsField)
1 change: 1 addition & 0 deletions src/scvi/dataloaders/_data_splitting.py
@@ -410,6 +410,7 @@ def __init__(
adata_manager.adata,
adata_manager.data_registry.labels.attr_name,
labels_state_registry.original_key,
mod_key=getattr(self.adata_manager.data_registry.labels, "mod_key", None),
).ravel()
self.unlabeled_category = labels_state_registry.unlabeled_category
self._unlabeled_indices = np.argwhere(labels == self.unlabeled_category).ravel()
1 change: 1 addition & 0 deletions src/scvi/dataloaders/_semi_dataloader.py
@@ -59,6 +59,7 @@ def __init__(
adata_manager.adata,
adata_manager.data_registry.labels.attr_name,
labels_state_registry.original_key,
mod_key=getattr(adata_manager.data_registry.labels, "mod_key", None),
).ravel()

# save a nested list of the indices per labeled category
3 changes: 2 additions & 1 deletion src/scvi/external/__init__.py
@@ -2,7 +2,7 @@
from .contrastivevi import ContrastiveVI
from .decipher import Decipher
from .gimvi import GIMVI
-from .methylvi import METHYLVI
+from .methylvi import METHYLANVI, METHYLVI
from .mrvi import MRVI
from .poissonvi import POISSONVI
from .scar import SCAR
Expand All @@ -27,4 +27,5 @@
"VELOVI",
"MRVI",
"METHYLVI",
"METHYLANVI",
]
7 changes: 4 additions & 3 deletions src/scvi/external/methylvi/__init__.py
@@ -1,6 +1,7 @@
from ._base_components import DecoderMETHYLVI
from ._constants import METHYLVI_REGISTRY_KEYS
-from ._model import METHYLVI as METHYLVI
-from ._module import METHYLVAE
+from ._methylanvi_model import METHYLANVI as METHYLANVI
+from ._methylvi_model import METHYLVI as METHYLVI
+from ._methylvi_module import METHYLVAE

-__all__ = ["METHYLVI_REGISTRY_KEYS", "DecoderMETHYLVI", "METHYLVAE", "METHYLVI"]
+__all__ = ["METHYLVI_REGISTRY_KEYS", "DecoderMETHYLVI", "METHYLVAE", "METHYLVI", "METHYLANVI"]