Custom dataloader registry support #2932

ori-kron-wis · 2024-08-07T12:57:55Z

No description provided.

…try' into ori-2907-custom-dataloader-registry

…module / registry big change

for more information, see https://pre-commit.ci

…un, we will later adjust this file

codecov · 2024-08-11T11:53:18Z

Codecov Report

Attention: Patch coverage is 54.10628% with 95 lines in your changes missing coverage. Please review.

Project coverage is 83.90%. Comparing base (6bb8d8c) to head (f94f7fa).

Files with missing lines	Patch %	Lines
src/scvi/model/base/_base_model.py	41.17%	70 Missing ⚠️
src/scvi/model/_scvi.py	56.00%	11 Missing ⚠️
src/scvi/model/base/_archesmixin.py	75.00%	8 Missing ⚠️
src/scvi/model/_scanvi.py	77.77%	4 Missing ⚠️
src/scvi/model/base/_save_load.py	75.00%	1 Missing ⚠️
src/scvi/model/base/_training_mixin.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2932      +/-   ##
==========================================
- Coverage   84.81%   83.90%   -0.92%     
==========================================
  Files         173      173              
  Lines       14793    14920     +127     
==========================================
- Hits        12547    12518      -29     
- Misses       2246     2402     +156

Files with missing lines	Coverage Δ
src/scvi/data/_utils.py	`86.12% <100.00%> (+0.58%)`	⬆️
src/scvi/external/stereoscope/_model.py	`92.40% <ø> (ø)`
src/scvi/external/stereoscope/_module.py	`96.33% <ø> (ø)`
src/scvi/model/_amortizedlda.py	`94.11% <ø> (ø)`
src/scvi/model/_autozi.py	`95.40% <ø> (ø)`
src/scvi/model/_condscvi.py	`95.74% <ø> (ø)`
src/scvi/model/_jaxscvi.py	`92.30% <ø> (ø)`
src/scvi/model/_linear_scvi.py	`94.87% <ø> (ø)`
src/scvi/model/_multivi.py	`72.26% <ø> (ø)`
src/scvi/model/_peakvi.py	`87.09% <ø> (ø)`
... and 7 more

... and 2 files with indirect coverage changes

…action

and fix the test for custom dataloaders

canergen · 2024-10-11T03:37:51Z

src/scvi/model/_scvi.py

+    @setup_anndata_dsp.dedent
+    def setup_datamodule(
+        cls,
+        datamodule,  # TODO: what to put here?


This should be a PyTorch DataLoader, right? Martin has already added type annotations for it in the current code.

canergen · 2024-10-11T03:38:51Z

src/scvi/model/_scvi.py

+                    "state_registry": {
+                        "n_obs": datamodule.n_obs,
+                        "n_vars": datamodule.n_vars,
+                        "column_names": [str(i) for i in column_names],  # TODO: from adata (czi)?


I'm not following this TODO.

canergen · 2024-10-11T03:40:14Z

src/scvi/model/base/_archesmixin.py

-                _validate_var_names(adata[modality], var_names[modality])
+                    logger.debug("Subsetting query vars to reference vars.")
+                    adata._inplace_subset_var(var_names)
+                _validate_var_names(adata, var_names)


For dataloaders, we also need to verify that the gene names match and that they are in the same order.
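
A minimal sketch of such a check, assuming the dataloader can expose its gene names as a plain sequence of strings. `check_var_names` is a hypothetical helper for illustration only; it is not the existing `_validate_var_names`, and the error messages are made up:

```python
from collections.abc import Sequence


def check_var_names(query_names: Sequence[str], reference_names: Sequence[str]) -> None:
    """Raise ValueError if the query genes differ from the reference in content or order.

    Hypothetical helper, not part of the code base.
    """
    query = [str(name) for name in query_names]
    reference = [str(name) for name in reference_names]

    if query == reference:
        return  # same genes in the same order

    # Same genes (including duplicates) but arranged differently.
    if sorted(query) == sorted(reference):
        raise ValueError("Gene names match the reference but are in a different order.")

    query_set = set(query)
    reference_set = set(reference)
    missing = [name for name in reference if name not in query_set]
    extra = [name for name in query if name not in reference_set]
    raise ValueError(
        f"Gene names do not match the reference: {len(missing)} missing, {len(extra)} unexpected."
    )
```

Distinguishing the "wrong order" case from the "wrong genes" case keeps the error actionable: the first can be fixed by reordering, the second needs a different input.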

canergen · 2024-10-11T03:41:57Z

src/scvi/model/base/_archesmixin.py

-            logger.debug("Subsetting query vars to reference vars.")
-            adata._inplace_subset_var(var_names)
-        _validate_var_names(adata, var_names)
+            registry = attr_dict.pop("registry_")


This check and the ones below are independent of datamodule or AnnData, right? Remove the indent.

I am not sure why all of the code here is displayed as modified.

canergen · 2024-10-11T03:43:17Z

src/scvi/model/base/_archesmixin.py

@@ -202,7 +215,7 @@ def prepare_query_anndata(
        Query adata ready to use in `load_query_data` unless `return_reference_var_names`
        in which case a pd.Index of reference var names is returned.
        """
-        _, var_names, _ = _get_loaded_data(reference_model, device="cpu")
+        _, var_names, _ = _get_loaded_data(reference_model, device="cpu", adata=adata)


Does this work with a dataloader?

canergen · 2024-10-11T03:43:58Z

src/scvi/model/base/_archesmixin.py

@@ -350,15 +363,15 @@ def requires_grad(key):
            par.requires_grad = False


-def _get_loaded_data(reference_model, device=None):
+def _get_loaded_data(reference_model, device=None, adata=None):


Why do we need adata here?

canergen · 2024-10-11T03:45:14Z

src/scvi/model/base/_base_model.py

+            self.registry_ = registry
+            self.summary_stats = _get_summary_stats_from_registry(registry)
+        elif self.__class__.__name__ == "GIMVI":
+            # note some models do accept empty registry/adata (e.g: gimvi)


I'm not following this one. What is the exception with GIMVI?

canergen · 2024-10-11T03:46:28Z

src/scvi/model/base/_base_model.py

+        else:
+            return self._adata_manager.get_from_registry(registry_key)
+
+    # def get_from_registry(self, registry_key: str) -> np.ndarray | pd.DataFrame:


What's this commented-out code for?

canergen · 2024-10-11T03:47:21Z

src/scvi/model/base/_base_model.py

        else:
            # Case where correct AnnDataManager is found, replay registration as necessary.
            adata_manager.validate()

        return adata

+    def transfer_fields(self, adata: AnnOrMuData, **kwargs) -> AnnData:
+        """Transfer fields from a model to an AnnData object."""
+        if self.adata:


Where do we need transfer_fields? Can we make it work with a datamodule?

canergen · 2024-10-11T03:48:08Z

src/scvi/model/base/_base_model.py

@@ -627,8 +711,7 @@ def save(

        # save the model state dict and the trainer state dict only
        model_state_dict = self.module.state_dict()
-
-        var_names = _get_var_names(self.adata, legacy_mudata_format=legacy_mudata_format)
+        var_names = self.get_var_names(legacy_mudata_format=legacy_mudata_format)


Do we need two get_var_names functions?

canergen · 2024-10-11T03:48:56Z

src/scvi/model/base/_base_model.py

+                    "Saved model does not contain original setup inputs. "
+                    "Cannot load the original setup."
+                )
+            _validate_var_names(adata, var_names)


This should also be validated for a dataloader.

canergen · 2024-10-11T03:49:59Z

src/scvi/model/base/_base_model.py

+
+    def get_state_registry(self, registry_key: str) -> attrdict:
+        """Returns the state registry for the AnnDataField registered with this instance."""
+        return attrdict(self.registry_[_FIELD_REGISTRIES_KEY][registry_key][_STATE_REGISTRY_KEY])


Does this work with a dataloader? If so, the documentation should be updated.
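
For context, the lookup itself is a plain nested-dictionary access, so it should not depend on AnnData as long as the datamodule path fills in the same keys. A minimal sketch, assuming the key constants resolve to "field_registries" and "state_registry"; the field name "X" and all values are made up, and a plain dict is returned here instead of an attrdict:

```python
# Assumed values of the key constants; verify against the constants module.
FIELD_REGISTRIES_KEY = "field_registries"
STATE_REGISTRY_KEY = "state_registry"

# Made-up registry shaped like the one built in setup_datamodule.
registry = {
    FIELD_REGISTRIES_KEY: {
        "X": {
            STATE_REGISTRY_KEY: {
                "n_obs": 1000,
                "n_vars": 3,
                "column_names": ["g0", "g1", "g2"],
            },
        },
    },
}


def get_state_registry(registry: dict, registry_key: str) -> dict:
    """Return the state registry stored for one field (illustrative version)."""
    return registry[FIELD_REGISTRIES_KEY][registry_key][STATE_REGISTRY_KEY]


state = get_state_registry(registry, "X")
```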

canergen · 2024-10-11T03:51:30Z

src/scvi/model/base/_save_load.py

@@ -133,7 +133,10 @@ def _initialize_model(cls, adata, attr_dict):
    if "pretrained_model" in non_kwargs.keys():
        non_kwargs.pop("pretrained_model")

-    model = cls(adata, **non_kwargs, **kwargs)
+    if not adata:
+        adata = None


Is adata false here? Do we need a default value for registry?
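
To illustrate the concern: in Python, `not obj` is true for `None` but also for any object whose `__len__` returns 0, so a truthiness check cannot tell "no data passed" apart from "empty data passed". A minimal sketch with a hypothetical stand-in class (not AnnData):

```python
class EmptyContainer:
    """Hypothetical stand-in for a data object that happens to be empty."""

    def __len__(self) -> int:
        return 0


def is_missing_by_truthiness(data) -> bool:
    # Mirrors the `if not adata` check in the diff.
    return not data


def is_missing_strict(data) -> bool:
    # Treats only an absent input as missing.
    return data is None


empty = EmptyContainer()
truthiness_result = is_missing_by_truthiness(empty)  # empty object counts as missing
strict_result = is_missing_strict(empty)  # empty object is still an object
```

If the intent is only to handle the case where no AnnData was passed, an explicit `is None` comparison states that more directly.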

canergen · 2024-10-11T03:52:00Z

src/scvi/model/base/_training_mixin.py

        if max_epochs is None:
-            if datamodule is None:
+            if self.adata is not None:


We should take n_obs from the summary stats here so that this is compatible with a dataloader.

See below: with that change we don't need the if statement.

canergen · 2024-10-11T04:12:04Z