float overflow in dsb normalization #130

Open

javier-marchena-hurtado opened this issue Mar 1, 2024 · 1 comment

Labels
bug Something isn't working

javier-marchena-hurtado commented Mar 1, 2024

Describe the bug

I ran dsb normalization with muon.prot.pp.dsb() and realized that the results for my data did not fully match the results of the original DSBNormalizeProtein() implementation in R. I printed the means of the empty droplets, that is, empty_scaled.mean(axis=0) from line 156 of muon/prot/preproc.py.

Much to my surprise, the means were all around 2.2, which is impossible: log(0 + 10) is 2.3026, so even if all values were 0 the mean should still be at least 2.3026. Searching Stack Overflow led me to https://stackoverflow.com/questions/22980487/why-is-the-mean-smaller-than-the-minimum-and-why-does-this-change-with-64bit-flo .

So it seems that AnnData's default float32 dtype can lead to an incorrect mean: for large datasets, the sum of all cell measurements for a protein becomes too large to be accumulated accurately in 32 bits. I therefore changed the anndata.X matrix from float32 to float64 and reran dsb normalization. That solved the problem: after the change to float64 the means were around 2.31, which makes sense since most values in empty droplets are 0 and, as noted above, log(10) = 2.3026.
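For illustration, here is a minimal, self-contained sketch of the effect (the array shape is made up, and the exact drift depends on how NumPy reduces over axis 0, but the float32 column means of a large matrix can end up visibly below the true value):

import numpy as np

# Toy matrix shaped like (empty droplets x proteins) in which every entry is
# log(0 + 10) ~= 2.3026, so every per-protein mean should be exactly that value.
m = np.full((6_000_000, 4), np.log(10), dtype=np.float32)

print(m.mean(axis=0))                     # float32 accumulation: can drift well below 2.3026
print(m.astype(np.float64).mean(axis=0))  # float64: ~2.30258509 for every column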

This is especially problematic because float32 is the default data type in AnnData, and the standard deviations might be affected even more than the means.

So I propose coercing the anndata.X of the empty droplets to float64 before calculating the mean and std of the empty droplets. This is as easy as adding empty.X = empty.X.astype(np.float64) at line 144 of muon/prot/preproc.py.
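To make the suggestion concrete, here is a minimal sketch using a toy AnnData in place of the real empty-droplet object (the variable name empty mirrors preproc.py; the toy data and sizes are made up):

import numpy as np
from scipy.sparse import csr_matrix
from anndata import AnnData

# Toy stand-in for the empty-droplet AnnData (illustrative only).
empty = AnnData(X=csr_matrix(np.random.poisson(5, size=(1000, 32)).astype(np.float32)))

# Proposed one-liner: coerce to float64 before the log transform and the mean/std.
empty.X = empty.X.astype(np.float64)

empty_scaled = np.log(empty.X.toarray() + 10)
print(empty_scaled.mean(axis=0).dtype)  # float64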

I applied only step I of dsb normalization, skipping step II (no denoising and no isotype controls).

To Reproduce

You can download public 10X data used for reproduction here: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3_nextgem

In the following code, I very slightly modified the muon.prot.pp.dsb() function in order to 1) store the means and stds in the AnnData object and 2) allow choosing between mean subtraction and standardization (mean subtraction followed by division by the std). I then ran dsb normalization with float32 (the default) and with float64.

import muon
import scanpy as sc
import numpy as np
from typing import Optional, Iterable, Tuple, Union
from numbers import Integral, Real
from warnings import warn
import pandas as pd
from scipy.sparse import issparse, csc_matrix, csr_matrix
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from anndata import AnnData
from mudata import MuData

def dsb(
    data: Union[AnnData, MuData],
    data_raw: Optional[Union[AnnData, MuData]] = None,
    pseudocount: Integral = 10,
    denoise_counts: bool = True,
    isotype_controls: Optional[Iterable[str]] = None,
    empty_counts_range: Optional[Tuple[Real, Real]] = None,
    cell_counts_range: Optional[Tuple[Real, Real]] = None,
    add_layer: bool = False,
    scale_factor: str = "standardize",
    mean_name = "mean_empty",
    std_name = "std_empty",
    random_state: Optional[Union[int, np.random.RandomState, None]] = None,
) -> Union[None, MuData]:
   
    toreturn = None
    if data_raw is None:
        if empty_counts_range is None or cell_counts_range is None:
            raise ValueError(
                "data_raw is None, assuming data is the unfiltered object, but no count ranges provided"
            )
        if max(*empty_counts_range) > min(*cell_counts_range):
            raise ValueError("overlapping count ranges")
        if not isinstance(data, MuData) or "prot" not in data.mod or "rna" not in data.mod:
            raise TypeError(
                "No data_raw given, assuming data is the unfiltered object, but data is not MuData"
                " or does not contain 'prot' and 'rna' modalities"
            )
        if data.mod["rna"].n_obs != data.mod["prot"].n_obs:
            raise ValueError("different numbers of cells in 'rna' and 'prot' modalities.")

        log10umi = np.log10(np.asarray(data.mod["rna"].X.sum(axis=1)).squeeze() + 1)
        empty_idx = np.where(
            (log10umi >= min(*empty_counts_range)) & (log10umi < max(*empty_counts_range))
        )[0]
        cell_idx = np.where(
            (log10umi >= min(*cell_counts_range)) & (log10umi < max(*cell_counts_range))
        )[0]
        cellidx = data.mod["prot"].obs_names[cell_idx]
        empty = data.mod["prot"][empty_idx, :]

        data = data[cellidx, :].copy()
        cells = data.mod["prot"]

        toreturn = data

    elif isinstance(data_raw, AnnData):
        empty = data_raw
    elif isinstance(data_raw, MuData) and "prot" in data_raw.mod:
        empty = data_raw["prot"]
    else:
        raise TypeError("data_raw must be an AnnData or a MuData object with 'prot' modality")

    if isinstance(data, AnnData):
        cells = data
    elif isinstance(data, MuData) and "prot" in data.mod:
        cells = data["prot"]
    else:
        raise TypeError("data must be an AnnData or a MuData object with 'prot' modality")

    if pseudocount < 0:
        raise ValueError("pseudocount cannot be negative")

    if cells.shape[1] != empty.shape[1]:  # this should only be possible if mudata_raw != None
        raise ValueError("data and data_raw have different numbers of proteins")

    if empty_counts_range is None:  # mudata_raw != None
        warn(
            "empty_counts_range values are not provided, treating all the non-cells as empty droplets"
        )
        empty = empty[~empty.obs_names.isin(cells.obs_names)]
    else:
        warn(
            f"empty_counts_range will be deprecated in the future versions",
            DeprecationWarning,
            stacklevel=2,
        )
        if data_raw is not None:
            if not isinstance(data_raw, MuData) or "rna" not in data_raw.mod:
                warn(
                    "data_raw must be a MuData object with 'rna' modality, ignoring empty_counts_range and treating all the non-cells as empty droplets"
                )
                empty = empty[~empty.obs_names.isin(cells.obs_names)]
            else:
                # data_raw is a MuData with 'rna' modality and empty_counts_range values are provided
                log10umi = np.log10(np.asarray(data_raw.mod["rna"].X.sum(axis=1)).squeeze() + 1)
                bc_umis = pd.DataFrame({"log10umi": log10umi}, index=data_raw.mod["rna"].obs_names)
                empty_droplets = bc_umis.query(
                    f"log10umi >= {min(*empty_counts_range)} & log10umi < {max(*empty_counts_range)}"
                ).index.values

                empty_len_orig = len(empty_droplets)
                empty_droplets = np.array([i for i in empty_droplets if i not in cells.obs_names])
                empty_len = len(empty_droplets)
                if empty_len != empty_len_orig:
                    warn(
                        f"Dropping {empty_len_orig - empty_len} empty droplets as they are already defined as cells"
                    )
                empty = empty[empty_droplets].copy()

    if data_raw is not None and cell_counts_range is not None:
        warn("cell_counts_range values are ignored since cells are provided in data")

    empty_scaled = (
        np.log(empty.X + pseudocount)
        if not issparse(empty.X)
        else np.log(empty.X.toarray() + pseudocount)
    )
    cells_scaled = (
        np.log(cells.X + pseudocount)
        if not issparse(cells.X)
        else np.log(cells.X.toarray() + pseudocount)
    )
    
    if scale_factor == "standardize":
        cells_scaled = (cells_scaled - empty_scaled.mean(axis=0)) / empty_scaled.std(axis=0)
    elif scale_factor == "mean_subtract":
        cells_scaled = cells_scaled - empty_scaled.mean(axis=0)
    else:
        raise ValueError("Unknown scale factor")

    if denoise_counts:
        bgmeans = np.empty(cells_scaled.shape[0], np.float32)
        # init_params needs to be random, otherwise fitted variance for one of the n_components
        # sometimes goes to 0
        sharedvar = GaussianMixture(
            n_components=2, covariance_type="tied", init_params="random", random_state=random_state
        )
        separatevar = GaussianMixture(
            n_components=2, covariance_type="full", init_params="random", random_state=random_state
        )
        for c in range(cells_scaled.shape[0]):
            sharedvar.fit(cells_scaled[c, :, np.newaxis])
            separatevar.fit(cells_scaled[c, :, np.newaxis])

            if sharedvar.bic(cells_scaled[c, :, np.newaxis]) < separatevar.bic(
                cells_scaled[c, :, np.newaxis]
            ):
                bgmeans[c] = np.min(sharedvar.means_)
            else:
                bgmeans[c] = np.min(separatevar.means_)

        if isotype_controls is not None:
            pca = PCA(n_components=1, whiten=True)
            ctrl_idx = np.where(cells.var_names.isin(set(isotype_controls)))[0]
            if len(ctrl_idx) < len(isotype_controls):
                warn("Some isotype controls are not present in the data.")
            covar = pca.fit_transform(
                np.hstack((cells_scaled[:, ctrl_idx], bgmeans.reshape(-1, 1)))
            )
        else:
            covar = bgmeans[:, np.newaxis]

        reg = LinearRegression(fit_intercept=True, copy_X=False)
        reg.fit(covar, cells_scaled)

        cells_scaled -= reg.predict(covar) - reg.intercept_
    if add_layer:
        cells.layers["dsb"] = cells_scaled
    else:
        cells.X = cells_scaled
        
    cells.uns[mean_name] = empty_scaled.mean(axis=0)
    cells.uns[std_name] = empty_scaled.std(axis=0)
        
    return toreturn


raw = sc.read_10x_h5("/home/marchena/Downloads/5k_pbmc_protein_v3_nextgem_raw_feature_bc_matrix.h5", gex_only=False)
filtered = sc.read_10x_h5("/home/marchena/Downloads/5k_pbmc_protein_v3_nextgem_filtered_feature_bc_matrix.h5", gex_only=False)

# Add cellranger background as .obs.
cellranger_background = [barcode not in filtered.obs_names for barcode in raw.obs_names]
raw.obs["cellranger_background"] = np.array(cellranger_background).astype(str)

# Separate into RNA and protein.
raw_rna = raw[:, raw.var.feature_types == "Gene Expression"].copy()
raw_protein = raw[:, raw.var.feature_types == "Antibody Capture"].copy()

# raw_protein.X = raw_protein.X.astype(np.float64)  # UNCOMMENT THIS LINE IN ORDER TO USE FLOAT64!

# Filter to protein of real cells (CellRanger-defined cells) and background (empty droplets).
filtered_protein = raw_protein[filtered.obs_names, :].copy()
# filtered_protein = filtered_protein[:, sorted(filtered_protein.var_names)].copy() # Sort protein names alphabetically.
cellranger_background_protein = raw_protein[raw_protein.obs["cellranger_background"] == "True"].copy()

# Performing DSB normalization
print("performing DSB normalization")
# DSB normalization with CellRanger background.
dsb(data=filtered_protein, data_raw=raw_protein, add_layer=True, denoise_counts=False, scale_factor="standardize")
filtered_protein.layers["dsb_cellranger_background_standardized"] = filtered_protein.layers["dsb"]
# DSB normalization with CellRanger background, mean subtraction only.
dsb(data=filtered_protein, data_raw=raw_protein, add_layer=True, denoise_counts=False, scale_factor="mean_subtract")
filtered_protein.layers["dsb_cellranger_background_mean_subtracted"] = filtered_protein.layers["dsb"]

dsb(data=filtered_protein, data_raw=raw_protein, add_layer=True, denoise_counts=False)

print(filtered_protein.uns)

Means with float32:

('mean_empty', array([2.1995733, 2.2005668, 2.1973336, 2.20299 , 2.2034488, 2.1984806,
2.1966617, 2.1966765, 2.1987805, 2.1967235, 2.202042 , 2.200193 ,
2.1957297, 2.2018113, 2.197363 , 2.196647 , 2.197452 , 2.2016175,
2.195974 , 2.196987 , 2.1979306, 2.1962392, 2.199787 , 2.1964517,
2.1975152, 2.1963153, 2.1968145, 2.203014 , 2.1961546, 2.1962936,
2.1961162, 2.1968863], dtype=float32))

Means with float64:

('mean_empty', array([2.30580925, 2.30669376, 2.30415037, 2.3086791 , 2.30932639,
2.30478346, 2.30370452, 2.30370842, 2.30499958, 2.30373097,
2.30787521, 2.30589283, 2.30259115, 2.30726847, 2.30418295,
2.30367811, 2.30425233, 2.30698558, 2.30301876, 2.30395364,
2.30448475, 2.3033102 , 2.3055071 , 2.30348507, 2.30424637,
2.30339363, 2.30380386, 2.30861361, 2.30324365, 2.30335848,
2.30319787, 2.30382639]))

Expected behaviour

I expect the means of the log-transformed protein measurements across empty droplets to be at least 2.3026 (= log(0 + 10)), which is the minimum possible value.
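A quick sanity check along these lines (assuming the modified dsb() above was run and stored the means under uns["mean_empty"], as in the snippet):

import numpy as np

# With pseudocount=10, log(0 + 10) is the smallest possible transformed value,
# so no per-protein mean over empty droplets can legitimately fall below it.
means = filtered_protein.uns["mean_empty"]
assert np.all(means >= np.log(10)), "means below log(10) point to float32 accumulation error"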

System

  • OS: Debian 12
  • Python version: 3.10.9
  • Versions of libraries involved:
    muon 0.1.3
    numpy 1.23.5
ivirshup (Member) commented Mar 4, 2024

@gtca, FWIW in scanpy we avoid some float32 inaccuracy in mean/variance calculations by accumulating the float32 values into float64.
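For reference, a small sketch of that idea (not scanpy's actual code, just the NumPy mechanism): mean() and std() take a dtype argument, so the matrix can stay float32 in memory while the reduction itself is accumulated in float64.

import numpy as np

# float32 data, float64 accumulation: only the reduction runs at higher precision.
x = np.full((6_000_000, 4), np.log(10), dtype=np.float32)

mu = x.mean(axis=0, dtype=np.float64)  # ~2.30258509 for every column
sd = x.std(axis=0, dtype=np.float64)   # the same trick applies to the standard deviation
print(mu, sd)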
