You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I ran dsb normalization with muon.prot.pp.dsb(), and I realized that the results for my data did not fully match with the results of the DSBNormalizeProtein() original implementation in R. I printed the means of the empty droplets, that is, the empty_scaled.mean(axis=0) from line 156 in muon/prot/preproc.py.
So it seems that, due to AnnData using by default float32, this can lead to wrong calculations of the mean. Because for large datasets, the sum of all cell measurements for a protein can be very large, too large for 32 bits. Therefore, I tried changing the anndata.X matrix from float32 to float64, and then running dsb normalization. This seemed to solve the problem, because after the change to float64, the means were around 2.31. Which makes sense since most values of empty cells are 0, and as I said log(10) = 2.3025.
This is especially problematic because the default data type in AnnData is float32. And standard deviations might be even more affected than the means.
So I propose that you coerce the anndata.X of the empty droplets into float64 before calculating the mean and std of empty droplets. This is as easy as adding, in line 144 of muon/prot/preproc.py, empty.X = empty.X.astype(np.float64).
I applied only step I of dsb normalization, without applying step II (without denoising and isotype controls).
In the following code, I very slightly modified the muon.prot.pp.dsb() function in order to 1) store the means and stds in the AnnData object and 2) to be able to choose between mean subtracting and standardization (mean subtracting and dividing by std). Then, I tried running dsb normalization with float32 (default) and with float64.
import muon
import scanpy as sc
import numpy as np
from typing import Optional, Iterable, Tuple, Union
from numbers import Integral, Real
from warnings import warn
import numpy as np
import pandas as pd
from scipy.sparse import issparse, csc_matrix, csr_matrix
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from anndata import AnnData
from mudata import MuData
def dsb(
data: Union[AnnData, MuData],
data_raw: Optional[Union[AnnData, MuData]] = None,
pseudocount: Integral = 10,
denoise_counts: bool = True,
isotype_controls: Optional[Iterable[str]] = None,
empty_counts_range: Optional[Tuple[Real, Real]] = None,
cell_counts_range: Optional[Tuple[Real, Real]] = None,
add_layer: bool = False,
scale_factor: str = "standardize",
mean_name = "mean_empty",
std_name = "std_empty",
random_state: Optional[Union[int, np.random.RandomState, None]] = None,
) -> Union[None, MuData]:
toreturn = None
if data_raw is None:
if empty_counts_range is None or cell_counts_range is None:
raise ValueError(
"data_raw is None, assuming data is the unfiltered object, but no count ranges provided"
)
if max(*empty_counts_range) > min(*cell_counts_range):
raise ValueError("overlapping count ranges")
if not isinstance(data, MuData) or "prot" not in data.mod or "rna" not in data.mod:
raise TypeError(
"No data_raw given, assuming data is the unfiltered object, but data is not MuData"
" or does not contain 'prot' and 'rna' modalities"
)
if data.mod["rna"].n_obs != data.mod["prot"].n_obs:
raise ValueError("different numbers of cells in 'rna' and 'prot' modalities.")
log10umi = np.log10(np.asarray(data.mod["rna"].X.sum(axis=1)).squeeze() + 1)
empty_idx = np.where(
(log10umi >= min(*empty_counts_range)) & (log10umi < max(*empty_counts_range))
)[0]
cell_idx = np.where(
(log10umi >= min(*cell_counts_range)) & (log10umi < max(*cell_counts_range))
)[0]
cellidx = data.mod["prot"].obs_names[cell_idx]
empty = data.mod["prot"][empty_idx, :]
data = data[cellidx, :].copy()
cells = data.mod["prot"]
toreturn = data
elif isinstance(data_raw, AnnData):
empty = data_raw
elif isinstance(data_raw, MuData) and "prot" in data_raw.mod:
empty = data_raw["prot"]
else:
raise TypeError("data_raw must be an AnnData or a MuData object with 'prot' modality")
if isinstance(data, AnnData):
cells = data
elif isinstance(data, MuData) and "prot" in data.mod:
cells = data["prot"]
else:
raise TypeError("data must be an AnnData or a MuData object with 'prot' modality")
if pseudocount < 0:
raise ValueError("pseudocount cannot be negative")
if cells.shape[1] != empty.shape[1]: # this should only be possible if mudata_raw != None
raise ValueError("data and data_raw have different numbers of proteins")
if empty_counts_range is None: # mudata_raw != None
warn(
"empty_counts_range values are not provided, treating all the non-cells as empty droplets"
)
empty = empty[~empty.obs_names.isin(cells.obs_names)]
else:
warn(
f"empty_counts_range will be deprecated in the future versions",
DeprecationWarning,
stacklevel=2,
)
if data_raw is not None:
if not isinstance(data_raw, MuData) or "rna" not in data_raw.mod:
warn(
"data_raw must be a MuData object with 'rna' modality, ignoring empty_counts_range and treating all the non-cells as empty droplets"
)
empty = empty[~empty.obs_names.isin(cells.obs_names)]
else:
# data_raw is a MuData with 'rna' modality and empty_counts_range values are provided
log10umi = np.log10(np.asarray(data_raw.mod["rna"].X.sum(axis=1)).squeeze() + 1)
bc_umis = pd.DataFrame({"log10umi": log10umi}, index=data_raw.mod["rna"].obs_names)
empty_droplets = bc_umis.query(
f"log10umi >= {min(*empty_counts_range)} & log10umi < {max(*empty_counts_range)}"
).index.values
empty_len_orig = len(empty_droplets)
empty_droplets = np.array([i for i in empty_droplets if i not in cells.obs_names])
empty_len = len(empty_droplets)
if empty_len != empty_len_orig:
warn(
f"Dropping {empty_len_orig - empty_len} empty droplets as they are already defined as cells"
)
empty = empty[empty_droplets].copy()
if data_raw is not None and cell_counts_range is not None:
warn("cell_counts_range values are ignored since cells are provided in data")
empty_scaled = (
np.log(empty.X + pseudocount)
if not issparse(empty.X)
else np.log(empty.X.toarray() + pseudocount)
)
cells_scaled = (
np.log(cells.X + pseudocount)
if not issparse(cells.X)
else np.log(cells.X.toarray() + pseudocount)
)
if scale_factor == "standardize":
cells_scaled = (cells_scaled - empty_scaled.mean(axis=0)) / empty_scaled.std(axis=0)
elif scale_factor == "mean_subtract":
cells_scaled = cells_scaled - empty_scaled.mean(axis=0)
else:
raise ValueError("Unknown scale factor")
if denoise_counts:
bgmeans = np.empty(cells_scaled.shape[0], np.float32)
# init_params needs to be random, otherwise fitted variance for one of the n_components
# sometimes goes to 0
sharedvar = GaussianMixture(
n_components=2, covariance_type="tied", init_params="random", random_state=random_state
)
separatevar = GaussianMixture(
n_components=2, covariance_type="full", init_params="random", random_state=random_state
)
for c in range(cells_scaled.shape[0]):
sharedvar.fit(cells_scaled[c, :, np.newaxis])
separatevar.fit(cells_scaled[c, :, np.newaxis])
if sharedvar.bic(cells_scaled[c, :, np.newaxis]) < separatevar.bic(
cells_scaled[c, :, np.newaxis]
):
bgmeans[c] = np.min(sharedvar.means_)
else:
bgmeans[c] = np.min(separatevar.means_)
if isotype_controls is not None:
pca = PCA(n_components=1, whiten=True)
ctrl_idx = np.where(cells.var_names.isin(set(isotype_controls)))[0]
if len(ctrl_idx) < len(isotype_controls):
warn("Some isotype controls are not present in the data.")
covar = pca.fit_transform(
np.hstack((cells_scaled[:, ctrl_idx], bgmeans.reshape(-1, 1)))
)
else:
covar = bgmeans[:, np.newaxis]
reg = LinearRegression(fit_intercept=True, copy_X=False)
reg.fit(covar, cells_scaled)
cells_scaled -= reg.predict(covar) - reg.intercept_
if add_layer:
cells.layers["dsb"] = cells_scaled
else:
cells.X = cells_scaled
cells.uns[mean_name] = empty_scaled.mean(axis=0)
cells.uns[std_name] = empty_scaled.std(axis=0)
return toreturn
raw = sc.read_10x_h5("/home/marchena/Downloads/5k_pbmc_protein_v3_nextgem_raw_feature_bc_matrix.h5", gex_only=False)
filtered = sc.read_10x_h5("/home/marchena/Downloads/5k_pbmc_protein_v3_nextgem_filtered_feature_bc_matrix.h5", gex_only=False)
# Add cellranger background as .obs.
cellranger_background = [barcode not in filtered.obs_names for barcode in raw.obs_names]
raw.obs["cellranger_background"] = np.array(cellranger_background).astype(str)
# Separate into RNA and protein.
raw_rna = raw[:, raw.var.feature_types == "Gene Expression"].copy()
raw_protein = raw[:, raw.var.feature_types == "Antibody Capture"].copy()
# raw_protein.X = raw_protein.X.astype(np.float64) # UNCOMMENT THIS LINE IN ORDER TO USE FLOAT64!
# Filter to protein of real cells (CellRanger defined cells) and background (empty droplets)·
filtered_protein = raw_protein[filtered.obs_names, :].copy()
# filtered_protein = filtered_protein[:, sorted(filtered_protein.var_names)].copy() # Sort protein names alphabetically.
cellranger_background_protein = raw_protein[raw_protein.obs["cellranger_background"] == True].copy()
# Performing DSB normalization
print("performing DSB normalization")
# DSB normalization with CellRanger background.
dsb(data=filtered_protein, data_raw=raw_protein, add_layer=True, denoise_counts=False, scale_factor="standardize")
filtered_protein.layers["dsb_cellranger_background_standardized"] = filtered_protein.layers["dsb"]
# DSB normalization with CellRanger background.
dsb(data=filtered_protein, data_raw=raw_protein, add_layer=True, denoise_counts=False, scale_factor="mean_subtract")
filtered_protein.layers["dsb_cellranger_background_mean_subtracted"] = filtered_protein.layers["dsb"]
dsb(data=filtered_protein, data_raw=raw_protein, add_layer=True, denoise_counts=False)
print(filtered_protein.uns)
Describe the bug
I ran dsb normalization with muon.prot.pp.dsb(), and I realized that the results for my data did not fully match with the results of the DSBNormalizeProtein() original implementation in R. I printed the means of the empty droplets, that is, the empty_scaled.mean(axis=0) from line 156 in muon/prot/preproc.py.
Much to my surprise, the means were all around 2.2, which is impossible because log(0+10) is 2.3025, that is, even if all values were 0, the mean should be higher than 2.2 (at least 2.3025). I searched this in StackOverflow, which led me to https://stackoverflow.com/questions/22980487/why-is-the-mean-smaller-than-the-minimum-and-why-does-this-change-with-64bit-flo .
So it seems that, due to AnnData using by default float32, this can lead to wrong calculations of the mean. Because for large datasets, the sum of all cell measurements for a protein can be very large, too large for 32 bits. Therefore, I tried changing the anndata.X matrix from float32 to float64, and then running dsb normalization. This seemed to solve the problem, because after the change to float64, the means were around 2.31. Which makes sense since most values of empty cells are 0, and as I said log(10) = 2.3025.
This is especially problematic because the default data type in AnnData is float32. And standard deviations might be even more affected than the means.
So I propose that you coerce the anndata.X of the empty droplets into float64 before calculating the mean and std of empty droplets. This is as easy as adding, in line 144 of muon/prot/preproc.py, empty.X = empty.X.astype(np.float64).
I applied only step I of dsb normalization, without applying step II (without denoising and isotype controls).
To Reproduce
You can download public 10X data used for reproduction here: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3_nextgem
In the following code, I very slightly modified the muon.prot.pp.dsb() function in order to 1) store the means and stds in the AnnData object and 2) to be able to choose between mean subtracting and standardization (mean subtracting and dividing by std). Then, I tried running dsb normalization with float32 (default) and with float64.
Means with float32:
('mean_empty', array([2.1995733, 2.2005668, 2.1973336, 2.20299 , 2.2034488, 2.1984806,
2.1966617, 2.1966765, 2.1987805, 2.1967235, 2.202042 , 2.200193 ,
2.1957297, 2.2018113, 2.197363 , 2.196647 , 2.197452 , 2.2016175,
2.195974 , 2.196987 , 2.1979306, 2.1962392, 2.199787 , 2.1964517,
2.1975152, 2.1963153, 2.1968145, 2.203014 , 2.1961546, 2.1962936,
2.1961162, 2.1968863], dtype=float32)
Means with float64:
'mean_empty', array([2.30580925, 2.30669376, 2.30415037, 2.3086791 , 2.30932639,
2.30478346, 2.30370452, 2.30370842, 2.30499958, 2.30373097,
2.30787521, 2.30589283, 2.30259115, 2.30726847, 2.30418295,
2.30367811, 2.30425233, 2.30698558, 2.30301876, 2.30395364,
2.30448475, 2.3033102 , 2.3055071 , 2.30348507, 2.30424637,
2.30339363, 2.30380386, 2.30861361, 2.30324365, 2.30335848,
2.30319787, 2.30382639]))
Expected behaviour
I expect that means of protein measurements across cells are higher than 2.3026, which is the minimum value.
System
muon 0.1.3
numpy 1.23.5
The text was updated successfully, but these errors were encountered: