Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem in buildNhoodGraph on a large dataset #290

Open
moinfar opened this issue Sep 1, 2023 · 8 comments
Open

Problem in buildNhoodGraph on a large dataset #290

moinfar opened this issue Sep 1, 2023 · 8 comments

Comments

@moinfar
Copy link

moinfar commented Sep 1, 2023

Describe the bug
Hi,
First, thanks for providing and maintenance of the package.
I am using Milo on a dataset of 588k cells. I am trying to run Milo in Jupyter environment using the rpy2 interface (like here).
Everything is fine when I subset the cells to 10% (59k cells). However, when I run it on the whole data with 588k cells, I face an error on the buildNhoodGraph function. It looks like an overflow problem, but I am not sure. Unfortunately, the data is too big for me to upload. But please let me know if I can provide anything else.

Minimum code example
Minimum example to reproduce the error

# python
import os
import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
import anndata as ad
import anndata2ri
import rpy2
from rpy2.robjects import r

sc.settings.set_figure_params(dpi=300, frameon=False)

anndata2ri.activate()
%load_ext rpy2.ipython
%%R
library(miloR)
library(igraph)
library(dplyr)
library(ggplot2)
library(scater)
library(patchwork)

MILO_K = 150
# python
oroginal_adata = sc.read("some/path.h5ad")
adata = oroginal_adata
# adata = sc.pp.subsample(oroginal_adata, fraction=0.1, copy=True)

sc.pp.neighbors(adata, n_neighbors=10, use_rep='latent')
sc.tl.umap(adata)

adata_no_knn = adata.copy()
adata_no_knn.obsp = None
adata_no_knn.uns.pop("neighbors")
adata_no_knn

knn_adjacency = adata.obsp["connectivities"]
%%R magic_args="-i adata_no_knn"
LATENT_DIM = ncol(reducedDim(adata_no_knn, 'latent'))
milo <- Milo(adata_no_knn)
milo
%%R magic_args="-i knn_adjacency"
milo_graph <- buildFromAdjacency(knn_adjacency, k=MILO_K, is.binary=TRUE)
graph(milo) <- miloR::graph(milo_graph)
milo <- buildGraph(milo, k=MILO_K, d=LATENT_DIM, reduced.dim='latent')
milo <- makeNhoods(milo, prop = 0.1, k = MILO_K, d=LATENT_DIM, refined = TRUE, reduced_dims='latent')
plotNhoodSizeHist(milo)
%%R
meta_data = data.frame(colData(milo))[,c("sample_id", "Site", "condition")]
milo <- countCells(milo, meta.data = meta_data, sample="sample_id")

## Calculate distances between cells in neighbourhoods
## for spatial FDR correction
milo <- calcNhoodDistance(milo, d=LATENT_DIM, reduced.dim='latent')

design_df <- distinct(meta_data)
rownames(design_df) <- design_df$sample_id
da_results <- testNhoods(milo, design = ~ Site + condition, design.df = design_df, reduced.dim='latent')
%%R
tryCatch({
    milo <- buildNhoodGraph(milo)
}, error=function(err) {
    print(err)
})

Full error traceback

Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In int2i(as.integer(i), n) : NAs introduced by coercion to integer range
Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed 

Session info
Output of sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS/LAPACK: /dss/dsshome1/03/di93zoz/miniconda3/envs/multigrate2/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    tools     stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] Matrix_1.5-4.1              patchwork_1.1.3            
 [3] scater_1.28.0               scuttle_1.10.2             
 [5] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
 [7] Biobase_2.60.0              GenomicRanges_1.52.0       
 [9] GenomeInfoDb_1.36.1         IRanges_2.34.1             
[11] S4Vectors_0.38.1            BiocGenerics_0.46.0        
[13] MatrixGenerics_1.12.3       matrixStats_1.0.0          
[15] ggplot2_3.4.3               dplyr_1.1.2                
[17] igraph_1.5.1                miloR_1.8.1                
[19] edgeR_3.42.4                limma_3.56.2               

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0          viridisLite_0.4.2        
 [3] vipor_0.4.5               farver_2.1.1             
 [5] viridis_0.6.4             bitops_1.0-7             
 [7] ggraph_2.1.0              RCurl_1.98-1.12          
 [9] tweenr_2.0.2              digest_0.6.33            
[11] rsvd_1.0.5                lifecycle_1.0.3          
[13] statmod_1.5.0             magrittr_2.0.3           
[15] compiler_4.3.1            rlang_1.1.1              
[17] utf8_1.2.3                labeling_0.4.2           
[19] S4Arrays_1.0.5            graphlayouts_1.0.0       
[21] DelayedArray_0.26.7       RColorBrewer_1.1-3       
[23] abind_1.4-5               BiocParallel_1.34.2      
[25] withr_2.5.0               purrr_1.0.2              
[27] grid_4.3.1                polyclip_1.10-4          
[29] fansi_1.0.4               beachmat_2.16.0          
[31] colorspace_2.1-0          scales_1.2.1             
[33] gtools_3.9.4              MASS_7.3-60              
[35] cli_3.6.1                 crayon_1.5.2             
[37] generics_0.1.3            DelayedMatrixStats_1.22.5
[39] ggbeeswarm_0.7.2          ggforce_0.4.1            
[41] stringr_1.5.0             splines_4.3.1            
[43] zlibbioc_1.46.0           parallel_4.3.1           
[45] XVector_0.40.0            vctrs_0.6.3              
[47] BiocSingular_1.16.0       BiocNeighbors_1.18.0     
[49] ggrepel_0.9.3             irlba_2.3.5.1            
[51] beeswarm_0.4.0            packrat_0.9.1            
[53] locfit_1.5-9.8            tidyr_1.3.0              
[55] glue_1.6.2                codetools_0.2-19         
[57] cowplot_1.1.1             stringi_1.7.12           
[59] gtable_0.3.4              ScaledMatrix_1.8.1       
[61] munsell_0.5.0             tibble_3.2.1             
[63] pillar_1.9.0              GenomeInfoDbData_1.2.10  
[65] R6_2.5.1                  sparseMatrixStats_1.12.2 
[67] tidygraph_1.2.3           lattice_0.21-8           
[69] Rcpp_1.0.11               gridExtra_2.3            
[71] pkgconfig_2.0.3          

P.S
Some minor differences between 10% and 100%:

  1. Computing Milo on 100% of the data takes so long (maybe 10 hours) to reach the mentioned line.
  2. I get this warning when subsetting to 10% of the data on da_results <- testNhoods(...):
In addition: Warning message:
In testNhoods(milo, design = ~Site + condition, design.df = design_df,  :
  Sample names in design matrix and nhood counts are not matched. Reordering
@MikeDMorgan
Copy link
Member

Hi @moinfar - two things to start with: 1) you don't need to run the nhood refinement with 100% of the data - that is entirely redundant and a sure-fire way to kill all of your statistical power, 2) use the graph-based nhood refinement and spatial FDR correction - it's orders of magnitude faster.

Finally, there is a milopy implementation maintained by @emdann that you may find useful: https://github.com/emdann/milopy

@moinfar
Copy link
Author

moinfar commented Sep 18, 2023

Hi @MikeDMorgan,

Thanks for your reply.

  1. Sorry, I think I am missing something. As far as I understood, the nhood refinement proportion is given as a parameter in the makeNhoods function. I use the following code in both mentioned runs:
milo <- makeNhoods(milo, prop = 0.1, k = MILO_K, d=LATENT_DIM, refined = TRUE, reduced_dims='latent')

The difference between the two runs is that I subset the whole data (initial adata) to 10% in the working one, while I use the whole data in the other. The latter results in the given error. Am I missing something?

  1. Thanks. Can you please explain this in more detail? Or some links to the documentation. For graph-based nhood refinement, I think I can pass refinement_scheme="graph". Right? But, I am not sure if spatial FDR correction is something other than what I do now.

Thanks for mentioning the milopy implementation. I will take a look at it.
I would also appreciate it if you could help me with the original error I faced when using the main (588k) data.

@MikeDMorgan
Copy link
Member

Hi @moinfar I see - sorry from your post I inferred you set p=1.0 - the original nhood refinement time complexity for the jth nhood is $O(n^2_j)$ so scales ~quadratically with the number of cells. The newer algorithms scale much better and run in a fraction of the time. The most up-to-date version on Bioconductor has the relevant function documentation. The manuscript is currently in the works.

I haven't seen that particular error before - it suggests there are NA of Inf values in your nhoods() matrix. Could you check that the scanpy adjacency matrix is binary.

@ManarHashemTaha
Copy link

Did you solve the error @moinfar ?

@moinfar
Copy link
Author

moinfar commented Apr 25, 2024

@ManarHashemTaha
Unfortunately not.
I used the newest algorithm, but the problem still persisted.
I checked the scanpy neighborhood graph, and it was OK (no problem in scanpy results).

@Xiaohan0416
Copy link

Oh no, I have the same issue here. My data includes 221k cells

@BokaiZhu
Copy link

Hi all,

Has anyone figured out a solution here? Looks like for large datasets (mine ~ 600k cells), everything can work until comes to any function that involves the calculation of nhood adj matrix

nh_intersect_mat <- .build_nhood_adjacency(nhoods(x), overlap=overlap).

Eg. when running functions like buildNhoodGraph or findNhoodMarkers would get

Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In int2i(as.integer(i), n) : NAs introduced by coercion to integer range
Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed

It looks like a size issue e.g. once subset cells to 20k batches everything is good.

Thanks all!

@MikeDMorgan
Copy link
Member

First, can you post your sessionInfo whenever you report an apparent bug or error.

Second, if this persists when using the latest version of Milo from Bioconductor, can you please create a minimally reproducible example, and I will investigate further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants