Following integration of multiple samples, SVD producing all NA cell embeddings #1786

gherrgo · 2024-09-25T18:36:27Z

Hi guys, looking for some help on an issue I cannot seem to solve. I have been following the integration pipeline (https://stuartlab.org/signac/articles/merging) for a set of 19 samples, and am having some issues when it comes to processing data following a merge. My individual sample pipeline looks something like this:

Reading in peaks

H9A <- Read10X_h5("09A_counts/outs/filtered_peak_bc_matrix.h5", use.names = T)
peaks_H9A <- read.table(
file = "09A_counts/outs/peaks.bed",
col.names = c("chr", "start", "end"))
gr.H9A <- makeGRangesFromDataFrame(peaks_H9A)

(defining combined_peaks using all objects)

Loading metadata

md.H9A <- read.table(
file = "09A_counts/outs/singlecell.csv",
stringsAsFactors = FALSE,
sep = ",",
header = TRUE,
row.names = 1
)[-1, ] # remove the first row

Initial filtering of low-count cells

plot(md.H9A$passed_filters)
md.H9A <- md.H9A[md.H9A$passed_filters > 100, ]

create fragment objects

frags.H9A <- CreateFragmentObject(
path = "09A_counts/outs/fragments.tsv.gz",
cells = rownames(md.H9A)
)

Quantifying peaks

H9A.counts <- FeatureMatrix(
fragments = frags.H9A,
features = combined.peaks,
cells = rownames(md.H9A)
)

Creating objects

H9A_assay <- CreateChromatinAssay(H9A.counts, fragments = frags.H9A)
s.H9A <- CreateSeuratObject(H9A_assay, assay = "ATAC", meta.data=md.H9A, project="H9A")
ncol(s.H9A) #2119
Annotation(s.H9A) <- annotation

Quality control - per sample

grange <- StringToGRanges(rownames(s.H9A), sep = c(":", "-"))
grange.use <- seqnames(grange) %in% standardChromosomes(grange)
s.H9A <- s.H9A[as.vector(grange.use), ]
s.H9A <- NucleosomeSignal(object=s.H9A) #2119 cell barcodes
s.H9A <- TSSEnrichment(object=s.H9A, fast=FALSE)

s.H9A$pct_reads_in_peaks <- [email protected]$peak_region_fragments / [email protected]$passed_filters * 100
s.H9A$blacklist_ratio <- [email protected]$blacklist_region_fragments / [email protected]$peak_region_fragments
s.H9A$high.tss <- ifelse(s.H9A$TSS.enrichment > 1, 'High', 'Low')
s.H9A$nucleosome_group <- ifelse(s.H9A$nucleosome_signal > 2, 'NS > 2', 'NS < 2')
s.H9A <- subset(
x = s.H9A,
subset = peak_region_fragments < 250 &
pct_reads_in_peaks > 0.1 &
pct_reads_in_peaks < 2 &
nucleosome_signal < 1.5 &
nucleosome_signal > 0.1 &
TSS.enrichment < 5
)
ncol(s.H9A) #1638 cells

Then merging all samples into an object called total_ATAC and running:
total_ATAC <- RunTFIDF(total_ATAC)
total_ATAC <- FindTopFeatures(total_ATAC, min.cutoff = 20)
total_ATAC[['ATAC']]
total_ATAC <- RunSVD(total_ATAC, n = 50)
total_ATAC <- RunUMAP(total_ATAC, dims = 2:50, reduction = 'lsi')

However, when I run 'RunUMAP' I receive the error:
14:33:10 UMAP embedding parameters a = 0.9922 b = 1.112
Error in checkna(X) : Missing values found in 'X'

Which is due to the cell embeddings from LSI being all NA valued:

head(total_ATAC@reductions$[email protected], 3)
LSI_1 LSI_2 LSI_3 LSI_4 LSI_5 LSI_6 LSI_7 LSI_8 LSI_9 LSI_10 LSI_11 LSI_12 LSI_13
H9A_AAACGGTTCCACTATG-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
H9A_AAAGCGAAGCAAACAC-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
H9A_AAAGGCAAGCTTGCTA-1 NaN NaN NaN NaN NaN NaN NaN NaN

This issue has only persisted across merged data, as I can run the SVD and UMAP on individual samples with no problem. I cannot seem to figure out how to troubleshoot this, tried many different things, all to no solution.
If you have any ideas, it would be much appreciated.
Thanks a bunch.

Here's my sessioninfo():
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0

Random number generation:
RNG: L'Ecuyer-CMRG
Normal: Inversion
Sample: Rejection

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.28.1 AnnotationFilter_1.28.0
[4] GenomicFeatures_1.56.0 SingleCellExperiment_1.26.0 rhdf5_2.48.0
[7] Rcpp_1.0.13 data.table_1.16.0 plyr_1.8.9
[10] magrittr_2.0.3 gtable_0.3.5 gtools_3.9.5
[13] gridExtra_2.3 ArchR_1.0.2 Signac_1.14.0
[16] SCP_0.5.6 stringr_1.5.1 reprex_2.1.1
[19] cowplot_1.1.3 SCINA_1.2.0 gplots_3.1.3.1
[22] MASS_7.3-61 infercnv_1.20.0 patchwork_1.3.0
[25] Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-4
[28] Matrix_1.6-5 gprofiler2_0.2.3 scales_1.3.0
[31] NMF_0.28 cluster_2.1.6 rngtools_1.5.2
[34] registry_0.5-1 DOSE_3.28.2 org.Hs.eg.db_3.18.0
[37] AnnotationDbi_1.66.0 clusterProfiler_4.10.1 readxl_1.4.3
[40] rtracklayer_1.64.0 ggpubr_0.6.0 reshape2_1.4.4
[43] tidyr_1.3.1 dplyr_1.1.4 DESeq2_1.44.0
[46] SummarizedExperiment_1.34.0 Biobase_2.64.0 MatrixGenerics_1.16.0
[49] matrixStats_1.4.1 GenomicRanges_1.56.1 GenomeInfoDb_1.40.1
[52] IRanges_2.38.1 S4Vectors_0.42.1 BiocGenerics_0.50.0
[55] tibble_3.2.1 DescTools_0.99.56 caret_6.0-94
[58] lattice_0.22-6 ggrepel_0.9.6 ggplot2_3.5.1
[61] pROC_1.18.5 doMC_1.3.8 iterators_1.0.14
[64] foreach_1.5.2

loaded via a [1] igraph_2.0.3 [5] tidyselect_1.2.1 [9] rjson_0.2.23 [13] cli_3.6.3 [17] BiocIO_1.14.0 [21] shadowtext_0.1.4 [25] leiden_0.4.3.1 [29] backports_1.5.0 [33] XML_3.99-0.17 [37] rappdirs_0.3.3 [41] ggraph_2.2.1 [45] HDF5Array_1.32.1 [49] lmtest_0.9-40 [53] BiocManager_1.30.25 [57] princurve_2.1.6 [61] lmom_3.0 [65] UCSC.utils_1.0.0 [69] timeDate_4032.109 [73] irlba_2.3.5.1 [77] yaml_2.3.10 [81] crayon_1.5.3 [85] tweenr_2.0.3 [89] GlobalOptions_0.1.2 [93] limma_3.60.4 [97] xml2_1.3.6 [101] spatstat.sparse_3.1-0 [105] xtable_1.8-4 [109] tools_4.4.1 [113] nlme_3.1-166 [117] dbplyr_2.5.0 [121] TrajectoryUtils_1.10.1 [125] rpart_4.1.23 [129] polyclip_1.10-7 [133] mvtnorm_1.3-1 [137] RcppHNSW_0.6.0 [141] gson_0.1.0 [145] graphlayouts_1.1.1 [149] GenomeInfoDbData_1.2.12 [153] memoise_2.0.1 [157] RANN_2.6.2 [161] spatstat.utils_3.1-0 [165] colorspace_2.1-1 [169] ggforce_0.4.2 [173] TH.data_1.1-2 [177] abind_1.4-8 [181] Rhdf5lib_1.26.0 [185] scatterpie_0.2.4 [189] fgsea_1.28.0 [193] compiler_4.4.1 [197] listenv_0.9.1 [201] BiocParallel_1.38.0 [205] multcomp_1.4-26 [209] rsvd_1.0.5 [213] deldir_2.0-4 [217] spatstat.explore_3.3-2 [221] spatstat.geom_3.3-3 [225] locfit_1.5-9.10 [229] GetoptLong_1.0.5 namespace (and not attached):
ica_1.0-3 plotly_4.10.4 zlibbioc_1.50.0
bit_4.0.5 doParallel_1.0.17 clue_0.3-65
blob_1.2.4 S4Arrays_1.4.1 png_0.1-8
ggplotify_0.1.2 ProtGenerics_1.36.0 goftest_1.2-3
SCpubr_2.0.2 purrr_1.0.2 uwot_0.2.2
curl_5.2.2 mime_0.12 tidytree_0.4.6
coin_1.4-3 ComplexHeatmap_2.18.0 stringi_1.8.4
rjags_4-16 slingshot_2.10.0 parallelDist_0.2.6
Exact_3.3 lubridate_1.9.3 httpuv_1.6.15
splines_4.4.1 RcppRoll_0.3.1 prodlim_2024.06.25
sctransform_0.4.1 rootSolve_1.8.2.4 DBI_1.2.3
withr_3.0.1 class_7.3-22 enrichplot_1.22.0
ggnewscale_0.5.0 tidygraph_1.3.1 formatR_1.14
htmlwidgets_1.6.4 fs_1.6.4 biomaRt_2.58.2
labeling_0.4.3 SparseArray_1.4.8 cellranger_1.1.0
reticulate_1.39.0 zoo_1.8-12 XVector_0.44.0
timechange_0.3.0 fansi_1.0.6 caTools_1.18.3
ggtree_3.10.1 R.oo_1.26.0 RSpectra_0.16-2
fastDummies_1.7.4 gridGraphics_0.5-1 lazyeval_0.2.2
phyclust_0.1-34 survival_3.7-0 scattermore_1.2
RcppAnnoy_0.0.22 RColorBrewer_1.1-3 progressr_0.14.0
later_1.3.2 ggridges_0.5.6 codetools_0.2-20
KEGGREST_1.44.1 Rtsne_0.17 shape_1.4.6.1
Rsamtools_2.20.0 filelock_1.0.3 pkgconfig_2.0.3
spatstat.univar_3.0-1 GenomicAlignments_1.40.0 aplot_0.2.3
ape_5.8 viridisLite_0.4.2 gridBase_0.4-7
car_3.1-2 fastcluster_1.2.6 httr_1.4.7
globals_0.16.3 hardhat_1.4.0 broom_1.0.6
futile.logger_1.4.3 lambda.r_1.2.4 HDO.db_0.99.1
assertthat_0.2.1 digest_0.6.37 farver_2.1.2
ModelMetrics_1.2.2.2 yulab.utils_0.1.7 viridis_0.6.5
glue_1.7.0 cachem_1.1.0 BiocFileCache_2.10.2
proxyC_0.4.1 generics_0.1.3 Biostrings_2.72.1
parallelly_1.38.0 statmod_1.5.0 R.cache_0.16.0
carData_3.0-5 pbapply_1.7-2 spam_2.10-0
utf8_1.2.4 gower_1.0.1 SeuratWrappers_0.3.0
ggsignif_0.6.4 shiny_1.9.1 lava_1.8.0
R.utils_2.12.3 rhdf5filters_1.16.0 RCurl_1.98-1.16
R.methodsS3_1.8.2 gld_2.6.6 future_1.34.0
Cairo_1.6-2 spatstat.data_3.1-2 rstudioapi_0.16.0
hms_1.1.3 fitdistrplus_1.2-1 munsell_0.5.1
rlang_1.1.4 ipred_0.9-15 dotCall64_1.1-1
circlize_0.4.16 coda_0.19-4.1 e1071_1.7-16
remotes_2.5.0 recipes_1.1.0 modeltools_0.2-23
GOSemSim_2.28.1 libcoin_1.0-10 treeio_1.26.0
futile.options_1.0.1 bitops_1.0-8 promises_1.3.0
RSQLite_2.3.7 qvalue_2.34.0 sandwich_3.1-1
DelayedArray_0.30.1 proxy_0.4-27 GO.db_3.18.0
forcats_1.0.0 prettyunits_1.2.0 boot_1.3-31
edgeR_4.2.1 tensor_1.5 progress_1.2.3
spatstat.random_3.3-2 R6_2.5.1 fastmap_1.2.0
fastmatch_1.1-4 rstatix_0.7.2 ROCR_1.0-11
nnet_7.3-19 KernSmooth_2.23-24 miniUI_0.1.1.1
htmltools_0.5.8.1 RcppParallel_5.1.9 bit64_4.0.5
lifecycle_1.0.4 restfulr_0.0.15 vctrs_0.6.5
ggfun_0.1.6 future.apply_1.11.2 pillar_1.9.0
jsonlite_1.8.8 expm_1.0-0 argparse_2.2.3

gherrgo · 2024-09-26T13:13:54Z

I should add, I attempted the example workflow for merging data and I think I may have isolated the issue. When running RunSVD(combined) I receive the error:
Warning in irlba(A = t(x = object), nv = n, work = irlba.work, tol = tol) :
did not converge--results might be invalid!; try increasing work or maxit
Scaling cell embeddings

And then receive all NA's for cell embeddings across all of the example cells:

combined@reductions$[email protected]
LSI_1 LSI_2 LSI_3 LSI_4 LSI_5 LSI_6 LSI_7 LSI_8 LSI_9 LSI_10 LSI_11 LSI_12 LSI_13
500_AAACTGCAGACTCGGA-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
500_AAAGATGCACCTATTT-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
500_AAAGATGCAGATACAA-1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Could this be an error forced by irlba ? How should I circumvent?
Thanks.

gherrgo added the bug Something isn't working label Sep 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Following integration of multiple samples, SVD producing all NA cell embeddings #1786

Following integration of multiple samples, SVD producing all NA cell embeddings #1786

gherrgo commented Sep 25, 2024

gherrgo commented Sep 26, 2024

Following integration of multiple samples, SVD producing all NA cell embeddings #1786

Following integration of multiple samples, SVD producing all NA cell embeddings #1786

Comments

gherrgo commented Sep 25, 2024

Reading in peaks

Loading metadata

Initial filtering of low-count cells

create fragment objects

Quantifying peaks

Creating objects

Quality control - per sample

gherrgo commented Sep 26, 2024