10X multiome subsetting cancer cell line with blacklist_ratio ~ 1 for both hg38_blacklist and hg38_blacklist_unified #1741

methornton · 2024-07-09T22:13:12Z

Discussed in #1740

^{Originally posted by methornton July 9, 2024}
Hello! I am processing some cancer cell line 10X plus Mulitome data and I have kind of an outlier as far as the hg38_blacklist_ratio. First, I had to add the metadata from cellranger file 'per_barcode_metrics.csv' file. I used some info from this Q&A site for generating the blacklist_ratio.

# add blacklist ratio and fraction of reads in peaks
ap1$pct_reads_in_peaks <- ap1$atac_peak_region_fragments / ap1$atac_fragments * 100
ap1$blacklist_region_fragments <- FractionCountsInRegion(object = ap1, assay = 'ATAC', regions = blacklist_hg38)
ap1$blacklist_ratio <- ap1$blacklist_region_fragments / ap1$atac_peak_region_fragments

I was able to generate the other QC metrics and here are the images.

Any information or advice is greatly appreciated. The nucleosome values are also fairly low. What does that really mean? To me all of the other values look decent. Did I mess up the blacklist ratio calculation?

TIA

The text was updated successfully, but these errors were encountered:

timoast · 2024-07-10T03:11:37Z

The FractionCountsInRegion function will return the fraction of counts in the blacklisted region, so you shouldn't divide this again by the total fragments in the cell. However, it still does not make sense that the values are centered around one. Can you show the complete code and output of sessionInfo()?

methornton · 2024-07-10T03:24:44Z

Hello! Absolutely. Thank you!

library(Signac)
library(Seurat)
library(AnnotationHub)
library(BSgenome.Hsapiens.UCSC.hg38)
library(dplyr)
library(ggplot2)
library(patchwork)
library(reticulate)
use_python("/home/met/.conda/envs/r-reticulate/bin/python3")

# This will get the proper version of the Ensembl for your data
hub <- AnnotationHub()
query(hub, c("ensdb","homo sapiens","112"))
ensdb <- hub[["AH116860"]]

#set.seed(1234)

today <- Sys.Date()
dt <- format(today, format="%d%b%y")

# import the 10X hdf5 file with both data
ap1.10x <- Read10X_h5("/dataVol/data/Petrosyan/08Jul24/AP1/outs/filtered_feature_bc_matrix.h5")

# extract the RNA and ATAC data
ap1_rna_counts <- ap1.10x$`Gene Expression`
ap1_atac_counts <- ap1.10x$Peaks

ap1_metadata <- read.csv(file = "/dataVol/data/Petrosyan/08Jul24/AP1/outs/per_barcode_metrics.csv", sep = ",", header = TRUE, row.names = 1)

# Create seurat objects
ap1 <- CreateSeuratObject(counts = ap1_rna_counts, meta.data = ap1_metadata)

# Find percentage of mitochondrial genes
ap1[["percent.mt"]] <- PercentageFeatureSet(ap1, pattern = "^MT-")

ap1_grange.counts <- StringToGRanges(rownames(ap1_atac_counts), sep = c(":", "-"))
ap1_grange.use <- seqnames(ap1_grange.counts) %in% standardChromosomes(ap1_grange.counts)
ap1_atac_counts <- ap1_atac_counts[as.vector(ap1_grange.use), ]
ap1_annotations <- GetGRangesFromEnsDb(ensdb = ensdb)
seqlevelsStyle(ap1_annotations) <- 'UCSC'
genome(ap1_annotations) <- "hg38"

ap1_frag.file <- "/dataVol/data/Petrosyan/08Jul24/AP1/outs/atac_fragments.tsv.gz"
chrom_assay <- CreateChromatinAssay(
   counts = ap1_atac_counts,
   sep = c(":", "-"),
   genome = 'hg38',
   fragments = ap1_frag.file,
   min.cells = 10,
   annotation = ap1_annotations
 )
ap1[["ATAC"]] <- chrom_assay

DefaultAssay(ap1) <- "ATAC"

ap1 <- NucleosomeSignal(ap1)
ap1 <- TSSEnrichment(ap1, fast = FALSE)

# add blacklist ratio and fraction of reads in peaks
ap1$pct_reads_in_peaks <- ap1$atac_peak_region_fragments / ap1$atac_fragments * 100
ap1$blacklist_region_fragments <- FractionCountsInRegion(object = ap1, assay = 'ATAC', regions = blacklist_hg38)
ap1$blacklist_ratio <- ap1$blacklist_region_fragments / ap1$atac_peak_region_fragments

png(filename=paste(dt,"_AP1_TSSEnrichment_v_nCountATAC.png", sep=""), width = 6*300, height = 5*300, res = 300)
DensityScatter(ap1, x = 'nCount_ATAC', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
dev.off()

ap1$high.tss <- ifelse(ap1$TSS.enrichment > 3, 'High', 'Low')

png(filename=paste(dt,"_AP1_TSS_enrichment.png", sep=""), width = 12*300, height = 5*300, res = 300)
TSSPlot(ap1, group.by = 'high.tss') + NoLegend()
dev.off()

ap1$nucleosome_group <- ifelse(ap1$nucleosome_signal > 4, 'NS > 4', 'NS < 4')

png(filename=paste(dt,"_AP1_Nucleosome_enrichment.png", sep=""), width = 12*300, height = 5*300, res = 300)
FragmentHistogram(object = ap1, group.by = 'nucleosome_group')
dev.off()

## vln plot for the filtering
png(filename=paste(dt,"_AP1_Vlnplot.png", sep=""), width = 13*300, height = 10*300, res = 300)
VlnPlot(ap1, features = c("nCount_RNA", "nCount_ATAC", "TSS.enrichment", "nucleosome_signal", "blacklist_ratio", "pct_reads_in_peaks", "percent.mt"), ncol = 4, log = TRUE, pt.size = 0.1) + NoLegend()
dev.off()

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /usr/local/lib/libopenblas_zenp-r0.3.18.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] reticulate_1.38.0                 patchwork_1.2.0                  
 [3] ggplot2_3.5.1                     dplyr_1.1.4                      
 [5] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.72.0                  
 [7] rtracklayer_1.64.0                BiocIO_1.14.0                    
 [9] Biostrings_2.72.1                 XVector_0.44.0                   
[11] GenomicRanges_1.56.1              GenomeInfoDb_1.40.1              
[13] IRanges_2.38.1                    S4Vectors_0.42.1                 
[15] AnnotationHub_3.12.0              BiocFileCache_2.12.0             
[17] dbplyr_2.5.0                      BiocGenerics_0.50.0              
[19] Seurat_5.1.0                      SeuratObject_5.0.2               
[21] sp_2.1-4                          Signac_1.13.0                    

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22            splines_4.4.1              
  [3] later_1.3.2                 bitops_1.0-7               
  [5] filelock_1.0.3              tibble_3.2.1               
  [7] polyclip_1.10-6             XML_3.99-0.17              
  [9] fastDummies_1.7.3           lifecycle_1.0.4            
 [11] globals_0.16.3              lattice_0.22-6             
 [13] MASS_7.3-61                 magrittr_2.0.3             
 [15] plotly_4.10.4               yaml_2.3.8                 
 [17] httpuv_1.6.15               sctransform_0.4.1          
 [19] spam_2.10-0                 spatstat.sparse_3.1-0      
 [21] cowplot_1.1.3               pbapply_1.7-2              
 [23] DBI_1.2.3                   RColorBrewer_1.1-3         
 [25] abind_1.4-5                 zlibbioc_1.50.0            
 [27] Rtsne_0.17                  purrr_1.0.2                
 [29] RCurl_1.98-1.14             rappdirs_0.3.3             
 [31] GenomeInfoDbData_1.2.12     ggrepel_0.9.5              
 [33] irlba_2.3.5.1               listenv_0.9.1              
 [35] spatstat.utils_3.0-5        goftest_1.2-3              
 [37] RSpectra_0.16-1             spatstat.random_3.2-3      
 [39] fitdistrplus_1.1-11         parallelly_1.37.1          
 [41] leiden_0.4.3.1              codetools_0.2-20           
 [43] DelayedArray_0.30.1         RcppRoll_0.3.0             
 [45] tidyselect_1.2.1            UCSC.utils_1.0.0           
 [47] matrixStats_1.3.0           spatstat.explore_3.2-7     
 [49] GenomicAlignments_1.40.0    jsonlite_1.8.8             
 [51] progressr_0.14.0            ggridges_0.5.6             
 [53] survival_3.7-0              tools_4.4.1                
 [55] ica_1.0-3                   Rcpp_1.0.12                
 [57] glue_1.7.0                  SparseArray_1.4.8          
 [59] gridExtra_2.3               MatrixGenerics_1.16.0      
 [61] withr_3.0.0                 BiocManager_1.30.23        
 [63] fastmap_1.2.0               fansi_1.0.6                
 [65] digest_0.6.36               R6_2.5.1                   
 [67] mime_0.12                   colorspace_2.1-0           
 [69] scattermore_1.2             tensor_1.5                 
 [71] spatstat.data_3.1-2         RSQLite_2.3.7              
 [73] utf8_1.2.4                  tidyr_1.3.1                
 [75] generics_0.1.3              data.table_1.15.4          
 [77] httr_1.4.7                  htmlwidgets_1.6.4          
 [79] S4Arrays_1.4.1              uwot_0.2.2                 
 [81] pkgconfig_2.0.3             gtable_0.3.5               
 [83] blob_1.2.4                  lmtest_0.9-40              
 [85] htmltools_0.5.8.1           dotCall64_1.1-1            
 [87] scales_1.3.0                Biobase_2.64.0             
 [89] png_0.1-8                   reshape2_1.4.4             
 [91] rjson_0.2.21                nlme_3.1-165               
 [93] curl_5.2.1                  zoo_1.8-12                 
 [95] cachem_1.1.0                stringr_1.5.1              
 [97] BiocVersion_3.19.1          KernSmooth_2.23-24         
 [99] parallel_4.4.1              miniUI_0.1.1.1             
[101] AnnotationDbi_1.66.0        restfulr_0.0.15            
[103] pillar_1.9.0                grid_4.4.1                 
[105] vctrs_0.6.5                 RANN_2.6.1                 
[107] promises_1.3.0              xtable_1.8-4               
[109] cluster_2.1.6               cli_3.6.3                  
[111] compiler_4.4.1              Rsamtools_2.20.0           
[113] rlang_1.1.4                 crayon_1.5.3               
[115] future.apply_1.11.2         plyr_1.8.9                 
[117] stringi_1.8.4               viridisLite_0.4.2          
[119] deldir_2.0-4                BiocParallel_1.38.0        
[121] munsell_0.5.1               lazyeval_0.2.2             
[123] spatstat.geom_3.2-9         Matrix_1.7-0               
[125] RcppHNSW_0.6.0              bit64_4.0.5                
[127] future_1.33.2               KEGGREST_1.44.1            
[129] shiny_1.8.1.1               SummarizedExperiment_1.34.0
[131] ROCR_1.0-11                 igraph_2.0.3               
[133] memoise_2.0.1               fastmatch_1.1-4            
[135] bit_4.0.5

methornton · 2024-07-10T03:29:18Z

So I did notice that it is really really close to 1 and if I set the filtering blacklist_ration < 1.015. I still get most of the clusters that I would get without filtering.

methornton · 2024-07-10T16:00:01Z

I also used the gencode genome fasta with cellranger-arc genomeGenerate. I had read somewhere that the differences beyond having a "chr" were that UCSC style begins on '1' and Ensembl has no "chr" and begins on '0'. Could that be causing the issue? I have another set with the gencode gtf and ensembl fa. I can check.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

10X multiome subsetting cancer cell line with blacklist_ratio ~ 1 for both hg38_blacklist and hg38_blacklist_unified #1741

10X multiome subsetting cancer cell line with blacklist_ratio ~ 1 for both hg38_blacklist and hg38_blacklist_unified #1741

methornton commented Jul 9, 2024

timoast commented Jul 10, 2024

methornton commented Jul 10, 2024

methornton commented Jul 10, 2024

methornton commented Jul 10, 2024

10X multiome subsetting cancer cell line with blacklist_ratio ~ 1 for both hg38_blacklist and hg38_blacklist_unified #1741

10X multiome subsetting cancer cell line with blacklist_ratio ~ 1 for both hg38_blacklist and hg38_blacklist_unified #1741

Comments

methornton commented Jul 9, 2024

Discussed in #1740

timoast commented Jul 10, 2024

methornton commented Jul 10, 2024

methornton commented Jul 10, 2024

methornton commented Jul 10, 2024