Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

10X multiome subsetting cancer cell line with blacklist_ratio ~ 1 for both hg38_blacklist and hg38_blacklist_unified #1741

Open
methornton opened this issue Jul 9, 2024 Discussed in #1740 · 4 comments

Comments

@methornton
Copy link

Discussed in #1740

Originally posted by methornton July 9, 2024
Hello! I am processing some cancer cell line 10X plus Mulitome data and I have kind of an outlier as far as the hg38_blacklist_ratio. First, I had to add the metadata from cellranger file 'per_barcode_metrics.csv' file. I used some info from this Q&A site for generating the blacklist_ratio.

# add blacklist ratio and fraction of reads in peaks
ap1$pct_reads_in_peaks <- ap1$atac_peak_region_fragments / ap1$atac_fragments * 100
ap1$blacklist_region_fragments <- FractionCountsInRegion(object = ap1, assay = 'ATAC', regions = blacklist_hg38)
ap1$blacklist_ratio <- ap1$blacklist_region_fragments / ap1$atac_peak_region_fragments

I was able to generate the other QC metrics and here are the images.
09Jul24_AP1_Vlnplot
09Jul24_AP1_Vlnplot_hg38_unified

Any information or advice is greatly appreciated. The nucleosome values are also fairly low. What does that really mean? To me all of the other values look decent. Did I mess up the blacklist ratio calculation?

TIA

@timoast
Copy link
Collaborator

timoast commented Jul 10, 2024

The FractionCountsInRegion function will return the fraction of counts in the blacklisted region, so you shouldn't divide this again by the total fragments in the cell. However, it still does not make sense that the values are centered around one. Can you show the complete code and output of sessionInfo()?

@methornton
Copy link
Author

Hello! Absolutely. Thank you!

library(Signac)
library(Seurat)
library(AnnotationHub)
library(BSgenome.Hsapiens.UCSC.hg38)
library(dplyr)
library(ggplot2)
library(patchwork)
library(reticulate)
use_python("/home/met/.conda/envs/r-reticulate/bin/python3")

# This will get the proper version of the Ensembl for your data
hub <- AnnotationHub()
query(hub, c("ensdb","homo sapiens","112"))
ensdb <- hub[["AH116860"]]

#set.seed(1234)

today <- Sys.Date()
dt <- format(today, format="%d%b%y")

# import the 10X hdf5 file with both data
ap1.10x <- Read10X_h5("/dataVol/data/Petrosyan/08Jul24/AP1/outs/filtered_feature_bc_matrix.h5")

# extract the RNA and ATAC data
ap1_rna_counts <- ap1.10x$`Gene Expression`
ap1_atac_counts <- ap1.10x$Peaks

ap1_metadata <- read.csv(file = "/dataVol/data/Petrosyan/08Jul24/AP1/outs/per_barcode_metrics.csv", sep = ",", header = TRUE, row.names = 1)

# Create seurat objects
ap1 <- CreateSeuratObject(counts = ap1_rna_counts, meta.data = ap1_metadata)

# Find percentage of mitochondrial genes
ap1[["percent.mt"]] <- PercentageFeatureSet(ap1, pattern = "^MT-")

ap1_grange.counts <- StringToGRanges(rownames(ap1_atac_counts), sep = c(":", "-"))
ap1_grange.use <- seqnames(ap1_grange.counts) %in% standardChromosomes(ap1_grange.counts)
ap1_atac_counts <- ap1_atac_counts[as.vector(ap1_grange.use), ]
ap1_annotations <- GetGRangesFromEnsDb(ensdb = ensdb)
seqlevelsStyle(ap1_annotations) <- 'UCSC'
genome(ap1_annotations) <- "hg38"

ap1_frag.file <- "/dataVol/data/Petrosyan/08Jul24/AP1/outs/atac_fragments.tsv.gz"
chrom_assay <- CreateChromatinAssay(
   counts = ap1_atac_counts,
   sep = c(":", "-"),
   genome = 'hg38',
   fragments = ap1_frag.file,
   min.cells = 10,
   annotation = ap1_annotations
 )
ap1[["ATAC"]] <- chrom_assay

DefaultAssay(ap1) <- "ATAC"

ap1 <- NucleosomeSignal(ap1)
ap1 <- TSSEnrichment(ap1, fast = FALSE)

# add blacklist ratio and fraction of reads in peaks
ap1$pct_reads_in_peaks <- ap1$atac_peak_region_fragments / ap1$atac_fragments * 100
ap1$blacklist_region_fragments <- FractionCountsInRegion(object = ap1, assay = 'ATAC', regions = blacklist_hg38)
ap1$blacklist_ratio <- ap1$blacklist_region_fragments / ap1$atac_peak_region_fragments

png(filename=paste(dt,"_AP1_TSSEnrichment_v_nCountATAC.png", sep=""), width = 6*300, height = 5*300, res = 300)
DensityScatter(ap1, x = 'nCount_ATAC', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
dev.off()

ap1$high.tss <- ifelse(ap1$TSS.enrichment > 3, 'High', 'Low')

png(filename=paste(dt,"_AP1_TSS_enrichment.png", sep=""), width = 12*300, height = 5*300, res = 300)
TSSPlot(ap1, group.by = 'high.tss') + NoLegend()
dev.off()

ap1$nucleosome_group <- ifelse(ap1$nucleosome_signal > 4, 'NS > 4', 'NS < 4')

png(filename=paste(dt,"_AP1_Nucleosome_enrichment.png", sep=""), width = 12*300, height = 5*300, res = 300)
FragmentHistogram(object = ap1, group.by = 'nucleosome_group')
dev.off()

## vln plot for the filtering
png(filename=paste(dt,"_AP1_Vlnplot.png", sep=""), width = 13*300, height = 10*300, res = 300)
VlnPlot(ap1, features = c("nCount_RNA", "nCount_ATAC", "TSS.enrichment", "nucleosome_signal", "blacklist_ratio", "pct_reads_in_peaks", "percent.mt"), ncol = 4, log = TRUE, pt.size = 0.1) + NoLegend()
dev.off()





> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /usr/local/lib/libopenblas_zenp-r0.3.18.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] reticulate_1.38.0                 patchwork_1.2.0                  
 [3] ggplot2_3.5.1                     dplyr_1.1.4                      
 [5] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.72.0                  
 [7] rtracklayer_1.64.0                BiocIO_1.14.0                    
 [9] Biostrings_2.72.1                 XVector_0.44.0                   
[11] GenomicRanges_1.56.1              GenomeInfoDb_1.40.1              
[13] IRanges_2.38.1                    S4Vectors_0.42.1                 
[15] AnnotationHub_3.12.0              BiocFileCache_2.12.0             
[17] dbplyr_2.5.0                      BiocGenerics_0.50.0              
[19] Seurat_5.1.0                      SeuratObject_5.0.2               
[21] sp_2.1-4                          Signac_1.13.0                    

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22            splines_4.4.1              
  [3] later_1.3.2                 bitops_1.0-7               
  [5] filelock_1.0.3              tibble_3.2.1               
  [7] polyclip_1.10-6             XML_3.99-0.17              
  [9] fastDummies_1.7.3           lifecycle_1.0.4            
 [11] globals_0.16.3              lattice_0.22-6             
 [13] MASS_7.3-61                 magrittr_2.0.3             
 [15] plotly_4.10.4               yaml_2.3.8                 
 [17] httpuv_1.6.15               sctransform_0.4.1          
 [19] spam_2.10-0                 spatstat.sparse_3.1-0      
 [21] cowplot_1.1.3               pbapply_1.7-2              
 [23] DBI_1.2.3                   RColorBrewer_1.1-3         
 [25] abind_1.4-5                 zlibbioc_1.50.0            
 [27] Rtsne_0.17                  purrr_1.0.2                
 [29] RCurl_1.98-1.14             rappdirs_0.3.3             
 [31] GenomeInfoDbData_1.2.12     ggrepel_0.9.5              
 [33] irlba_2.3.5.1               listenv_0.9.1              
 [35] spatstat.utils_3.0-5        goftest_1.2-3              
 [37] RSpectra_0.16-1             spatstat.random_3.2-3      
 [39] fitdistrplus_1.1-11         parallelly_1.37.1          
 [41] leiden_0.4.3.1              codetools_0.2-20           
 [43] DelayedArray_0.30.1         RcppRoll_0.3.0             
 [45] tidyselect_1.2.1            UCSC.utils_1.0.0           
 [47] matrixStats_1.3.0           spatstat.explore_3.2-7     
 [49] GenomicAlignments_1.40.0    jsonlite_1.8.8             
 [51] progressr_0.14.0            ggridges_0.5.6             
 [53] survival_3.7-0              tools_4.4.1                
 [55] ica_1.0-3                   Rcpp_1.0.12                
 [57] glue_1.7.0                  SparseArray_1.4.8          
 [59] gridExtra_2.3               MatrixGenerics_1.16.0      
 [61] withr_3.0.0                 BiocManager_1.30.23        
 [63] fastmap_1.2.0               fansi_1.0.6                
 [65] digest_0.6.36               R6_2.5.1                   
 [67] mime_0.12                   colorspace_2.1-0           
 [69] scattermore_1.2             tensor_1.5                 
 [71] spatstat.data_3.1-2         RSQLite_2.3.7              
 [73] utf8_1.2.4                  tidyr_1.3.1                
 [75] generics_0.1.3              data.table_1.15.4          
 [77] httr_1.4.7                  htmlwidgets_1.6.4          
 [79] S4Arrays_1.4.1              uwot_0.2.2                 
 [81] pkgconfig_2.0.3             gtable_0.3.5               
 [83] blob_1.2.4                  lmtest_0.9-40              
 [85] htmltools_0.5.8.1           dotCall64_1.1-1            
 [87] scales_1.3.0                Biobase_2.64.0             
 [89] png_0.1-8                   reshape2_1.4.4             
 [91] rjson_0.2.21                nlme_3.1-165               
 [93] curl_5.2.1                  zoo_1.8-12                 
 [95] cachem_1.1.0                stringr_1.5.1              
 [97] BiocVersion_3.19.1          KernSmooth_2.23-24         
 [99] parallel_4.4.1              miniUI_0.1.1.1             
[101] AnnotationDbi_1.66.0        restfulr_0.0.15            
[103] pillar_1.9.0                grid_4.4.1                 
[105] vctrs_0.6.5                 RANN_2.6.1                 
[107] promises_1.3.0              xtable_1.8-4               
[109] cluster_2.1.6               cli_3.6.3                  
[111] compiler_4.4.1              Rsamtools_2.20.0           
[113] rlang_1.1.4                 crayon_1.5.3               
[115] future.apply_1.11.2         plyr_1.8.9                 
[117] stringi_1.8.4               viridisLite_0.4.2          
[119] deldir_2.0-4                BiocParallel_1.38.0        
[121] munsell_0.5.1               lazyeval_0.2.2             
[123] spatstat.geom_3.2-9         Matrix_1.7-0               
[125] RcppHNSW_0.6.0              bit64_4.0.5                
[127] future_1.33.2               KEGGREST_1.44.1            
[129] shiny_1.8.1.1               SummarizedExperiment_1.34.0
[131] ROCR_1.0-11                 igraph_2.0.3               
[133] memoise_2.0.1               fastmatch_1.1-4            
[135] bit_4.0.5                  

@methornton
Copy link
Author

So I did notice that it is really really close to 1 and if I set the filtering blacklist_ration < 1.015. I still get most of the clusters that I would get without filtering.

@methornton
Copy link
Author

I also used the gencode genome fasta with cellranger-arc genomeGenerate. I had read somewhere that the differences beyond having a "chr" were that UCSC style begins on '1' and Ensembl has no "chr" and begins on '0'. Could that be causing the issue? I have another set with the gencode gtf and ensembl fa. I can check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants