Only Identifies Global CNVs for scATAC data #25
Hi, the karyogram you have is of only one cell. Are you using a fragment file as input or a bam file? If it is a bam file, then there should be one file per cell. Best,
Thank you for the quick response! You're right - that is one cell. I was using a fragments.tsv.gz file, and I don't know why, but epiAneufinder wasn't recognizing the cell ID column in it. When I manually subset the fragment file to match the exact column names from the tutorial, it worked - I think it was looking for the cell barcode in the 4th column, and since it wasn't there, it was treating all the cells as one big cell. Is there a way to tell it which column names are expected for cell barcodes, or something like that? Otherwise, I'll just need to manually edit each file. Thank you again! It's neat to see it working.
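The manual edit described above can also be scripted. A minimal sketch, assuming the barcode ended up in column 5 of a five-column fragments file (the file names and original column order here are hypothetical, not taken from the thread):

```r
# Sketch only: assumes a plain-text fragments file whose columns are
# chrom / start / end / readname / barcode, i.e. the barcode sits in
# column 5 instead of the expected column 4. File names are hypothetical.
frags <- read.table("fragments.tsv", sep = "\t", stringsAsFactors = FALSE)
frags <- frags[, c(1, 2, 3, 5, 4)]  # move the barcode into column 4
write.table(frags, "fragments.reordered.tsv", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

For a gzipped file, `read.table` accepts a `gzfile()` connection, so no manual decompression is needed.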
Hi, I am glad that it worked out. Best,
We have our own scATAC-seq processing pipeline, which starts from 10X output with additional steps. That being said, I've never had an issue with ArchR; it's possible that it relies on column names instead of column positions when parsing these files. I am, however, having issues with the program getting stuck on some files: it will run overnight, stuck at the GC bias step, and I'm not sure why.
Hi, the GC bias correction is probably the most computationally intensive step. Best,
Ah ok - I was seeing that it was taking forever, more than half a day on 10 cores. I'll try upping the resources and the window size, so it has less to iterate over.
I tried running it over the weekend with 45 cores - and it's still stuck on GC correction. I've checked CPU usage, and it's been using all 45 cores for the last two days without completing even one file.
epiAneufinder(input= paste('FragmentFiles/',gsub(".*ArrowFiles/|.arrow", "", frag),'.tsv',sep=''),
Output:
I can understand; I am also running some samples and I am on day number 4. Best,
That would be great, thank you! I tried to rewrite part of it from mclapply to parLapply, to help eliminate some memory leak issues when highly parallelizing (and improve runtime), but that hasn't helped unfortunately.
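For anyone attempting the same rewrite, the shape of the change is roughly this. This is a sketch, not epiAneufinder's actual code; the workload and exported objects are placeholders:

```r
library(parallel)

# mclapply() forks the main process, so long runs can accumulate
# copied memory in each forked worker. A PSOCK cluster with
# parLapply() uses fresh worker processes instead.
cl <- makeCluster(4)                          # e.g. ncores = 4
# clusterExport(cl, c("objects_workers_need"))  # export data if required
res <- parLapply(cl, 1:8, function(i) i^2)    # placeholder workload
stopCluster(cl)
```

Note that PSOCK workers start with empty environments, so any data or packages the worker function uses must be exported with `clusterExport()` / `clusterEvalQ()`, which can itself add serialization overhead for large objects.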
Thank you for that. I think it could make sense to skip the correction, depending on the variance in GC content across the bins. If the binsize is large enough, I would imagine the GC content variance would be minimal. One alternative is switching from loess to lowess for the correction, which is known to be much faster over larger datasets (tens of thousands of points). Have you considered that implementation? I may try it out on my end and see if it speeds things up. I have close to 100 samples I'd like to process, and I can't wait 4 days per sample.
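To make the suggestion concrete: both fits live in base R's stats package, and the substitution looks roughly like this. The per-bin GC values and counts below are synthetic, purely for illustration:

```r
# Synthetic per-bin GC content and counts, purely for illustration.
set.seed(1)
n <- 20000
gc <- runif(n, 0.3, 0.7)
counts <- 100 + 50 * gc + rnorm(n, sd = 5)

# lowess() uses Cleveland's fast algorithm and returns a smoothed curve
# as sorted (x, y) pairs:
fit_lw <- lowess(gc, counts, f = 0.1)

# loess() fits a full local-regression model; at this n it is typically
# much slower, and fitted values come via predict():
fit_lo <- loess(counts ~ gc, span = 0.1)

# Correcting counts against the lowess trend (interpolated back onto
# the original, unsorted GC values):
trend <- approx(fit_lw$x, fit_lw$y, xout = gc)$y
corrected <- counts / trend
```

One caveat: `lowess()` only returns the smoothed curve, not a model object, so the interpolation step above is needed to map the trend back onto each bin.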
Thank you for the recommendation. Best, |
Is there a reason you did a loess fit for each cell individually when correcting for GC bias? It's not like you're going to have cell-specific GC bias when sequencing. |
We are currently trying out different things in order to boost the performance. |
Sounds good! Please keep us abreast of what you figure out. I am curious what you considered to be unacceptable deviations from ground truth. I'm experimenting with doing one GC correction across all cells within a given sample. I've branched your code and made those edits so it runs: if you want to test it out quickly, you can install from this branch and it should be ready to run. I just don't have the ability to easily benchmark it, but it runs much, much faster.
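The pseudobulk idea can be sketched as follows. This is my reading of the approach described above, not the actual branch code, and the count matrix and GC values are toy data:

```r
# Toy data: a bins x cells count matrix and per-bin GC content.
set.seed(1)
counts <- matrix(rpois(1000 * 50, lambda = 20), nrow = 1000)
gc <- runif(1000, 0.3, 0.7)

# One lowess fit on the pseudobulk instead of one loess fit per cell:
pseudobulk <- rowSums(counts)
fit <- lowess(gc, pseudobulk, f = 0.2)
trend <- approx(fit$x, fit$y, xout = gc)$y

# Per-bin correction factor, applied identically to every cell.
# Dividing a matrix by a length-nrow vector recycles down columns,
# i.e. row i of every cell is divided by factor_gc[i].
factor_gc <- trend / mean(pseudobulk)
corrected <- counts / factor_gc
```

The speedup comes from fitting one curve per sample rather than one per cell; the trade-off, as discussed later in this thread, is that any genuinely heterogeneous bias across cells is no longer captured.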
Thanks for your suggestion with the pseudobulk GC correction factor, this sounds like a good alternative. I will test it and let you know how well it performs on our test datasets! |
Has any formal benchmarking been done on the GC correction modifications suggested by @markphillippebworth? Either the use of the lowess/loess function or the pseudobulk calculation? They sound very promising, and a speed increase would be very welcome in our workflow, since we are also hoping to run this tool on many samples. My current attempt has been stuck at the GC bias correction step for 5 days and is still not complete.
Hey, sorry for not coming back earlier, but we haven't found a solution for this yet. We tested the suggestion from @markphillippebworth and also other alternative GC corrections. Unfortunately, it worked only on some of our test datasets and performed very badly on others, so for now we cannot recommend it in general. As an alternative to improve the runtime, we are currently implementing a Python version of the epiAneufinder package, which we hope will be more efficient. Our plan is to publish this early next year, and we will of course keep you all updated.
Thank you for the update @KatharinaSchmid ! |
Yes, thank you for the update! It's quite curious that performance was highly variable across datasets. It makes me wonder if GC correction via an 8-mer match or something similar might perform better. Regardless, I'm excited to see the run times for the Python version!
Hello! I've been trying to use your software, but I only ever see global CNV changes - no hint of single cell changes.
I've tried changing the window size between 1e5, 5e5, and 1e6, minFrags between 1000 and 1500, and minsizeCNV between 0 and 4, and yet I still get essentially identical results regardless. And the results don't really make sense to me. The data also comes from 10x, and I've been working with it for a while, so there are no significant issues with it.
Code:
epiAneufinder(input=pathToArrow, #Enter path to your fragments.tsv file or the folder containing bam files
              outdir="epiAneufinder_results", #Path to the directory where results should be written
              blacklist="hg38-blacklist.v2.bed.gz", #Path to bed file that contains the blacklisted regions of your genome
              windowSize=1e6,
              genome="BSgenome.Hsapiens.UCSC.hg38", #Substitute with relevant BSgenome
              exclude=c('chrX','chrY','chrM'),
              reuse.existing=FALSE,
              title_karyo="Karyogram of sample data",
              ncores=4,
              minFrags=1500,
              minsizeCNV=0,
              k=4,
              plotKaryo=TRUE)
[1] "Removing old file from the output folder"
Subtracting Blacklist...
Adding Nucleotide Information...
1 of 22
...
22 of 22
[1] "Finished making windows successfully"
[1] chr1 10066-10126 * | B038 8
[2] chr1 10067-10137 * | B038 2
[3] chr1 10067-10138 * | B038 13
[4] chr1 10068-10132 * | B038 1
[5] chr1 10071-10124 * | B038 1
[6] chr1 10071-10125 * | B038 5
seqinfo: 24 sequences from an unspecified genome; no seqlengths
NULL
Getting Counts...
Counting reads from fragments/bed file ..
[1] "Count matrix with 1 cells and 4900 windows has been generated and will be saved as count_summary.rds"
Correcting for GC bias...
[1] "Filtering empty windows, 4900 windows remain."
[1] "Successfully identified breakpoints"
A .tsv file with the results has been written to disk. It contains the copy number states for each cell per bin.
0 denotes 'Loss', 1 denotes 'Normal', 2 denotes 'Gain'.
[1] "Successfully plotted karyogram"
Warning message:
In dir.create(outdir, recursive = TRUE) :
'epiAneufinder_results/epiAneufinder_results' already exists