
Only Identifies Global CNVs for scATAC data #25

Open · markphillippebworth opened this issue Jun 28, 2024 · 20 comments
@markphillippebworth

Hello! I've been trying to use your software, but I only ever see global CNV changes - no hint of single-cell changes.

I've tried changing the window size between 1e5, 5e5, and 1e6, the minFrags between 1000 and 1500, and the minsizeCNV between 0 and 4, yet I still get essentially identical results regardless. And the results don't really make sense to me. The data comes from 10x, and I've been working with it for a while, so there are no significant issues with it.

[Image: Karyogram (10)]

Code:

epiAneufinder(input=pathToArrow, #Enter path to your fragments.tsv file or the folder containing bam files
              outdir="epiAneufinder_results", #Path to the directory where results should be written
              blacklist="hg38-blacklist.v2.bed.gz", #Path to bed file that contains the blacklisted regions of your genome
              windowSize=1e6,
              genome="BSgenome.Hsapiens.UCSC.hg38", #Substitute with relevant BSgenome
              exclude=c('chrX','chrY','chrM'),
              reuse.existing=FALSE,
              title_karyo="Karyogram of sample data",
              ncores=4,
              minFrags=1500,
              minsizeCNV=0,
              k=4,
              plotKaryo=TRUE)

[1] "Removing old file from the output folder"
Subtracting Blacklist...
Adding Nucleotide Information...
1 of 22
2 of 22
3 of 22
4 of 22
5 of 22
6 of 22
7 of 22
8 of 22
9 of 22
10 of 22
11 of 22
12 of 22
13 of 22
14 of 22
15 of 22
16 of 22
17 of 22
18 of 22
19 of 22
20 of 22
21 of 22
22 of 22
[1] "Finished making windows successfully"

[1] "Obtaining the fragments tsv file"
==================================================
GRanges object with 6 ranges and 2 metadata columns:
      seqnames      ranges strand |     barcode       pcr
         <Rle>   <IRanges>  <Rle> | <character> <integer>
  [1]     chr1 10066-10126      * |        B038         8
  [2]     chr1 10067-10137      * |        B038         2
  [3]     chr1 10067-10138      * |        B038        13
  [4]     chr1 10068-10132      * |        B038         1
  [5]     chr1 10071-10124      * |        B038         1
  [6]     chr1 10071-10125      * |        B038         5
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths
NULL
Getting Counts...
Counting reads from fragments/bed file ..
[1] "Count matrix with 1 cells and 4900 windows has been generated and will be saved as count_summary.rds"
Correcting for GC bias...
[1] "Filtering empty windows, 4900 windows remain."
[1] "Successfully identified breakpoints"
A .tsv file with the results has been written to disk. It contains the copy number states for each cell per bin.
0 denotes 'Loss', 1 denotes 'Normal', 2 denotes 'Gain'.
[1] "Successfully plotted karyogram"
Warning message:
In dir.create(outdir, recursive = TRUE) :
'epiAneufinder_results/epiAneufinder_results' already exists

@thek71 (Collaborator) commented Jun 28, 2024

Hi,

The karyogram you have is of only one cell.
The software does not produce global CNVs; it produces CNVs per cell.
Also, in the messages that the software prints you can see that the calculated count matrix includes only one cell, hence the plot:
[1] "Count matrix with 1 cells and 4900 windows has been generated and will be saved as count_summary.rds"

Are you using a fragments file as input or bam files? If bam files, there should be one per cell.
If you are using a fragments file, there might be a formatting issue.
Another possibility is that the minimum fragments variable is still too high for your dataset, although if you have set it to 1000 I find that unlikely.

Best,
Katia

@markphillippebworth (Author) commented Jul 1, 2024

Thank you for the quick response! You're right - that is one cell.

I was using a fragments.tsv.gz file, and I don't know why, but epiAneufinder wasn't recognizing the cell ID column in it. When I manually subset the fragment file to match the exact column layout from the tutorial, it worked. I think it looks for the cell barcode in the 4th column, and mine wasn't there, so it was treating all the cells as one big cell. Is there a way to tell it which column holds the cell barcodes, or which column names to expect? Otherwise, I'll just need to manually edit each file, roughly as sketched below.
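
For reference, a minimal sketch of the per-file fix (assuming data.table, and that my files carry named columns; "barcode" and the file names here are stand-ins for whatever a given pipeline produces):

library(data.table)

# Read the non-standard fragments file (gzipped input may additionally
# need the R.utils package, depending on the data.table version).
frags <- fread("my_fragments.tsv.gz")

# The standard 10x layout is: chrom, start, end, barcode, readCount.
setcolorder(frags, c("chrom", "start", "end", "barcode", "readCount"))

# Write it back without a header, as 10x fragments files have none.
fwrite(frags, "fragments_fixed.tsv.gz", sep = "\t", col.names = FALSE)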

Thank you again! It's neat to see it working.

[Image: karyogram]

@thek71 (Collaborator) commented Jul 2, 2024

Hi,

I am glad that it worked out.
As you mentioned, epiAneufinder is designed to parse the fragments file by column position. In the standard 10x fragments format the fourth column holds the barcodes, and we designed the algorithm around that specification.
I am not sure what your dataset looks like, but if it is in a non-standard fragments format then unfortunately you will have to convert it first.
Is your dataset coming from a different vendor with a different format specification?

Best,
Katia

@markphillippebworth (Author)

We have our own scATAC-seq processing pipeline, which builds on 10x output with additional steps. That said, I've never had an issue with ArchR; it's possible that ArchR relies on column names, instead of column positions, when parsing these files.

I am, however, having issues with the program getting stuck on some files. It'll run overnight, stuck at the GC bias correction. I'm not sure why.

@thek71 (Collaborator) commented Jul 10, 2024

Hi,

The GC bias correction is probably the most computationally intensive step.
For me it can also take half a day, depending on the dataset and the resources I give it.
Do you see an error, or is it just taking too long?
You can try with more cores.

Best,
Katia

@markphillippebworth (Author) commented Jul 10, 2024

Ah, ok - I was seeing that it was taking forever, more than half a day on 10 cores. I'll try upping the resources and the window size, so it has less to iterate over.

@markphillippebworth (Author) commented Jul 15, 2024

I tried running it over the weekend with 45 cores - and it's still stuck on GC correction. I've checked CPU usage, and it's been using all 45 cores for the last two days without completing even one file.

epiAneufinder(input=paste('FragmentFiles/', gsub(".*ArrowFiles/|.arrow", "", frag), '.tsv', sep=''),
              #Enter path to your fragments.tsv file or the folder containing bam files
              outdir=paste("Results/", gsub(".*ArrowFiles/|.arrow", "", frag), sep=''),
              #Path to the directory where results should be written
              blacklist="hg38-blacklist.v2.bed.gz",
              #Path to bed file that contains the blacklisted regions of your genome
              windowSize=1e5,
              genome="BSgenome.Hsapiens.UCSC.hg38", #Substitute with relevant BSgenome
              exclude=c('chrX','chrY','chrM'),
              reuse.existing=FALSE,
              title_karyo="Karyogram of sample data",
              ncores=45,
              minFrags=1000,
              minsizeCNV=1,
              k=1,
              plotKaryo=TRUE)

Output:
Counting reads from fragments/bed file ..
[1] "Count matrix with 14940 cells and 26088 windows has been generated and will be saved as count_summary.rds"

@thek71 (Collaborator) commented Jul 15, 2024

I understand - I am also running some samples, and I am on day four.
We are planning to make the GC correction step optional, but we first need to check how the results would be affected before we can recommend it.
It will take a bit of time, but I'll keep you posted.

Best,
Katia

@markphillippebworth (Author)

That would be great, thank you! I tried rewriting part of it from mclapply to parLapply, to eliminate some memory-leak issues under heavy parallelization (and improve runtime), but that hasn't helped, unfortunately. The shape of the change I tried is sketched below.
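
For reference, roughly the pattern I tried (illustrative only - correct_gc_one_cell, cells, gc_per_window, and ncores are stand-ins, not the package's actual internals):

library(parallel)

# Original pattern: forked workers. Each fork inherits the whole session,
# which can balloon memory on long runs:
# res <- mclapply(cells, correct_gc_one_cell, mc.cores = ncores)

# PSOCK cluster instead: persistent workers with explicit data shipping.
cl <- makeCluster(ncores)
clusterExport(cl, varlist = c("correct_gc_one_cell", "gc_per_window"))  # ship shared objects once
res <- parLapply(cl, cells, correct_gc_one_cell)
stopCluster(cl)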

@markphillippebworth (Author)

Thank you for that. I think it could make sense to skip it, depending on the variance in GC content across bins. If the bin size is large enough, I would imagine the GC content variance would be minimal.

One alternative is switching from loess to lowess for the correction, which is known to be much faster on large datasets (tens of thousands of points). Have you considered that implementation? I may try it out on my end and see if it speeds things up - see the sketch after the link below. I have close to 100 samples I'd like to process, and I can't wait 4 days per sample.
https://stats.stackexchange.com/questions/161069/difference-between-loess-and-lowess#:~:text=lowess%20and%20loess%20count%20iterations,i.e.%2C%20iterations%3Diter%2B1
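
For concreteness, the swap would look roughly like this for one cell (hypothetical variable names: counts and gc are per-window vectors, and the smoothing parameters are placeholders, not the package's actual settings):

# Current style: loess fit, then divide out the fitted GC trend.
fit_loess <- loess(counts ~ gc, span = 0.5)
corrected_loess <- counts / predict(fit_loess, gc)

# lowess() is a much lighter implementation on large inputs; it returns the
# fitted curve on sorted x, so map it back to the windows with approx().
fit_lowess <- lowess(gc, counts, f = 0.5)
corrected_lowess <- counts / approx(fit_lowess$x, fit_lowess$y, xout = gc)$y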

@thek71 (Collaborator) commented Jul 15, 2024

Thank you for the recommendation.
We can have a look and keep it as a possible option for cases with a large number of samples.
Of course, we will again need to test it, both in terms of resources and overall performance.

Best,
Katia

@markphillippebworth (Author)

Is there a reason you did a loess fit for each cell individually when correcting for GC bias?
Theoretically, all the cells were processed and sequenced together. You could collapse all the cells into one pseudobulk sample, run your loess fit once to generate the GC correction factors, and then apply those to all the cells evenly - something like the sketch below.

It's not like you're going to have cell-specific GC bias from sequencing.
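
A minimal sketch of the idea (hypothetical names: counts is a windows x cells matrix, gc is the per-window GC fraction, and the span is a placeholder):

pseudobulk <- rowSums(counts)              # collapse all cells per window
fit <- loess(pseudobulk ~ gc, span = 0.5)  # one fit for the whole sample
expected <- predict(fit, gc)
scale_factor <- expected / mean(expected)  # per-window correction factor
corrected <- counts / scale_factor         # recycles down each column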

@thek71 (Collaborator) commented Aug 2, 2024

We are currently trying out different things in order to boost the performance.
We have already tried skipping the GC correction, but the results compared to the ground truth were far worse than what we would accept.
We have some other things in mind to try.
Thank you once again for your recommendation. We will add it to our list.

@markphillippebworth (Author)

Sounds good! Please keep us abreast of what you figure out. I am curious what you considered an unacceptable deviation from ground truth.

I'm experimenting with doing a single GC correction across all cells within a given sample. I've branched your code and made the edits needed for it to run:
https://github.com/aifimmunology/epiAneufinder/tree/optimized

If you want to test it out quickly, you can install from this branch (see below), and it should be ready to run. I just don't have the ability to easily benchmark it, but it runs much, much faster.
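
Installing from the branch should just be (assuming the remotes package is available):

remotes::install_github("aifimmunology/epiAneufinder", ref = "optimized")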

@KatharinaSchmid (Contributor)

Thanks for your suggestion of a pseudobulk GC correction factor; it sounds like a good alternative. I will test it and let you know how well it performs on our test datasets!
Regarding your question about performance differences without GC correction: for the SNU601 dataset, the correlation to the ground truth decreased from 86% to 46%. We don't have a specific threshold for what we would consider "an acceptable deviation", but this was clearly too large.

@markphillippebworth (Author) commented Sep 26, 2024

Yes, I absolutely agree with you that 46% is an 'unacceptable' deviation. Good call, and thank you for sharing!

Right now, I'm testing out the method on PBMCs, but I don't have any ground truth for comparison. It looks pretty good, though? I'm expecting some alterations, but not too many.

Whenever you can calculate accuracy metrics with pseudobulk GC correction, I would love to hear about it. I'd like to run your tool across several hundred scATAC-seq samples and analyze the results.

Here's an example run with pseudobulk GC correction:

epiAneufinder(input=frag, #Enter path to your fragments.tsv file or the folder containing bam files
              outdir=paste("Results/", gsub("FragmentFiles/|.tsv", "", frag), sep=''),
              blacklist="hg38-blacklist.v2.bed.gz",
              windowSize=5e5,
              genome="BSgenome.Hsapiens.UCSC.hg38",
              exclude=c('chrX','chrY','chrM'),
              title_karyo="Karyogram of sample data",
              ncores=10,
              minFrags=1000,
              minsizeCNV=1,
              k=1,
              plotKaryo=TRUE)

[Image: karyogram]

@jbwebster commented Dec 16, 2024

Has any formal benchmarking been done on the GC correction modifications suggested by @markphillippebworth? Either the use of the lowess/loess function or the pseudobulk calculation? They sound very promising, and a speed increase would be very welcome in our workflow, since we are also hoping to run this tool on many samples. My current attempt appears stuck at the GC bias correction step after 5 days and is still not complete.

@KatharinaSchmid (Contributor)

Hey, sorry for not coming back earlier, but we haven't found a solution for this yet. We tested the suggestion from @markphillippebworth and also other alternative GC corrections, but unfortunately it worked on only some of our test datasets and performed very badly on others, so for now we cannot recommend it in general. As an alternative way to improve the runtime, we are currently implementing a Python version of epiAneufinder, which we hope will be more efficient. Our plan is to publish it early next year, and of course we will keep you all updated.

@jbwebster

Thank you for the update @KatharinaSchmid !

@markphillippebworth (Author)

Yes, thank you for the update! It's quite curious that performance was so variable across datasets. It makes me wonder whether GC correction via an 8-mer match, or something similar, might perform better - a rough sketch of what I mean is below.
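
A very rough sketch of the profiling step (assuming Biostrings and BSgenome; 'windows' stands in for the GRanges of bins):

library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg38)

# Describe each bin by its full 8-mer composition instead of a single GC
# fraction; a correction model could then regress counts on these profiles.
seqs  <- getSeq(BSgenome.Hsapiens.UCSC.hg38, windows)  # one sequence per bin
kmers <- oligonucleotideFrequency(seqs, width = 8)     # bins x 4^8 count matrix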

Regardless, I'm excited to see the runtimes for the Python version!
