
Only Identifies Global CNVs for scATAC data #25

Open · markphillippebworth opened this issue Jun 28, 2024 · 20 comments
@markphillippebworth

Hello! I've been trying to use your software, but I only ever see global CNV changes - no hint of single-cell changes.

I've tried changing the window size between 1e5, 5e5, and 1e6, the minFrags between 1000 and 1500, and the minsizeCNV between 0 and 4, yet I still get essentially identical results regardless. And the results don't really make sense to me. The data comes from 10x, and I've been working with it for a while, so there are no significant issues with it.

[Image: Karyogram (10)]

Code:

epiAneufinder(input=pathToArrow, #Enter path to your fragments.tsv file or the folder containing bam files
              outdir="epiAneufinder_results", #Path to the directory where results should be written
              blacklist="hg38-blacklist.v2.bed.gz", #Path to bed file that contains the blacklisted regions of your genome
              windowSize=1e6,
              genome="BSgenome.Hsapiens.UCSC.hg38", #Substitute with relevant BSgenome
              exclude=c('chrX','chrY','chrM'),
              reuse.existing=FALSE,
              title_karyo="Karyogram of sample data",
              ncores=4,
              minFrags=1500,
              minsizeCNV=0,
              k=4,
              plotKaryo=TRUE)

[1] "Removing old file from the output folder"
Subtracting Blacklist...
Adding Nucleotide Information...
1 of 22
2 of 22
3 of 22
4 of 22
5 of 22
6 of 22
7 of 22
8 of 22
9 of 22
10 of 22
11 of 22
12 of 22
13 of 22
14 of 22
15 of 22
16 of 22
17 of 22
18 of 22
19 of 22
20 of 22
21 of 22
22 of 22
[1] "Finished making windows successfully"

[1] "Obtaining the fragments tsv file"
==================================================
GRanges object with 6 ranges and 2 metadata columns:
      seqnames      ranges strand |     barcode       pcr
         <Rle>   <IRanges>  <Rle> | <character> <integer>
  [1]     chr1 10066-10126      * |        B038         8
  [2]     chr1 10067-10137      * |        B038         2
  [3]     chr1 10067-10138      * |        B038        13
  [4]     chr1 10068-10132      * |        B038         1
  [5]     chr1 10071-10124      * |        B038         1
  [6]     chr1 10071-10125      * |        B038         5
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths
NULL
Getting Counts...
Counting reads from fragments/bed file ..
[1] "Count matrix with 1 cells and 4900 windows has been generated and will be saved as count_summary.rds"
Correcting for GC bias...
[1] "Filtering empty windows, 4900 windows remain."
[1] "Successfully identified breakpoints"
A .tsv file with the results has been written to disk. It contains the copy number states for each cell per bin.
0 denotes 'Loss', 1 denotes 'Normal', 2 denotes 'Gain'.
[1] "Successfully plotted karyogram"
Warning message:
In dir.create(outdir, recursive = TRUE) :
'epiAneufinder_results/epiAneufinder_results' already exists

@thek71 (Collaborator) commented Jun 28, 2024

Hi,

The karyogram you have is of only one cell.
The software does not produce global CNVs; it produces CNVs per cell.
Also, in the messages that the software prints you can see that the calculated count matrix includes only one cell, hence the plot:
[1] "Count matrix with 1 cells and 4900 windows has been generated and will be saved as count_summary.rds"

Are you using a fragments file as input or bam files? If bam files, there should be one per cell.
If you are using a fragments file, there might be a formatting issue.
Another possibility is that the minimum fragments variable is still too high for your dataset, although if you have set it to 1000 I find that unlikely.

Best,
Katia

@markphillippebworth (Author) commented Jul 1, 2024

Thank you for the quick response! You're right - that is one cell.

I was using a fragments.tsv.gz file, and I don't know why, but epiAneufinder wasn't recognizing the cell ID column in it. When I manually subset the fragment file to match the exact column layout from the tutorial, it worked. I think it looks for the cell barcode in the 4th column, and mine wasn't there, so it was treating all the cells as one big cell. Is there a way to tell it which column holds the cell barcodes, or which column names to expect? Otherwise, I'll just need to manually edit each file, roughly as sketched below.
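
For reference, a minimal sketch of the per-file fix (assuming data.table, and that my files carry named columns; "barcode" and the file names here are stand-ins for whatever a given pipeline produces):

library(data.table)

# Read the non-standard fragments file (gzipped input may additionally
# need the R.utils package, depending on the data.table version).
frags <- fread("my_fragments.tsv.gz")

# The standard 10x layout is: chrom, start, end, barcode, readCount.
setcolorder(frags, c("chrom", "start", "end", "barcode", "readCount"))

# Write it back without a header, as 10x fragments files have none.
fwrite(frags, "fragments_fixed.tsv.gz", sep = "\t", col.names = FALSE)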

Thank you again! It's neat to see it working.

[Image: karyogram]

@thek71 (Collaborator) commented Jul 2, 2024

Hi,

I am glad that it worked out.
As you mentioned, epiAneufinder is designed to parse the fragments file by column position. In the standard 10x fragments format the fourth column holds the barcodes, and we designed the algorithm around that specification.
I am not sure what your dataset looks like, but if it is in a non-standard fragments format then unfortunately you will have to convert it first.
Is your dataset coming from a different vendor with a different format specification?

Best,
Katia

@markphillippebworth (Author)

We have our own scATAC-seq processing pipeline, which builds on 10x output with additional steps. That said, I've never had an issue with ArchR; it's possible that ArchR relies on column names, instead of column positions, when parsing these files.

I am, however, having issues with the program getting stuck on some files. It'll run overnight, stuck at the GC bias correction. I'm not sure why.

@thek71 (Collaborator) commented Jul 10, 2024

Hi,

The GC bias correction is probably the most computationally intensive step.
For me it can also take half a day, depending on the dataset and the resources I give it.
Do you see an error, or is it just taking too long?
You can try with more cores.

Best,
Katia

@markphillippebworth (Author) commented Jul 10, 2024

Ah, ok - I was seeing that it was taking forever, more than half a day on 10 cores. I'll try upping the resources and the window size, so it has less to iterate over.

@markphillippebworth (Author) commented Jul 15, 2024

I tried running it over the weekend with 45 cores - and it's still stuck on GC correction. I've checked CPU usage, and it's been using all 45 cores for the last two days without completing even one file.

epiAneufinder(input=paste('FragmentFiles/', gsub(".*ArrowFiles/|.arrow", "", frag), '.tsv', sep=''),
              #Enter path to your fragments.tsv file or the folder containing bam files
              outdir=paste("Results/", gsub(".*ArrowFiles/|.arrow", "", frag), sep=''),
              #Path to the directory where results should be written
              blacklist="hg38-blacklist.v2.bed.gz",
              #Path to bed file that contains the blacklisted regions of your genome
              windowSize=1e5,
              genome="BSgenome.Hsapiens.UCSC.hg38", #Substitute with relevant BSgenome
              exclude=c('chrX','chrY','chrM'),
              reuse.existing=FALSE,
              title_karyo="Karyogram of sample data",
              ncores=45,
              minFrags=1000,
              minsizeCNV=1,
              k=1,
              plotKaryo=TRUE)

Output:
Counting reads from fragments/bed file ..
[1] "Count matrix with 14940 cells and 26088 windows has been generated and will be saved as count_summary.rds"

@thek71 (Collaborator) commented Jul 15, 2024

I understand - I am also running some samples, and I am on day four.
We are planning to make the GC correction step optional, but we first need to check how the results would be affected before we can recommend it.
It will take a bit of time, but I'll keep you posted.

Best,
Katia

@markphillippebworth (Author)

That would be great, thank you! I tried rewriting part of it from mclapply to parLapply, to eliminate some memory-leak issues under heavy parallelization (and improve runtime), but that hasn't helped, unfortunately. The shape of the change I tried is sketched below.
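
For reference, roughly the pattern I tried (illustrative only - correct_gc_one_cell, cells, gc_per_window, and ncores are stand-ins, not the package's actual internals):

library(parallel)

# Original pattern: forked workers. Each fork inherits the whole session,
# which can balloon memory on long runs:
# res <- mclapply(cells, correct_gc_one_cell, mc.cores = ncores)

# PSOCK cluster instead: persistent workers with explicit data shipping.
cl <- makeCluster(ncores)
clusterExport(cl, varlist = c("correct_gc_one_cell", "gc_per_window"))  # ship shared objects once
res <- parLapply(cl, cells, correct_gc_one_cell)
stopCluster(cl)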

@markphillippebworth (Author)

Thank you for that. I think it could make sense to skip it, depending on the variance in GC content across bins. If the bin size is large enough, I would imagine the GC content variance would be minimal.

One alternative is switching from loess to lowess for the correction, which is known to be much faster on large datasets (tens of thousands of points). Have you considered that implementation? I may try it out on my end and see if it speeds things up - see the sketch after the link below. I have close to 100 samples I'd like to process, and I can't wait 4 days per sample.
https://stats.stackexchange.com/questions/161069/difference-between-loess-and-lowess#:~:text=lowess%20and%20loess%20count%20iterations,i.e.%2C%20iterations%3Diter%2B1
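
For concreteness, the swap would look roughly like this for one cell (hypothetical variable names: counts and gc are per-window vectors, and the smoothing parameters are placeholders, not the package's actual settings):

# Current style: loess fit, then divide out the fitted GC trend.
fit_loess <- loess(counts ~ gc, span = 0.5)
corrected_loess <- counts / predict(fit_loess, gc)

# lowess() is a much lighter implementation on large inputs; it returns the
# fitted curve on sorted x, so map it back to the windows with approx().
fit_lowess <- lowess(gc, counts, f = 0.5)
corrected_lowess <- counts / approx(fit_lowess$x, fit_lowess$y, xout = gc)$y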

@thek71 (Collaborator) commented Jul 15, 2024

Thank you for the recommendation.
We can have a look and keep it as a possible option for cases with a large number of samples.
Of course, we will again need to test it, both in terms of resources and overall performance.

Best,
Katia

@markphillippebworth (Author)

Is there a reason you did a loess fit for each cell individually when correcting for GC bias?
Theoretically, all the cells were processed and sequenced together. You could collapse all the cells into one pseudobulk sample, run your loess fit once to generate the GC correction factors, and then apply those to all the cells evenly - something like the sketch below.

It's not like you're going to have cell-specific GC bias from sequencing.
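
A minimal sketch of the idea (hypothetical names: counts is a windows x cells matrix, gc is the per-window GC fraction, and the span is a placeholder):

pseudobulk <- rowSums(counts)              # collapse all cells per window
fit <- loess(pseudobulk ~ gc, span = 0.5)  # one fit for the whole sample
expected <- predict(fit, gc)
scale_factor <- expected / mean(expected)  # per-window correction factor
corrected <- counts / scale_factor         # recycles down each column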

@thek71 (Collaborator) commented Aug 2, 2024

We are currently trying out different things in order to boost the performance.
We have already tried skipping the GC correction, but the results compared to the ground truth were far worse than what we would accept.
We have some other things in mind to try.
Thank you once again for your recommendation. We will add it to our list.

@markphillippebworth (Author)

Sounds good! Please keep us abreast of what you figure out. I am curious what you considered an unacceptable deviation from ground truth.

I'm experimenting with doing a single GC correction across all cells within a given sample. I've branched your code and made the edits needed for it to run:
https://github.com/aifimmunology/epiAneufinder/tree/optimized

If you want to test it out quickly, you can install from this branch (see below), and it should be ready to run. I just don't have the ability to easily benchmark it, but it runs much, much faster.
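
Installing from the branch should just be (assuming the remotes package is available):

remotes::install_github("aifimmunology/epiAneufinder", ref = "optimized")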

@KatharinaSchmid (Contributor)

Thanks for your suggestion of a pseudobulk GC correction factor; it sounds like a good alternative. I will test it and let you know how well it performs on our test datasets!
Regarding your question about performance differences without GC correction: for the SNU601 dataset, the correlation to the ground truth decreased from 86% to 46%. We don't have a specific threshold for what we would consider "an acceptable deviation", but this was clearly too large.

@markphillippebworth (Author) commented Sep 26, 2024

Yes, I absolutely agree with you that 46% is an 'unacceptable' deviation. Good call, and thank you for sharing!

Right now, I'm testing out the method on PBMCs, but I don't have any ground truth for comparison. It looks pretty good, though? I'm expecting some alterations, but not too many.

Whenever you can calculate accuracy metrics with pseudobulk GC correction, I would love to hear about it. I'd like to run your tool across several hundred scATAC-seq samples and analyze the results.

Here's an example run with pseudobulk GC correction:

epiAneufinder(input=frag, #Enter path to your fragments.tsv file or the folder containing bam files
              outdir=paste("Results/", gsub("FragmentFiles/|.tsv", "", frag), sep=''),
              blacklist="hg38-blacklist.v2.bed.gz",
              windowSize=5e5,
              genome="BSgenome.Hsapiens.UCSC.hg38",
              exclude=c('chrX','chrY','chrM'),
              title_karyo="Karyogram of sample data",
              ncores=10,
              minFrags=1000,
              minsizeCNV=1,
              k=1,
              plotKaryo=TRUE)

[Image: karyogram]

@jbwebster commented Dec 16, 2024

Has any formal benchmarking been done on the GC correction modifications suggested by @markphillippebworth? Either the use of the lowess/loess function or the pseudobulk calculation? They sound very promising, and a speed increase would be very welcome in our workflow, since we are also hoping to run this tool on many samples. My current attempt appears stuck at the GC bias correction step after 5 days and is still not complete.

@KatharinaSchmid (Contributor)

Hey, sorry for not coming back earlier, but we haven't found a solution for this yet. We tested the suggestion from @markphillippebworth and also other alternative GC corrections, but unfortunately it worked on only some of our test datasets and performed very badly on others, so for now we cannot recommend it in general. As an alternative way to improve the runtime, we are currently implementing a Python version of epiAneufinder, which we hope will be more efficient. Our plan is to publish it early next year, and of course we will keep you all updated.

@jbwebster

Thank you for the update @KatharinaSchmid !

@markphillippebworth (Author)

Yes, thank you for the update! It's quite curious that performance was so variable across datasets. It makes me wonder whether GC correction via an 8-mer match, or something similar, might perform better - a rough sketch of what I mean is below.
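
A very rough sketch of the profiling step (assuming Biostrings and BSgenome; 'windows' stands in for the GRanges of bins):

library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg38)

# Describe each bin by its full 8-mer composition instead of a single GC
# fraction; a correction model could then regress counts on these profiles.
seqs  <- getSeq(BSgenome.Hsapiens.UCSC.hg38, windows)  # one sequence per bin
kmers <- oligonucleotideFrequency(seqs, width = 8)     # bins x 4^8 count matrix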

Regardless, I'm excited to see the runtimes for the Python version!
