Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why those cells are unassigned? #106

Open
tewengtong opened this issue Jul 30, 2024 · 3 comments
Open

Why those cells are unassigned? #106

tewengtong opened this issue Jul 30, 2024 · 3 comments

Comments

@tewengtong
Copy link

Dear team,

Thank you for providing the tool to demultiplex. There are 16 donors in one pool in our data. We have paired genotype to demultiplex. My input vcf contains 10w exonic SNPs. I ran Vireo with Mode 1a, it seems there are still some unassigned cells. Some of them are low-coverage cells, but others look fine to me according to their metrics:

[vireo] Loading cell folder ...
[vireo] Loading donor VCF file ...
[vireo] 106759 out 106759 variants matched to donor VCF
[vireo] Demultiplex 27546 cells to 16 donors with 106759 variants.
[vireo] lower bound ranges [-11356250.4, -11356250.4, -11356250.4]
[vireo] allelic rate mean and concentrations:
[[0.086 0.456 0.857]]
[[13504538.2 8695336.1 4566475.7]]
[vireo] donor size before removing doublets:
donor0 donor1 donor2 donor3 donor4 donor5 donor6 donor7 donor8 donor9 donor10 donor11 donor12 donor13 donor14 donor15
2362 2047 3153 805 694 2378 672 2050 1355 2735 1024 2085 1059 1910 2332 885
[vireo] 49612 out 106759 SNPs selected for ambient RNA detection: ELBO_gain > 55.3
[vireo] Ambient RNA time: 29202.8 sec
[vireo] final donor size:
H073 H078 H111 H134 H145 H152 H204 H288 H345 H346 H351 H356 H358 H360 H388 H389 doublet unassigned
1165 1636 1382 1728 1223 363 1156 1543 1135 364 1491 546 752 679 496 316 7276 4295
[vireo] All done: 488 min 57.0 sec

unassigned.filter[1:20,c("donor_id","prob_max","prob_doublet","n_vars")]
donor_id prob_max prob_doublet n_vars
AAACCGACAAATTGCC-1 unassigned 3.20e-01 0.680 1438
AAACGTAAGCTCGTGG-1 unassigned 6.33e-01 0.253 349
AAACTAGAGGCCCTGA-1 unassigned 7.44e-04 0.499 209
AAAGCCGAGCGAATTT-1 unassigned 3.90e-17 0.544 928
AAAGCCTGTATTGAGC-1 unassigned 1.70e-01 0.264 126
AAAGCGAGTACCTTGC-1 unassigned 1.28e-01 0.640 205
AAAGCGTAGCGCACAT-1 unassigned 3.99e-03 0.519 282
AAAGCTAAGCCCTAGG-1 unassigned 6.60e-01 0.340 1077
AAAGGAGAGAAACGCG-1 unassigned 2.30e-01 0.250 105
AAAGGCCCACCGCTAA-1 unassigned 3.27e-01 0.245 80
AAAGGGGCAGCGGTAG-1 unassigned 6.71e-01 0.315 217
AAAGGTATCCAGGATC-1 unassigned 1.61e-125 0.662 4138

My question is: Why those cells are still estimated as unassigned cells even if the n_vars are high? Are those doublets? Should I include those cells for downstream analysis or just delete those cells (should around ~2,000 cells, 10% of the whole library)?

Thank you so much!

@huangyh09
Copy link
Collaborator

Hi, thanks for sharing your experience. Indeed, this looks a bit intriguing. Overall, it looks like working reasonably with having the mean allelic rate of [0.086 0.456 0.857] (while it can be closer to the theoretical value, especially 0.857).

In your experiment setting (if assuming a conventional doublet rate, 27K cells for 27% double rate), you may expect a bit more doublets but recent techniques may be improved. Nonetheless, the unassigned cells are indeed too many. Possibly, you may compare the estimated genotype (with mode 4 adding --forceLearnGT) to the known genotype and see if there is any obvious mismatch between donors.

You can turn off the --callAmbientRNAs to save the major running time.

Yuanhua

@tewengtong
Copy link
Author

tewengtong commented Aug 6, 2024

Dear Yuanhua,

Thank you so much for your patience and suggestions. Indeed, we applied 10X GEM-X 5' library kit for our sample. According to 10X, the doublet rate could be half. However, our samples are human adipose tissue, which may expect to have more doublets.

According to your suggestions. I tried both mode 2 (with GT for all donors) and mode 4(with --forceLearnGT).

For mode 2 my script is:
vireo -c "./B1_L1/" -d "filtered_16_samples_GT.vcf.gz" -o "v7" -t GT

For mode 4:
vireo -c "./B1_L1/" -d "filtered_16_samples_GT.vcf.gz" -o ./B1_L1/" -t GT --forceLearnGT

Mode 2 identified 4162 unassigned cells and mode 4 identified 3484.

As you suggested to compare their genotype, indeed, I do find several variants are not consistent in the estimated genotype and my known genotype. Could you please give me some suggestions on:

  1. Does it mean my genotype file is problematic?
  2. Which one should I refer to or what can I do?

Also, I tried to check the potential reasons myself in several ways:

  1. I compared the matrices distribution generated from two modes:
image

It seems there are neglectable differences between the results.

  1. Then, I looked at the unassigned identified by mode2 but not in mode 4:
image

It seems that mode 4 returned more "symmetrical" and "extreme" results compared to mode 2. Do you know why is that and what can i do?

  1. I also checked what kind of unassigned cells identified by mode2 are, and I found that:
  1. 42.8% of them are cells with low count or low genes expressed or high mitochondrial genes.
  1. 14.3% of them has a contamination rate > 0.5. Since our samples are from human adipose tissue, i think contamination may be a reason.
  1. Other 43% cells: I have no idea what they are:
image Do you have any idea about this plot? Do you think I can set a threshold for singlets and doublets at 0.7 instead of 0.9?

Thank you so much for your patience! You opinion values a lot to us!

Best,
Yihan

@huangyh09
Copy link
Collaborator

Hi, thanks for the detailed follow-up and sorry for the delay. Unfortunately, I don't have concrete answers. Here are some thoughts:

  • It's good to see that modes 2 and 4 are consistent by large, suggesting the overall quality and settings should be fine.

  • For point 2, I guess it's not too surprising, as you focused on unassigned cells in mode 2. So you would see a proportion of them are rescued by mode 4 but the others are probably genuine doublets.

  • I'm still a bit concerned about the inferred allelic rate parameter (0.857); it's in a reasonable range but generally, if it's 0.9-0.95 (or even closer to 1) would be better, meaning the REF-REF (homogenous genotypes) are reliable. Considering the number of donors in the poor is not small, it may make the doublets detection a bit more difficult. If you need more cells, you may manually lower the threshold of doublet probability. Also, mode 4 should be OK to use.

Yuanhua

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants