Create C4R repeat expansion database #3

Open
Madelinehazel opened this issue Mar 5, 2024 · 0 comments

Madelinehazel commented Mar 5, 2024

We use TRGT to call short repeat expansions from PacBio HiFi genomes. We would like to develop a 'database' that tells us the minimum and maximum repeat size we've seen in our samples for each repeat locus.

This will be a text file that contains the minimum and maximum allele size for each repeat, as well as the names of the samples carrying those minimum and maximum alleles. Example line:

repeat min_size max_size min_size_sample max_size_sample
chr10_100000834_100000912_A 50 80 HG00639 HG00099

You can generate this from the HPRC (Human Pangenome Reference Consortium) and C4R TRGT VCFs in: /hpf/largeprojects/ccmbio/ccmmarvin_shared/pacbio_longread/TRGT/proband_only_workflow/HPRC-C4R-VCFs. Alternatively, generate it from this text file, which was derived from the same HPRC and C4R VCFs and is probably the easiest place to start: /hpf/largeprojects/ccmbio/mcouse/pacbio_report_dev/results/test_outlier_expansions_full/repeat_outliers/sorted_alleles_db.gz
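If starting from the VCFs, a minimal sketch of pulling per-allele sizes out of one file might look like the following. It assumes the repeat ID is stored in an INFO key named `TRID` and the allele lengths in a FORMAT key named `AL` (comma-separated, one value per allele); please confirm both against the header of the actual VCFs before relying on it. The function name `iter_allele_sizes` is hypothetical.

```python
import gzip
from typing import Iterator, Tuple


def iter_allele_sizes(vcf_path: str) -> Iterator[Tuple[str, str, int]]:
    """Yield (repeat_id, sample, allele_size) for every called allele.

    Assumption: repeat ID lives in INFO/TRID and allele lengths in FORMAT/AL.
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        samples = []
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample names follow the FORMAT column
                continue
            # Parse key=value pairs from the INFO column.
            info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
            repeat_id = info.get("TRID")
            format_keys = fields[8].split(":")
            if repeat_id is None or "AL" not in format_keys:
                continue
            al_index = format_keys.index("AL")
            for sample, call in zip(samples, fields[9:]):
                values = call.split(":")
                if al_index >= len(values):
                    continue
                for size in values[al_index].split(","):
                    if size.isdigit():  # skips missing values such as "."
                        yield repeat_id, sample, int(size)
```

Reading the VCF as plain text keeps the sketch dependency-free; a VCF library would work as well if one is already available in the environment.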

See the 'Find outliers' section in this notebook for possible inspiration on how to iterate through and handle the file.
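For the aggregation itself, here is a rough sketch under stated assumptions. It takes an iterable of `(repeat, sample, allele_size)` records, however they are parsed; the layout of sorted_alleles_db.gz is not described here, so reading that file is left out. The function names `build_repeat_db` and `write_repeat_db` are hypothetical, and the tab delimiter in the output is an assumption. When several samples tie for the minimum or maximum, the first one seen is kept.

```python
import csv
import sys
from typing import Dict, Iterable, List, TextIO, Tuple

Record = Tuple[str, str, int]  # (repeat, sample, allele_size)


def build_repeat_db(records: Iterable[Record]) -> Dict[str, List]:
    """Track [min_size, max_size, min_size_sample, max_size_sample] per repeat."""
    db: Dict[str, List] = {}
    for repeat, sample, size in records:
        entry = db.get(repeat)
        if entry is None:
            # First allele seen for this repeat is both the min and the max.
            db[repeat] = [size, size, sample, sample]
            continue
        if size < entry[0]:
            entry[0], entry[2] = size, sample
        if size > entry[1]:
            entry[1], entry[3] = size, sample
    return db


def write_repeat_db(db: Dict[str, List], out: TextIO) -> None:
    """Write one line per repeat, using the column order from the example line."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["repeat", "min_size", "max_size", "min_size_sample", "max_size_sample"]
    )
    for repeat in sorted(db):
        writer.writerow([repeat, *db[repeat]])


if __name__ == "__main__":
    # Toy records reusing the values from the example line.
    demo = [
        ("chr10_100000834_100000912_A", "HG00639", 50),
        ("chr10_100000834_100000912_A", "HG00099", 80),
    ]
    write_repeat_db(build_repeat_db(demo), sys.stdout)
```

Because only the running minimum and maximum are kept per repeat, memory use scales with the number of loci rather than the number of alleles, so the input can be streamed.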

Note: HPRC TRGT VCFs came from Egor. We do not have HPRC BAMs on the hpf.

r-varan self-assigned this Apr 11, 2024