Telomere-to-Telomere consortium primates project

T2T-Primates is a project of the Telomere-to-Telomere consortium and is led by the Makova, Phillippy, and Eichler labs. The project seeks to finish complete, diploid assemblies for key non-human primate species. The project is currently focused on gorilla, bonobo, chimpanzee, orangutan, and gibbon. Following the approach of the human T2T-CHM13 project, all species have been sequenced with high-coverage PacBio HiFi (>50x) and Oxford Nanopore ultra-long 100 kb+ (>30x) sequencing reads. For haplotype phasing, Dovetail Hi-C data was generated for all genomes and Strand-seq data is also expected. Parental Illumina data was collected for bonobo and gorilla, where familial trios were available.

Phase one of the project focused on completing the sex chromosomes (v1 release), and phase two focused on finishing the autosomes (v2 release). Version 2 assemblies for all species are now available, both here and via GenBank/RefSeq. See below for publications detailing our initial analyses of these assemblies.

Data reuse and license

All data is released to the public domain (CC0) and we encourage its reuse. However, we are in the process of finishing and analyzing these genomes, so to avoid duplicating effort, we encourage you to contact us if you are interested in contributing. The following working groups have been formed: assembly, annotation, sex chromosomes, comparative and evolutionary genomics, segmental duplications, acrocentric chromosomes and rDNAs, satellite DNAs, mobile elements, and pangenomics.

Relevant citations:

  1. Yoo D, et al. Complete sequencing of ape genomes. bioRxiv, 2024.
  2. Makova K, et al. The Complete Sequence and Comparative Analysis of Ape Sex Chromosomes. Nature, 2024.

Data Availability

The raw genome sequencing data generated by this study are available under NCBI BioProjects PRJNA602326, PRJNA976699–PRJNA976702, and PRJNA986878–PRJNA986879. Transcriptome data are deposited under BioProjects PRJNA902025 (UW Iso-Seq) and PRJNA1016395 (UW and PSU Iso-Seq and short-read RNA-seq). The genome assemblies are available from GenBank under accessions GCA_028858775.2, GCA_028878055.2, GCA_028885625.2, GCA_028885655.2, GCA_029281585.2, and GCA_029289425.2. Genome assemblies can be downloaded via NCBI.

A UCSC Browser hub is available, including genome-wide alignments, CAT annotations, methylation, and various other annotation and analysis tracks used in this study. The T2T-CHM13v2.0 and HG002v1.0 assemblies used here are also available via the same browser hub, and from GenBank via accessions GCA_009914755.4 (T2T-CHM13), GCA_018852605.1 (HG002 paternal), and GCA_018852615.1 (HG002 maternal). The alignments are available to download or browse in HAL, MAF, and UCSC Chains formats.

Assembly releases

v2.0-v2.1 (November 2023 - May 2024)

Version 2 diploid assemblies were generated by Verkko with additional finishing and polishing steps to reach T2T. Chromosomes were named and oriented according to the prior cytogenetics literature for each species. For convenience, the "hsa" suffix in the chromosome names refers to the human homologous chromosome, where applicable. Gorilla and bonobo were phased using familial trios, and so complete maternal and paternal haplotypes are available for these species. All other species were phased using Hi-C. In the case of Hi-C phasing, each chromosome is completely phased, but it is not known which comes from the maternal or paternal haplotype, so the higher quality haplotype was assigned to hap1 and the lower quality haplotype to hap2. All assemblies have been submitted to NCBI GenBank and are currently being processed. The curated and submitted versions can be downloaded from AWS in a variety of configurations:

There are a number of files within these directories with the following tags:

  • dip : diploid assembly including both haplotypes
  • analysis-dip : diploid assembly + MT + rDNA morph + EBV contigs
  • pri : "Primary linear haplotype": the higher quality haplotype per chromosome (hap1 in non-trios) + ChrXY
  • alt : "Alternate haplotype": the equal or lower quality haplotype (hap2 in non-trios) with no ChrXY
  • mat/pat : maternal and paternal haplotypes, with chrX in mat and chrY in pat
  • hap1/hap2 : hap1 and hap2 haplotypes, with chrX in hap1 and chrY in hap2
  • chrEBV/MT/rDNA : consensus EBV, mitochondria, and rDNA contigs
  • unloc : any unlocalized sequences from unresolved gaps
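
Sequence names cited in this README, such as chr22_pat_hsa21 or chr18_hap1_hsa16, combine the species chromosome, the haplotype label (mat/pat or hap1/hap2), and, where applicable, the human homolog after "hsa". The following is a minimal Python sketch for splitting such names. It assumes that every chromosome-level sequence follows this underscore-separated layout; the regular expression and the `parse_seq_name` helper are illustrative and not part of the project's tooling.

```python
import re
from typing import NamedTuple, Optional


class SeqName(NamedTuple):
    chrom: str                    # species chromosome, e.g. "chr22"
    haplotype: str                # "mat", "pat", "hap1", or "hap2"
    human_homolog: Optional[str]  # text after "hsa", or None if absent


# Assumed layout: chr<ID>_<haplotype>[_hsa<ID>]; the hsa part is optional
# because the README says it is given only "where applicable".
_PATTERN = re.compile(
    r"^(chr[0-9A-Za-z]+)_(mat|pat|hap1|hap2)(?:_hsa([0-9A-Za-z]+))?$"
)


def parse_seq_name(name: str) -> SeqName:
    """Split a sequence name into chromosome, haplotype, and human homolog."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized sequence name: {name!r}")
    return SeqName(match.group(1), match.group(2), match.group(3))
```

For example, `parse_seq_name("chr22_pat_hsa21")` yields chromosome `chr22`, haplotype `pat`, and human homolog `21`. Names for unlocalized, EBV, MT, or rDNA contigs may not fit this pattern and would raise `ValueError` here.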

Files with the date tags 20231122 and 20231205 are the v2.0 assemblies that were initially submitted to GenBank. To serve as a linear reference genome, a haploid “primary” assembly was selected from the diploid assembly of each species. For each chromosome pair, the more complete and accurate haplotype was selected. When rDNA was present in only one haplotype, that haplotype was chosen as the primary regardless of its completeness status. Both diploid and primary assemblies were submitted, but only the primary assemblies, which contain both chrX and chrY, will be annotated and serve as the linear reference for each species. All primary haplotypes are in "T2T" status (gapless and complete, telomere on both ends, higher accuracy) with the following exceptions: the large rDNA arrays; one additional gap in mPanPan1 chr22_pat_hsa21, mPonAbe1 chr18_hap1_hsa16, and mPonAbe1 chr1_hap1_hsa1; and one missing telomere in mPonPyg2 chr21_hap1_hsa20.
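
The primary-haplotype selection rule described above can be sketched in Python as follows. This is only an illustration of the stated logic: the README does not specify how completeness and accuracy were measured, so the `is_t2t` and `error_rate` fields, the `Haplotype` class, and the `pick_primary` function are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Haplotype:
    name: str
    is_t2t: bool       # hypothetical: gapless with telomeres on both ends
    error_rate: float  # hypothetical accuracy measure; lower is better
    has_rdna: bool     # whether this haplotype carries an rDNA array


def pick_primary(a: Haplotype, b: Haplotype) -> Haplotype:
    """Choose the primary haplotype for one chromosome pair."""
    # Rule 1: if rDNA is present in only one haplotype, that haplotype is
    # primary regardless of completeness.
    if a.has_rdna != b.has_rdna:
        return a if a.has_rdna else b
    # Rule 2 (assumed ordering): prefer the complete haplotype, then the
    # one with the lower error rate. Ties keep the first argument.
    return min((a, b), key=lambda h: (not h.is_t2t, h.error_rate))
```

Ranking completeness ahead of accuracy in rule 2 is a choice made for this sketch; the README only says that the most complete and accurate haplotype was selected.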

*Symphalangus syndactylus (mSymSyn1, siamang gibbon) has been updated to v2.1 with date tag 20240514, and GenBank has been updated accordingly. The only change from v2.0 is that the labels of chromosomes 12 and 19 were swapped to match the prior chromosome assignment for this species.

v1.0 (December 2022)

Version 1 diploid assemblies were generated with Verkko, and contigs were chromosome-assigned and oriented by alignment to the previous references. Both X and Y chromosomes are complete for all species listed. Gorilla and bonobo were phased using familial trios, and all others using Hi-C. To avoid confusion, we have removed links to these assemblies, but they still exist in the AWS bucket.

Downloads

All generated sequencing data and assemblies are available for browsing and download from GenomeArk.

Prior assembly versions

Notes on downloading files

Files are generously hosted by Amazon Web Services under s3://genomeark. Although available as HTTP links above, download performance is improved by using the Amazon Web Services command-line interface. References should be amended to use the s3:// addressing scheme. Amending the max_concurrent_requests etc. settings as per this guide will improve download performance further.
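
As a rough illustration of amending references to the s3:// scheme, the Python sketch below converts an HTTP link for the genomeark bucket into an s3:// URI and builds the matching AWS CLI command. It assumes the HTTP links use the standard S3 virtual-hosted or path-style URL forms and that the bucket allows anonymous reads (hence `--no-sign-request`). The helper names and the example object key are hypothetical.

```python
from typing import List
from urllib.parse import urlparse

BUCKET = "genomeark"  # bucket name given in this README


def http_to_s3(url: str) -> str:
    """Rewrite an HTTP(S) link to the genomeark bucket as an s3:// URI."""
    parsed = urlparse(url)
    host = parsed.netloc
    key = parsed.path.lstrip("/")
    # Virtual-hosted style: https://genomeark.s3.amazonaws.com/<key>
    if host.startswith(BUCKET + ".s3"):
        return f"s3://{BUCKET}/{key}"
    # Path style: https://s3.amazonaws.com/genomeark/<key>
    if host.startswith("s3") and key.startswith(BUCKET + "/"):
        return f"s3://{key}"
    raise ValueError(f"not a recognized {BUCKET} S3 link: {url!r}")


def download_command(url: str, dest: str = ".") -> List[str]:
    """Build an `aws s3 cp` argument list for anonymous download."""
    return ["aws", "s3", "cp", "--no-sign-request", http_to_s3(url), dest]


# Concurrency can be raised beforehand with, for example:
#   aws configure set default.s3.max_concurrent_requests 20
# (the value 20 is only an illustrative choice).
```

The returned list can be passed to `subprocess.run` once the AWS CLI is installed. A hypothetical link such as `https://genomeark.s3.amazonaws.com/species/example/file.fa.gz` maps to `s3://genomeark/species/example/file.fa.gz`.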

Code Availability

Custom scripts used for the v2 assemblies are listed below:

In addition to the custom scripts, the following software was used:

Contact

For any problems related to this dataset, please raise issues on this GitHub repository. For general questions regarding the project, please contact [email protected]. More information about our consortium can be found on the T2T homepage.

History

* Dec 2022. v1 release.
* Nov 2023. v2 release.
* Dec 2023. hap1 and hap2 swapped in mPonAbe1 chr14 (hsa13) and mSymSyn1 chr3 to keep the rDNA-containing or higher quality haplotype in hap1 and in the primary assembly.
* May 2024. mSymSyn1 v2.1 release. Chr12 and Chr19 are swapped to follow prior chromosome assignments for this species.