When building somatic mutation detection tools and pipelines, it is critical to have well-characterized reference samples that serve as the "ground truth" against which mutation calls can be evaluated and optimized. The Genome in a Bottle Consortium has characterized a number of germline samples that have proven extremely valuable. Many germline variant detection pipelines and algorithms, e.g., DeepVariant, have relied on them for training and tuning. However, rigorously characterized genome-wide reference samples and call sets did not exist for somatic mutations until the work by SEQC2's Somatic Mutation Working Group. The following are some of the working group's publications:
- Fang L.T. et al. Nat Biotechnol (2021) described the methods used to establish the high-confidence somatic mutation call set and its corresponding high-confidence regions for a pair of tumor-normal cancer cell lines (HCC1395 vs. HCC1395BL). The genomic DNA was produced in a single batch by ATCC to ensure sample homogeneity. The high-confidence call set was consolidated from 20+ whole-genome sequencing replicates from multiple sequencing centers and platforms, with a combined sequencing depth of 1500X. Three read aligners (BWA MEM, NovoAlign, and Bowtie2), six somatic mutation callers (MuTect2, SomaticSniper, VarDict, MuSE, Strelka2, and TNscope), and two machine learning algorithms (SomaticSeq and NeuSomatic) were used to create the high-confidence call set. More detailed documentation can be found here. The high-confidence call set and regions have been used in a number of companion studies. DOI:10.1038/s41587-021-00993-6 / PMID:34504347 / SharedIt Link / YouTube presentation
- Xiao W. et al. Nat Biotechnol (2021) used the reference data and call sets described above to investigate how different experimental and bioinformatic factors affect the accuracy of somatic mutation detection in whole-genome and whole-exome sequencing. The experimental factors investigated include DNA input amount, fresh vs. FFPE samples (and different formalin fixation durations), fragmentation properties, sequencing libraries, sequencing depth, and tumor purity. The bioinformatic factors include the choice of read aligner and variant caller, quality trimming, error correction, and post-alignment processing (i.e., indel realignment and base quality score recalibration). DOI:10.1038/s41587-021-00994-5 / PMID:34504346 / SharedIt Link / YouTube presentation
- Sahraeian S.M.E. et al. Genome Biol (2022) used the sequencing data and the high-confidence somatic mutation call set to build and optimize deep learning models that accurately detect somatic mutations. DOI:10.1186/s13059-021-02592-9 / PMID:34996510 / YouTube presentation
- Zhao Y. et al. Sci Data (2021) is the data descriptor for all the sequencing data on SRA:SRP162370 generated by the working group for the pair of tumor-normal reference samples. The multi-center data sets were generated on platforms including Illumina NovaSeq, Illumina HiSeq, PacBio Sequel, Ion Torrent, Oxford Nanopore, and 10X Chromium. Sequencing data generated with all the different experimental factors described above are also included. For convenience, some of the BWA MEM-aligned BAM files can be downloaded from NCBI's FTP site. DOI:10.1038/s41597-021-01077-5 / PMID:34753956
- SEQC2 Collection on Nature Biotechnology
- SEQC2 Collection on Genome Biology
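
To illustrate how a "ground truth" call set and its high-confidence regions are typically used, here is a minimal Python sketch that scores a set of somatic calls against a truth set, counting only variants inside the high-confidence regions. It is not the working group's evaluation code. The function names, the `(chrom, pos, ref, alt)` tuple representation, and the toy data are all assumptions made for this example; it assumes variants are already normalized and uses exact matching, which a real benchmark would need to relax for indels and complex variants.

```python
from bisect import bisect_right


def build_region_index(regions):
    """Group BED-style (chrom, start, end) intervals by chromosome, sorted and merged.

    Intervals are 0-based and half-open, as in BED files.
    """
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent interval: extend the previous one.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_regions(index, chrom, pos):
    """Return True if a 1-based (VCF-style) position lies inside any interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    zero_based = pos - 1
    i = bisect_right(starts, zero_based) - 1
    return i >= 0 and zero_based < ends[i]


def evaluate(calls, truth, regions):
    """Compute precision, recall, and F1 restricted to the given regions.

    calls and truth are sets of hypothetical (chrom, pos, ref, alt) tuples.
    """
    index = build_region_index(regions)
    calls_in = {v for v in calls if in_regions(index, v[0], v[1])}
    truth_in = {v for v in truth if in_regions(index, v[0], v[1])}

    tp = len(calls_in & truth_in)
    fp = len(calls_in - truth_in)
    fn = len(truth_in - calls_in)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
regions = [("chr1", 100, 200), ("chr2", 0, 50)]
truth = {("chr1", 150, "A", "T"), ("chr1", 180, "G", "C"),
         ("chr2", 10, "C", "G"), ("chr1", 500, "A", "G")}
calls = {("chr1", 150, "A", "T"), ("chr1", 190, "T", "A"),
         ("chr2", 10, "C", "G"), ("chr3", 5, "A", "C")}

result = evaluate(calls, truth, regions)
print(result)
```

Variants outside the regions are dropped from both sets before counting, so a call in a region that was never characterized is neither rewarded nor penalized. The position conversion in `in_regions` reflects that BED intervals are 0-based and half-open while VCF positions are 1-based.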