4.1.0.0

@droazen droazen released this 30 Jan 03:38

It's been a year since the GATK 4.0.0.0 release in January 2018, and we decided that it was time to package up the past year's worth of GATK improvements into a new major release, which we're calling version 4.1.0.0!

To commemorate this milestone, we'll be publishing a series of in-depth technical articles and blog posts covering the major new features in version 4.1.0.0 on the official GATK blog.

Below we've compiled the highlights of the new features added between versions 4.0.0.0 and 4.1.0.0. If you're interested in seeing only the changes between the last release (4.0.12.0) and this release (4.1.0.0), skip to "Changes between versions 4.0.12.0 and 4.1.0.0 only" further down.

Official docker image is here: https://hub.docker.com/r/broadinstitute/gatk/

Major changes between versions 4.0.0.0 and 4.1.0.0 (January 2018 to January 2019):


  • Next-Gen VQSR Replacement For Single-Sample

    • New suite of tools CNNScoreVariants, CNNVariantTrain, CNNVariantWriteTensors, and FilterVariantTranches
    • CNNScoreVariants is now out of beta and ready for production use
    • Performs variant training and scoring using a convolutional neural network.
    • Single-sample only
    • Produces better results than the legacy VariantRecalibrator (VQSR), and results comparable to or better than third-party tools like DeepVariant
    • Sophisticated 2D model that uses the reads
  • Major HaplotypeCaller Improvements

    • Now genotypes and outputs spanning deletions
    • Now outputs VCF spec-compliant phased variants
    • Can emit MNPs via a new --max-mnp-distance argument
    • Important fix to the reference confidence calculation upstream of indels
    • New HaplotypeCaller priors for variant sites and homRef blocks
      • Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
      • Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
  • Major Mutect2 Improvements

    • Mutect2 is now out of beta
    • Support for multi-sample calling
    • Extensive support for high-depth calling (e.g., cfDNA, UMIs, mitochondria), including a new active region likelihood, probabilistic assembly graph pruning that adjusts to the local depth, a new mitochondria mode, and new filters for blood biopsy and mitochondria
    • Now outputs VCF spec-compliant phased variants
    • Can emit MNPs via a new --max-mnp-distance argument
    • Added a genotype given alleles (GGA) mode
    • New STR indel error model that improves sensitivity and precision in STR (short-tandem repeat) contexts
    • Many new/improved filters to reduce false positives (e.g., FilterAlignmentArtifacts)
    • Mutect2 now automatically recognizes and removes end repair artifacts in regions with inverted tandem repeats. This is extremely important for some FFPE samples.
    • New probabilistic orientation bias tool
    • Removed many questionable indels that were showing up in the bamout of Mutect2 and HaplotypeCaller
    • Big improvements to CalculateContamination, especially when tumor has lots of CNVs
    • NIO support in Mutect2 WDL
    • Significant speed improvements
    • Improved allele fraction estimation
    • Initial GVCF output support
  • Mitochondrial Calling

    • Added --mitochondria-mode to Mutect2 and FilterMutectCalls. This increases sensitivity and only applies filters that are optimized for mitochondria.
  • New allele frequency / qual score model

    • Is now the default in HaplotypeCaller and GenotypeGVCFs
    • Optimized for greater speed, should resolve many GenotypeGVCFs memory issues
    • Rare numerical finite precision issues in the allele-specific qual have been resolved
  • Major Improvements to the CNV (Copy Number Variation) tools

    • The CNV tools are now out of beta.
      • This includes the tools: AnnotateIntervals, CallCopyRatioSegments, CollectAllelicCounts, CollectReadCounts, CreateReadCountPanelOfNormals, DenoiseReadCounts, DetermineGermlineContigPloidy, FilterIntervals, GermlineCNVCaller, ModelSegments, PostprocessGermlineCNVCalls, PreprocessIntervals, PlotDenoisedCopyRatios, and PlotModeledSegments
    • Completed the GermlineCNVCaller (gCNV) pipeline and made various performance/runtime improvements to both the methods and WDLs.
    • Major changes include the addition of new tools (PostprocessGermlineCNVCalls, FilterIntervals, and CollectReadCounts, which replaces CollectFragmentCounts), as well as improvements to existing tools (notably, AnnotateIntervals).
    • Improved support for various formats, namely VCF output in the gCNV pipeline, IGV-compatible .seg output in the ModelSegments somatic CNV pipeline, and CRAM support for all CNV WDLs.
    • Developed tools and WDLs for tagging and filtering of germline events in the ModelSegments somatic CNV pipeline.
  • Funcotator Official Release

    • Funcotator is now out of beta
    • Huge number of bug fixes and accuracy improvements. Output for several fields is now more correct than that of other well-known functional annotation tools.
    • Some new features include:
      • MAF output support
      • NIO support for datasources
      • gnomAD support
      • dbsnp support
      • Support for Mitochondrial amino acid sequence/protein change strings
      • 5'/3' flank support
      • Major performance improvements due to added caching
      • Added ALL mode for transcript selection (--transcript-selection-mode ALL) which will output full annotation fields for all transcripts
    • Created a new FuncotatorDataSourceDownloader tool to download data sources
    • Added an experimental FilterFuncotations tool
  • MarkDuplicatesSpark is now a Validated, Scalable Replacement for MarkDuplicates

    • MarkDuplicatesSpark is now out of beta
    • Rewritten version of the tool matches Picard MarkDuplicates output and has greatly improved performance and scalability
    • Supports multiple BAM inputs
    • Indexes BAM outputs on-the-fly in parallel on a cluster
  • Additional Tools Ported from GATK3

    • Ported VariantAnnotator
    • Ported VariantEval
    • Ported FastaAlternateReferenceMaker and FastaReferenceMaker
    • Ported LeftAlignAndTrimVariants
    • Restored GenotypeGVCFs --include-non-variant-sites argument
  • Major Improvements to the SV (Structural Variation) Tools

    • Improvements to collection and calling of events based on discordant read pair evidence.
    • A new scaffolding algorithm greatly improves the contiguity of local assemblies, increasing sensitivity.
    • Regions of excessive sequencing depth are excluded from evidence collection and assembly, improving runtime performance.
    • A major overhaul of our algorithm for calling events based on local assemblies improves accuracy and allows for the accurate reporting of small complex SVs.
    • A machine learning (xgBoost) based classifier for SV evidence improves runtime and increases accuracy by determining which regions should be fed into the local assembly workflow.
  • Spark Improvements

    • New Disq Spark library allows faster and more accurate loading of formats like BAM and VCF
    • HaplotypeCallerSpark now has a "strict mode" that closely matches the regular HaplotypeCaller
    • Created RevertSamSpark, a parallelized Spark version of Picard's RevertSam tool
    • Migrated most Spark tools that take a reference and/or VCF to use Spark's intrinsic file copying mechanism instead of broadcast to distribute the reference and VCFs to worker nodes -- a big performance win!
  • GenomicsDB Improvements

    • Allele-specific annotation support
    • Multi-interval support (with some performance caveats)
    • Support for sites-only queries
    • Support for returning the GT field in queries
    • New protobuf-based API to allow configuration without editing JSON files
    • Added machinery to allow per-annotation combine operations to be specified
    • Allow HDFS and GCS URIs to be passed to GenomicsDB
    • Migrated from com.intel.genomicsdb to org.genomicsdb
  • "Goodies" Worth Mentioning

    • Added fasta.gz support to the -R/--reference argument in walker tools
    • SelectVariants can now drop specific annotation fields from the output vcf
    • CalculateGenotypePosteriors now supports indels
    • New tool ReblockGVCF to merge reference blocks in single-sample GVCFs for smaller filesizes
    • Improved MQ calculation accuracy, especially at sites with many uninformative reads; concomitant with new annotation tag and format
    • The -L argument now supports GCS (Google Cloud Storage) for interval list files / bed / vcf files in walker tools
    • Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new --gcs-project-for-requester-pays argument
    • Added GCS (Google Cloud Storage) output (-O) support to more tools
    • Improved Python integration (eliminated timeouts and reliance on prompt synchronization) means fewer glitches during runs of ML-based tools
    • A significantly (~33%) smaller GATK docker image
    • Changed argument tagging syntax from "--arg tag:value" to "--arg:tag value"
      • Affects command-line interface for VariantRecalibrator, VariantEval, VariantFiltration, and VariantAnnotator
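The tagged-argument syntax change can break existing scripts, so here is a minimal sketch of how an old-style command line might be rewritten mechanically. It is not part of GATK: the function name `convert_tagged_args` and the `tagged_args` parameter are hypothetical, and the sketch assumes the tag is everything before the first colon of the value and that a remainder starting with `//` means the value is an untagged URI.

```python
def convert_tagged_args(argv, tagged_args):
    """Rewrite old-style '--arg tag:value' pairs as '--arg:tag value'.

    argv        -- list of command-line tokens
    tagged_args -- set of argument names that may carry a tag
                   (hypothetical parameter; the caller decides which apply)
    """
    out = []
    i = 0
    while i < len(argv):
        token = argv[i]
        if token in tagged_args and i + 1 < len(argv):
            raw = argv[i + 1]
            # Assumption: the tag never contains ':', so split at the first one.
            tag, sep, value = raw.partition(":")
            # A remainder starting with '//' means the colon belonged to a URI
            # scheme (e.g. gs://...), so the value was not tagged at all.
            if sep and value and not value.startswith("//"):
                out.extend([f"{token}:{tag}", value])
            else:
                out.extend([token, raw])
            i += 2
        else:
            out.append(token)
            i += 1
    return out


old = ["--resource", "known,known=true,prior=10.0:myFile"]
new = convert_tagged_args(old, {"--resource"})
print(" ".join(new))
```

With the `--resource` example used elsewhere in these notes, `known,known=true,prior=10.0:myFile` is split into the tag `known,known=true,prior=10.0` and the value `myFile`. Local paths that themselves contain a colon are not handled and would need manual review.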

Changes between versions 4.0.12.0 and 4.1.0.0 only:


  • Many tools are now out of beta and ready for production use!

    • CNNScoreVariants is out of beta (#5548)
    • Funcotator and FuncotatorDataSourceDownloader are out of beta (#5621)
    • MarkDuplicatesSpark is out of beta (#5603)
    • CNV tools are out of beta (#5596). This includes: AnnotateIntervals, CallCopyRatioSegments, CollectAllelicCounts, CollectReadCounts, CreateReadCountPanelOfNormals, DenoiseReadCounts, DetermineGermlineContigPloidy, FilterIntervals, GermlineCNVCaller, ModelSegments, PostprocessGermlineCNVCalls, PreprocessIntervals, PlotDenoisedCopyRatios, and PlotModeledSegments
  • New tools:

    • Added ports of FastaAlternateReferenceMaker and FastaReferenceMaker from GATK3 (#5549)
    • RevertSamSpark: a parallelized, Spark-based implementation of RevertSam from Picard (#5395)
    • CompareIntervalLists: simple new tool to compare interval lists (#3702)
    • CountBasesInReference: simple new tool to count bases in a reference file (#5549)
    • PrintBGZFBlockInformation: a tool to dump information about blocks in a BGZF file (#4239)
  • Mutect2

    • Mutect2 now works with multiple tumor and normal samples! (#5560)
    • First iteration of a reference confidence GVCF-like output for Mutect2 to enable mitochondrial joint calling (#5312)
    • Changed default blocking and NON-REF LOD params for Mutect2 GVCF mode (#5615)
    • Changed defaults for mitochondria mode now that we have adaptive pruning (#5544)
    • Fixed an edge case bug when Mutect2 sees a variant with population AF = 1 (#5535)
    • Fixed an edge case of zero-depth in FilterMutectCalls germline filter (#5578)
    • Fixed an edge case for the Mutect2 germline resource (#5563)
    • Tweaked the Mutect2 germline filter (#5595)
    • Put new orientation bias model in Mutect2 NIO wdl (#5580)
    • Improved proposed tumor in normal docs to account for new Mutect2 options (#5555)
  • Added a copy of the mitochondria best practices pipeline (#5566) (#5612)

  • HaplotypeCaller

    • New allele frequency / qual score model is now the default in HaplotypeCaller and GenotypeGVCFs (#5484)
    • Simplified and sped up KBestHaplotypeFinder by replacing recursion with Dijkstra's algorithm (#5462) (#5554)
    • Forward input BAM @PG header lines to -bamout output BAM (#3065)
    • Small performance improvement in GVCF mode (#5470)
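To give a feel for the KBestHaplotypeFinder change mentioned above, here is a generic sketch of finding the k lowest-cost paths through a graph with a Dijkstra-style priority queue instead of recursion. This is not GATK's code: the function name `k_best_paths`, the dictionary graph representation, and the use of non-negative edge costs (for example, negative log-probabilities) are all assumptions made for illustration.

```python
import heapq
from itertools import count


def k_best_paths(graph, source, sink, k):
    """Return up to k lowest-cost paths from source to sink.

    graph maps each node to a list of (neighbor, cost) pairs.
    Costs are assumed to be non-negative.
    """
    tie = count()                      # tie-breaker so paths are never compared
    heap = [(0.0, next(tie), [source])]
    pops = {}                          # how many times each node was expanded
    results = []
    while heap and len(results) < k:
        cost, _, path = heapq.heappop(heap)
        node = path[-1]
        pops[node] = pops.get(node, 0) + 1
        if pops[node] > k:
            # More than k partial paths to a node cannot contribute
            # to the k best complete paths.
            continue
        if node == sink:
            results.append((cost, path))
            continue
        for neighbor, edge_cost in graph.get(node, []):
            heapq.heappush(heap, (cost + edge_cost, next(tie), path + [neighbor]))
    return results


toy_graph = {
    "A": [("B", 1.0), ("C", 2.0)],
    "B": [("D", 1.0)],
    "C": [("D", 0.5)],
    "D": [],
}
for cost, path in k_best_paths(toy_graph, "A", "D", 2):
    print(cost, "-".join(path))
```

Because partial paths leave the queue in order of increasing cost, complete paths are produced best-first and the loop can stop as soon as k of them have been found, with no recursion depth to worry about.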
  • CNV Tools

    • Out of beta, as mentioned above! (#5596)
    • Added per-sample denoised coverage output to gCNV (#5584)
    • ModelSegments: Added separate allele-count thresholds for the normal and tumor (#5556)
    • ModelSegments: Added MinibatchSliceSampler and replaced naive subsampling (#5575)
    • Restored array output in gCNV WDLs for efficient postprocessing. (#5490)
  • Changed tagged argument syntax from --argument tag:value to --argument:tag value (#5526)

    • For example, --resource known,known=true,prior=10.0:myFile becomes --resource:known,known=true,prior=10.0 myFile
    • This change affects VariantRecalibrator, VariantEval, VariantFiltration, and VariantAnnotator
  • Funcotator

    • Out of beta, as mentioned above! (#5621)
    • New datasource release that fixes many issues and adds gnomAD support (#5614)
    • VCF Data Sources now preserve the FILTER field (#5598)
    • Funcotator now gets the NCBI build version from the datasource config file (#5522)
    • Funcotator now ignores transcript version numbers when matching on transcript ID (#5557)
    • Funcotator now uses the GATK-wide version number (#5520)
    • Updated Funcotator tool documentation (#5620)
  • MarkDuplicatesSpark

    • Out of beta, as mentioned above! (#5603)
    • Added the ability for MarkDuplicatesSpark to accept multiple bam inputs (#5430)
    • Fixed MarkDuplicatesSpark mutex argument references (#5538)
  • Spark tools

    • Support for distributed BAI index creation, and option for enabling or disabling writing BAI and SBI files on Spark (#5485)
    • Get HaplotypeCallerSpark "strict mode" running on an exome (#5475)
    • Added an option for enabling or disabling writing tabix indexes for bgzipped VCF files from Spark (#5574)
    • Fixed overflow bug in GatkSparkTool.getRecommendedNumReducers() (#5586)
  • GenomicsDB

    • Migrated from com.intel.genomicsdb to org.genomicsdb (#5587) (#5608)
    • GenomicsDB now matches CombineGVCFs with input spanning deletions (#5397)
    • Define GenomicsDB "partitions" over the span of the input intervals in order to dramatically improve exome performance (#5540)
  • Miscellaneous Changes

    • Added liftover wdls and jsons for gnomAD 2.1 (#5604)
    • Added script to create Hg38 to B37 liftover chain (#5579)
    • Allow variant walkers to configure their caching behavior (#3480)
    • Bug fix for using a ReservoirDownsampler with a ReadsDownsamplingIterator (#5594)
    • Started migration to a new URI abstraction (#5526)
    • Fixed inclusion of default read filters in GATK documentation (#5576)
    • Put the actual date/time in the generated GATK documentation (#5567)
    • Pair-HMM alignment algorithm description fix (#5528)
    • Make ReadFilter and Annotation packages configurable (#5573)
    • Fix to make gatk --version print the version instead of throwing an exception (#5537)
    • Added warning message reminding user to add the allele specific annotation group when needed (#3042)
    • Fix for intermittent LeftAlignAndTrimVariants test failures (#5519)
    • Restored link in VariantFiltration docs to point to the updated online JEXL doc (#5525)
    • Moved BucketUtils.deleteOnExit() and deleteRecursively() to IOUtils (#5332)
    • Source the tab completion script in the GATK docker image (#5552)
    • Added GATK jar to CLASSPATH in docker image (#3866)
    • Updated travis github badge link (#5617)
    • Removed offline CRAN repository from build (#5593)
  • Dependencies

    • Updated htsjdk to version 2.18.2 (#5585)
    • Updated picard to version 2.18.25 (#5597)