4.1.0.0
It's been a year since the GATK 4.0.0.0 release in January 2018, and we decided that it was time to package up the past year's worth of GATK improvements into a new major release, which we're calling version 4.1.0.0!
To commemorate this milestone, we'll be publishing a series of in-depth technical articles and blog posts covering the major new features in version 4.1.0.0 on the official GATK blog.
Below we've compiled the highlights of the new features added between versions 4.0.0.0 and 4.1.0.0. If you're interested in seeing only the changes between the last release (4.0.12.0) and this release (4.1.0.0), skip to the section "Changes between versions 4.0.12.0 and 4.1.0.0 only" below.
Official docker image is here: https://hub.docker.com/r/broadinstitute/gatk/
Major changes between versions 4.0.0.0 and 4.1.0.0 (January 2018 to January 2019):
- Next-Gen VQSR Replacement For Single-Sample
  - New suite of tools: CNNScoreVariants, CNNVariantTrain, CNNVariantWriteTensors, and FilterVariantTranches
  - CNNScoreVariants is now out of beta and ready for production use
  - Performs variant training and scoring using a convolutional neural network
  - Single-sample only
  - Produces better results than the legacy VariantRecalibrator (VQSR) and comparable or better results to third-party tools like DeepVariant
  - Sophisticated 2D model that uses the reads
- Major HaplotypeCaller Improvements
  - Now genotypes and outputs spanning deletions
  - Now outputs VCF spec-compliant phased variants
  - Can emit MNPs via a new --max-mnp-distance argument
  - Important fix to the reference confidence calculation upstream of indels
  - New HaplotypeCaller priors for variant sites and homRef blocks
    - Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
    - Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
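To make the MNP item above concrete, here is a minimal Python sketch of how nearby substitutions on one haplotype could be merged into multi-nucleotide polymorphisms. It is an illustration only, not GATK's implementation: the function names `merge_mnps` and `_emit` are hypothetical, the input is assumed to be already phased onto a single haplotype, and "distance" is assumed to mean the difference between consecutive substitution positions.

```python
def _emit(reference, group):
    """Build one (start, ref_allele, alt_allele) record from a group of SNVs."""
    start, end = group[0][0], group[-1][0]
    alt = list(reference[start:end + 1])
    for pos, base in group:
        alt[pos - start] = base  # overwrite only the substituted bases
    return (start, reference[start:end + 1], "".join(alt))


def merge_mnps(reference, snvs, max_mnp_distance):
    """Merge same-haplotype SNVs whose positions differ by <= max_mnp_distance.

    reference: reference sequence, indexed from 0 (assumption for this sketch)
    snvs: list of (position, alt_base), assumed phased on one haplotype
    """
    merged, group = [], []
    for pos, alt in sorted(snvs):
        # Start a new group when the gap to the previous SNV is too large.
        if group and pos - group[-1][0] > max_mnp_distance:
            merged.append(_emit(reference, group))
            group = []
        group.append((pos, alt))
    if group:
        merged.append(_emit(reference, group))
    return merged
```

Under these assumptions, a distance of 0 leaves every substitution as its own record, while a distance of 1 joins substitutions at adjacent positions into a single record.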
- Major Mutect2 Improvements
  - Mutect2 is now out of beta
  - Support for multi-sample calling
  - Extensive support for high-depth calling (cfDNA, UMIs, mitochondria), including a new active region likelihood, probabilistic assembly graph pruning that adjusts to the local depth, a new mitochondria mode, and new filters for blood biopsy and mitochondria
  - Now outputs VCF spec-compliant phased variants
  - Can emit MNPs via a new --max-mnp-distance argument
  - Added a genotype given alleles (GGA) mode
  - New STR indel error model that improves sensitivity and precision in STR (short tandem repeat) contexts
  - Many new/improved filters to reduce false positives (e.g., FilterAlignmentArtifacts)
  - Mutect2 now automatically recognizes and removes end repair artifacts in regions with inverted tandem repeats. This is extremely important for some FFPE samples.
  - New probabilistic orientation bias tool
  - Removed many questionable indels that showed up in the bamout of Mutect2 and HaplotypeCaller
  - Big improvements to CalculateContamination, especially when the tumor has many CNVs
  - NIO support in Mutect2 WDL
  - Significant speed improvements
  - Improved allele fraction estimation
  - Initial GVCF output support
- Mitochondrial Calling
  - Added --mitochondria-mode to Mutect2 and FilterMutectCalls. This increases sensitivity and only applies filters that are optimized for mitochondria.
- New allele frequency / qual score model
  - Is now the default in HaplotypeCaller and GenotypeGVCFs
  - Optimized for greater speed; should resolve many GenotypeGVCFs memory issues
  - Rare numerical finite precision issues in the allele-specific qual have been resolved
- Major Improvements to the CNV (Copy Number Variation) tools
  - The CNV tools are now out of beta.
    - This includes the tools: AnnotateIntervals, CallCopyRatioSegments, CollectAllelicCounts, CollectReadCounts, CreateReadCountPanelOfNormals, DenoiseReadCounts, DetermineGermlineContigPloidy, FilterIntervals, GermlineCNVCaller, ModelSegments, PostprocessGermlineCNVCalls, PreprocessIntervals, PlotDenoisedCopyRatios, and PlotModeledSegments
  - Completed the GermlineCNVCaller (gCNV) pipeline and made various performance/runtime improvements to both the methods and WDLs.
  - Major changes include the addition of new tools (PostprocessGermlineCNVCalls, FilterIntervals, and CollectReadCounts, which replaces CollectFragmentCounts), as well as improvements to existing tools (notably, AnnotateIntervals).
  - Improved support for various formats, namely VCF output in the gCNV pipeline, IGV-compatible .seg output in the ModelSegments somatic CNV pipeline, and CRAM support for all CNV WDLs.
  - Developed tools and WDLs for tagging and filtering of germline events in the ModelSegments somatic CNV pipeline.
- Funcotator Official Release
  - Funcotator is now out of beta
  - Huge number of bug fixes and accuracy improvements. Output for several fields is now more correct than in other well-known functional annotation tools.
  - Some new features include:
    - MAF output support
    - NIO support for datasources
    - gnomAD support
    - dbSNP support
    - Support for mitochondrial amino acid sequence/protein change strings
    - 5'/3' flank support
    - Major performance improvements due to added caching
    - Added ALL mode for transcript selection (--transcript-selection-mode ALL), which will output full annotation fields for all transcripts
  - Created a new FuncotatorDataSourceDownloader tool to download data sources
  - Added an experimental FilterFuncotations tool
- MarkDuplicatesSpark is now a Validated, Scalable Replacement for MarkDuplicates
  - MarkDuplicatesSpark is now out of beta
  - Rewritten version of the tool matches Picard MarkDuplicates output and has greatly improved performance and scalability
  - Supports multiple BAM inputs
  - Indexes BAM outputs on-the-fly in parallel on a cluster
- Additional Tools Ported from GATK3
  - Ported VariantAnnotator
  - Ported VariantEval
  - Ported FastaAlternateReferenceMaker and FastaReferenceMaker
  - Ported LeftAlignAndTrimVariants
  - Restored the GenotypeGVCFs --include-non-variant-sites argument
- Major Improvements to the SV (Structural Variation) Tools
  - Improvements to collection and calling of events based on discordant read pair evidence.
  - A new scaffolding algorithm greatly improves the contiguity of local assemblies, increasing sensitivity.
  - Regions of excessive sequencing depth are excluded from evidence collection and assembly, improving runtime performance.
  - A major overhaul of our algorithm for calling events based on local assemblies improves accuracy and allows for the accurate reporting of small complex SVs.
  - A machine learning (XGBoost) based classifier for SV evidence improves runtime and increases accuracy by determining which regions should be fed into the local assembly workflow.
- Spark Improvements
  - New Disq Spark library allows faster and more accurate loading of formats like BAM and VCF
  - HaplotypeCallerSpark now has a "strict mode" that closely matches the regular HaplotypeCaller
  - Created RevertSamSpark, a parallelized Spark version of Picard's RevertSam tool
  - Migrated most Spark tools that take a reference and/or VCF to use Spark's intrinsic file copying mechanism instead of broadcast to distribute the reference and VCFs to worker nodes -- a big performance win!
- GenomicsDB Improvements
  - Allele-specific annotation support
  - Multi-interval support (with some performance caveats)
  - Support for sites-only queries
  - Support for returning the GT field in queries
  - New protobuf-based API to allow configuration without editing JSON files
  - Added machinery to allow per-annotation combine operations to be specified
  - HDFS and GCS URIs can now be passed to GenomicsDB
  - Migrated from com.intel.genomicsdb to org.genomicsdb
- "Goodies" Worth Mentioning
  - Added fasta.gz support to the -R/--reference argument in walker tools
  - SelectVariants can now drop specific annotation fields from the output VCF
  - CalculateGenotypePosteriors now supports indels
  - New tool ReblockGVCF to merge reference blocks in single-sample GVCFs for smaller file sizes
  - Improved MQ calculation accuracy, especially at sites with many uninformative reads; this comes with a new annotation tag and format
  - The -L argument now supports GCS (Google Cloud Storage) for interval list files / bed / vcf files in walker tools
  - Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new --gcs-project-for-requester-pays argument
  - Added GCS (Google Cloud Storage) output (-O) support to more tools
  - Improved Python integration (eliminated timeouts and reliance on prompt synchronization) means fewer glitches during runs of ML-based tools
  - A significantly (~33%) smaller GATK docker image
  - Changed argument tagging syntax from "--arg tag:value" to "--arg:tag value"
    - Affects command-line interface for VariantRecalibrator, VariantEval, VariantFiltration, and VariantAnnotator
Changes between versions 4.0.12.0 and 4.1.0.0 only:
- Many tools are now out of beta and ready for production use!
  - CNNScoreVariants is out of beta (#5548)
  - Funcotator and FuncotatorDataSourceDownloader are out of beta (#5621)
  - MarkDuplicatesSpark is out of beta (#5603)
  - CNV tools are out of beta (#5596). This includes: AnnotateIntervals, CallCopyRatioSegments, CollectAllelicCounts, CollectReadCounts, CreateReadCountPanelOfNormals, DenoiseReadCounts, DetermineGermlineContigPloidy, FilterIntervals, GermlineCNVCaller, ModelSegments, PostprocessGermlineCNVCalls, PreprocessIntervals, PlotDenoisedCopyRatios, and PlotModeledSegments
- New tools:
  - Added ports of FastaAlternateReferenceMaker and FastaReferenceMaker from GATK3 (#5549)
  - RevertSamSpark: a parallelized, Spark-based implementation of RevertSam from Picard (#5395)
  - CompareIntervalLists: simple new tool to compare interval lists (#3702)
  - CountBasesInReference: simple new tool to count bases in a reference file (#5549)
  - PrintBGZFBlockInformation: a tool to dump information about blocks in a BGZF file (#4239)
- Mutect2
  - Mutect2 now works with multiple tumor and normal samples! (#5560)
  - First iteration of a reference confidence GVCF-like output for Mutect2 to enable mitochondrial joint calling (#5312)
  - Changed default blocking and NON-REF LOD params for Mutect2 GVCF mode (#5615)
  - Changed defaults for mitochondria mode now that we have adaptive pruning (#5544)
  - Fixed an edge case bug when Mutect2 sees a variant with population AF = 1 (#5535)
  - Fixed an edge case of zero-depth in FilterMutectCalls germline filter (#5578)
  - Fixed an edge case for the Mutect2 germline resource (#5563)
  - Tweaked the Mutect2 germline filter (#5595)
  - Put new orientation bias model in Mutect2 NIO wdl (#5580)
  - Improved the proposed tumor-in-normal docs to account for new Mutect2 options (#5555)
- Added a copy of the mitochondria best practices pipeline (#5566) (#5612)
- HaplotypeCaller
  - New allele frequency / qual score model is now the default in HaplotypeCaller and GenotypeGVCFs (#5484)
  - Simplified and sped up KBestHaplotypeFinder by replacing recursion with Dijkstra's algorithm (#5462) (#5554)
  - Forward input BAM @PG header lines to -bamout output BAM (#3065)
  - Small performance improvement in GVCF mode (#5470)
- CNV Tools
  - Out of beta, as mentioned above! (#5596)
  - Added per-sample denoised coverage output to gCNV (#5584)
  - ModelSegments: Added separate allele-count thresholds for the normal and tumor (#5556)
  - ModelSegments: Added MinibatchSliceSampler and replaced naive subsampling (#5575)
  - Restored array output in gCNV WDLs for efficient postprocessing (#5490)
- Changed tagged argument syntax from --argument tag:value to --argument:tag value (#5526)
  - For example, --resource known,known=true,prior=10.0:myFile becomes --resource:known,known=true,prior=10.0 myFile
  - This change affects VariantRecalibrator, VariantEval, VariantFiltration, and VariantAnnotator
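If you have scripts that still use the old tagged argument form, a small helper along the following lines could rewrite the argument list. This is a hedged sketch and not part of GATK: the function name `convert_tagged_args` and the `tagged_flags` parameter are hypothetical, and the check that skips values of the form scheme://... is a heuristic assumption to avoid mistaking a URI for a tag.

```python
def convert_tagged_args(argv, tagged_flags=("--resource",)):
    """Rewrite '--flag tag:value' pairs as '--flag:tag value'.

    argv: list of command-line tokens
    tagged_flags: flags known to take tagged values (assumed by the caller)
    """
    out = []
    i = 0
    while i < len(argv):
        token = argv[i]
        if token in tagged_flags and i + 1 < len(argv):
            # Split the old-style value at the first colon: tag on the left.
            tag, sep, value = argv[i + 1].partition(":")
            # Heuristic: 'scheme://...' is treated as an untagged URI.
            if sep and value and not value.startswith("//"):
                out.extend([f"{token}:{tag}", value])
                i += 2
                continue
        out.append(token)
        i += 1
    return out
```

Applied to the example from the release notes, the token pair `--resource` and `known,known=true,prior=10.0:myFile` becomes `--resource:known,known=true,prior=10.0` followed by `myFile`. Values without a tag are passed through unchanged.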
- Funcotator
  - Out of beta, as mentioned above! (#5621)
  - New datasource release that fixes many issues and adds gnomAD support (#5614)
  - VCF Data Sources now preserve the FILTER field (#5598)
  - Funcotator now gets the NCBI build version from the datasource config file (#5522)
  - Funcotator now ignores transcript version numbers when matching on transcript ID (#5557)
  - Funcotator now uses the GATK-wide version number (#5520)
  - Updated Funcotator tool documentation (#5620)
- MarkDuplicatesSpark
- Spark tools
  - Support for distributed BAI index creation, and option for enabling or disabling writing BAI and SBI files on Spark (#5485)
  - Got HaplotypeCallerSpark "strict mode" running on an exome (#5475)
  - Added an option for enabling or disabling writing tabix indexes for bgzipped VCF files from Spark (#5574)
  - Fixed overflow bug in GatkSparkTool.getRecommendedNumReducers() (#5586)
- GenomicsDB
- Miscellaneous Changes
  - Added liftover wdls and jsons for gnomAD 2.1 (#5604)
  - Added script to create Hg38 to B37 liftover chain (#5579)
  - Allowed variant walkers to configure their caching behavior (#3480)
  - Bug fix for using a ReservoirDownsampler with a ReadsDownsamplingIterator (#5594)
  - Started migration to a new URI abstraction (#5526)
  - Fixed inclusion of default read filters in GATK documentation (#5576)
  - Put the actual date/time in the generated GATK documentation (#5567)
  - Pair-HMM alignment algorithm description fix (#5528)
  - Made ReadFilter and Annotation packages configurable (#5573)
  - Fix to make gatk --version print the version instead of throwing an exception (#5537)
  - Added warning message reminding user to add the allele specific annotation group when needed (#3042)
  - Fix for intermittent LeftAlignAndTrimVariants test failures (#5519)
  - Restored link in VariantFiltration docs to point to the updated online JEXL doc (#5525)
  - Moved BucketUtils.deleteOnExit() and deleteRecursively() to IOUtils (#5332)
  - Source the tab completion script in the GATK docker image (#5552)
  - Added GATK jar to CLASSPATH in docker image (#3866)
  - Updated travis github badge link (#5617)
  - Removed offline CRAN repository from build (#5593)
- Dependencies