Releases: broadinstitute/gatk
4.1.4.1
Download release: gatk-4.1.4.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.4.1 release:
- New experimental HaplotypeCaller assembly mode which improves phasing, reduces false positives, improves calling at complex sites, and has a 15-20% speedup vs the current assembler. It is enabled with the option --linked-de-bruijn-graph. This mode is still experimental and not recommended for production use yet.
- IndexFeatureFile improvements:
  - now cloud enabled
  - changed the controversial -F argument to -I instead.
- Bug fixes and improvements in GenomicsDB, Mutect2, variant annotation, and more!
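The two command-line changes highlighted above can be sketched as follows. This is a minimal illustration, not an official usage example: the helper names, the file names, and a `gatk` launcher on `PATH` are assumptions, as are the standard `-R`/`-I`/`-O` arguments. Only `--linked-de-bruijn-graph` and the `-F` to `-I` rename come from these notes.

```python
from typing import List


def haplotype_caller_cmd(reference: str, bam: str, out_vcf: str,
                         linked_graph: bool = False) -> List[str]:
    """Assemble a HaplotypeCaller command line (hypothetical helper)."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", out_vcf]
    if linked_graph:
        # Experimental assembly mode from 4.1.4.1; not recommended for production.
        cmd.append("--linked-de-bruijn-graph")
    return cmd


def index_feature_file_cmd(feature_file: str) -> List[str]:
    """Assemble an IndexFeatureFile command; 4.1.4.1 takes the input via -I."""
    return ["gatk", "IndexFeatureFile", "-I", feature_file]


if __name__ == "__main__":
    # Placeholder file names, printed rather than executed.
    print(" ".join(haplotype_caller_cmd("ref.fasta", "sample.bam", "out.vcf.gz", True)))
    print(" ".join(index_feature_file_cmd("calls.vcf.gz")))
```

The returned lists can be passed to `subprocess.run` once the placeholder paths are replaced with real files.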
Full list of changes:
- New Tools
  - PrintReadsHeader: a new tool to print a BAM/SAM/CRAM header to a file (#6153)
- HaplotypeCaller
- Mutect2
  - Mutect2 now warns but does not fail when three or more reads have the same name. (#6240)
  - Fixed the random seed at the beginning of FilterMutectCalls (#6208)
  - GetSampleName and GetPileupSummaries in the M2 pipeline are no longer beta. (#6215)
  - Increase number of iterations in CalculateContamination to 30. (#6282)
  - Handled an edge case with high scatter count in M2 WDL. (#6216)
  - Use ArgumentsBuilder in M2 tests. (#6219)
- Joint Calling
- CNV Calling
  - Fixed model parameter assignment typo in gCNV ploidy model (#6285)
  - Added docker option to the gcnv QC tasks. (#6185)
  - Added epsilons to overdispersion in gCNV models to avoid NaNs. (#6245) #4824 #6226 #6227
  - Assert that ELBO did not become NaN during each step of inference of gCNV. (#6186)
  - Added ability to override THEANO_FLAGS environment variable in gCNV tools. (#6244) #6235
  - Removed erroneous short argument names in R scripts for CNV plotting. (#6197)
- GenomicsDB
  - Allow GATK to configure annotation processing instead of hardcoding values in GenomicsDB GDB-39
  - High ploidy sites with many genotypes no longer cause an overflow error. GDB-54
  - Add missing libcurl in the native GenomicsDB library. #6122 GDB-66
  - No longer crashes when vcfbufferstream from htslib appears to be invalid. GDB-67
  - Propagated native GenomicsDB exceptions as Java IOExceptions. GDB-68
  - Fix issue when using vid protobuf interface and there is more than 1 config. GDB-70
  - Cleanup GenomicsDB vid combine protobuf mapping overrides. #6190
- Miscellaneous Changes
  - Cloud-enable IndexFeatureFile and change input arg name from -F to -I. (#6246) #6161
  - WDL to run ReadsPipelineSpark on a multicore machine (#6213)
  - Replace TwoPassReadWalker with more general MultiplePassReadWalker. (#6154)
  - Abolish unfilled likelihoods and revamp VariantAnnotator. (#6172)
  - Improve exception message in ValidateVariants. (#6076)
  - Fix Syntax Warning when running GATK with python 3.8 (#6231)
- Developer / Testing
- Documentation
- Dependencies
4.1.4.0
Download release: gatk-4.1.4.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.4.0 release:
- Major improvements and fixes to Mutect2, including more intelligent handling of paired reads during genotyping and better filtering.
- Important bug fixes to HaplotypeCaller, the joint calling pipeline, and Funcotator
- Beta support for building/testing on Java 11 (#6119) (#6145)
  - We encourage you to try this out and give us feedback!
Full list of changes:
- New Tools
  - AlleleFrequencyQC: a QC tool that uses VariantEval to bin variants in 1000 Genomes by allele frequency. For each bin, we compare the expected allele frequency from 1000 Genomes with the observed allele frequency in the input VCF. This was designed with arrays in mind, as a way to discover potential bugs in our pipeline. (#6039)
- Mutect2
  - Mutect2 genotyping now forces paired reads to support the same haplotype (#5831)
  - New FilterAlignmentArtifacts now realigns a locally-assembled unitig of all variant read pairs (#6143)
  - Fixed a Mutect2 bug that overfiltered by one variant (#6101)
  - Fixed a small gene panel edge case for CalculateContamination (#6137)
  - Fixed a small gene panel edge case in orientation bias filter (#6141)
  - Unified the NIO and non-NIO M2 WDLs (call-caching will now work on Terra) (#6108)
  - Updated Mutect2 pon WDL to WDL 1.0 (#6187)
  - Removed Oncotator from the M2 WDL (Funcotator is still there) (#6144)
  - Fixed an issue in the M2 WDL that could cause the Funcotate task to be ignored by tools such as dxWDL (#6077)
  - Some miscellaneous code refactoring/improvements (#6184) (#6136) (#6107) (#6159)
- HaplotypeCaller
  - HaplotypeCaller now force-calls like Mutect2: the -genotyping-mode GENOTYPE_GIVEN_ALLELES argument is gone (now you only need to specify --alleles force-calls.vcf) and alleles are now force-called in addition to any other alleles (#6090)
  - Renamed --output-mode EMIT_ALL_SITES to --output-mode EMIT_ALL_ACTIVE_SITES, and clarified the documentation for the argument (#6181)
  - Fixed a rare bug in the genotyping engine where it could emit untrimmed alleles for SNP sites (#6044)
  - Fixed some sources of non-determinism in the HaplotypeCaller that in rare cases could cause the output to vary slightly given the same inputs (#6195) (#6104)
  - Deleted the old exact AF calculation model (#6099)
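The force-calling and output-mode changes above might translate into a command line like the one assembled below. This is a sketch under stated assumptions: the helper name and file names are hypothetical, and `-R`/`-I`/`-O` are assumed to be the usual HaplotypeCaller inputs and output. The `--alleles` argument and the `EMIT_ALL_ACTIVE_SITES` value are taken from the notes above.

```python
from typing import List, Optional


def hc_force_call_cmd(reference: str, bam: str, out_vcf: str,
                      alleles_vcf: Optional[str] = None,
                      emit_all_active_sites: bool = False) -> List[str]:
    """Assemble a HaplotypeCaller command with optional force-calling."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", out_vcf]
    if alleles_vcf is not None:
        # Since 4.1.4.0, supplying --alleles is enough to force-call;
        # no separate genotyping-mode argument is passed.
        cmd += ["--alleles", alleles_vcf]
    if emit_all_active_sites:
        # Renamed from EMIT_ALL_SITES in 4.1.4.0.
        cmd += ["--output-mode", "EMIT_ALL_ACTIVE_SITES"]
    return cmd


if __name__ == "__main__":
    print(" ".join(hc_force_call_cmd("ref.fasta", "sample.bam", "out.vcf.gz",
                                     alleles_vcf="force-calls.vcf")))
```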
- Joint Calling
  - Fixed a regression in GATK 4.1.3.0 that caused us to not emit the AS_QD annotation when running a joint calling pipeline with CombineGVCFs (GenomicsDB was unaffected) (#6168)
  - Fixed allele-specific annotation array length issues when alleles are subset in tools such as GenotypeGVCFs (#6079)
  - Changed AS_RankSum outputs to "." for missing values rather than "nul" (#6079)
- Funcotator
  - Fixed a bug that caused Funcotator to output fields in the wrong order in some cases when writing a VCF (#6178)
    - Specifically, Funcotator would output funcotation fields in the wrong order when there was more than 1 site in a VCF data source with the exact same position and alleles and it matched one of the variants being annotated
- Mitochondrial pipeline
  - Renamed the output vcf with the name of the sample and supplied a default value for autosomal_median_coverage (meaning you'll now get the NuMT filter even if you don't provide the actual autosomal coverage) (#6160)
- Miscellaneous Changes
  - Beta support for building/testing on Java 11 (#6119) (#6145)
  - UpdateVCFSequenceDictionary now supports replacing an invalid sequence dictionary in a VCF (#6140)
  - CountFalsePositives now requires an intervals file (#6120)
  - AnalyzeSaturationMutagenesis: use supplementary alignments to identify large deletions (#6092)
  - AnalyzeSaturationMutagenesis: an insert at the start codon is not in the ORF (#6121)
  - Added a check for null sequence dictionaries in the dictionary validation code (#6147)
  - Update SV Spark pipeline example shell scripts saving results to GCS (#6114)
  - Update public key for installing R in docker (#6116)
  - Log exceptions during deletion on JVM exit instead of throwing (#6125)
  - Don't fail the build if we're in a git worktree folder (#6169)
  - Free a bit of memory for the test suite by disabling mysql and postgres on travis (#6085)
  - Delete bogus index files for queryname sorted CRAMs. (#6149)
  - Cleanup GenomicsDB debugging test output (#6089)
- Documentation
  - Fixed mitochondria mode documentation in FilterMutectCalls (#6174)
- Dependencies
4.1.3.0
Download release: gatk-4.1.3.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.3.0 release:
- GnarlyGenotyper, a new beta joint genotyping tool which, along with ReblockGVCF, forms part of a forthcoming more scalable version of our joint genotyping pipeline that we call the "GATK Biggest Practices" pipeline
- FuncotateSegments, a new beta companion tool to Funcotator that performs functional annotation on a segment file (.seg) rather than a VCF
- GenomicsDBImport now has the ability to incrementally update an existing GenomicsDB workspace
- Several important bug fixes to HaplotypeCaller and Mutect2
Compatibility notes:
- GermlineCNVCaller models built in cohort mode with previous releases are no longer compatible. Users should rebuild these models with this release before running GermlineCNVCaller in case mode. See the CNV Tools section below for more details.
Full list of changes:
- New Tools
  - GnarlyGenotyper (beta tool) (#4947) (#6075)
    - The GnarlyGenotyper is designed to perform joint genotyping on cohorts of at least tens of thousands of samples called with HaplotypeCaller and post-processed with ReblockGVCF to produce a multi-sample callset in a super highly scalable manner.
    - Caveats:
      - GnarlyGenotyper is intended to be used with GVCFs for which low quality variants have already been removed, derived from post-processing HaplotypeCaller GVCFs with ReblockGVCF. See the "Biggest Practices" usage example in the ReblockGVCF docs for details.
      - GnarlyGenotyper does not subset alternate alleles and can return some highly multi-allelic sites. PLs will not be output for sites with more than 6 alts to save space.
      - GnarlyGenotyper assumes all diploid genotypes
    - Annotations:
      - To generate all the annotations necessary for VQSR, input variants to the GnarlyGenotyper must include the QUALapprox and VarDP annotations along with the latest RAW_MQandDP annotation.
      - If allele-specific annotations are present, they will be used appropriately and a new AS_AltDP annotation giving the total depth across samples for each alternate allele will be added.
    - A GATK "Biggest Practices" pipeline including the GnarlyGenotyper is forthcoming pending some fixes improving on the above caveats.
  - FuncotateSegments (beta tool) (#5941)
    - A companion tool to Funcotator that performs functional annotation on a segment file (.seg) rather than a VCF
    - The Somatic CNV pipeline can optionally run this tool for functional annotation
- HaplotypeCaller/Mutect2
  - Fixed a regression in HaplotypeCaller/Mutect2 that caused some variants to be lost at sites with high complexity (#5952)
  - Fixed a GGA (GENOTYPE_GIVEN_ALLELES) mode bug in HaplotypeCaller/Mutect2 where added alleles' cigars could have soft clips (#6047)
    - This bug would manifest as a "Cigar cannot be null" error
  - Fixed a bug where cached indel informativeness values could be incorrectly applied to the wrong sites in HaplotypeCaller/Mutect2 (#5911)
  - Fixed an edge case in HaplotypeCaller/Mutect2 where dangling end merging creates cycles (#5960)
  - Added hidden arguments to the assembly engine to track found haplotype counts and kmers used (#6049)
  - Fixed a bug in CalculateContamination when contamination is indistinguishable from zero (#5971)
  - Fixed a bug where normal p value argument in FilterMutectCalls was declared static (#5982)
- CNV Tools
  - Added FuncotateSegments as an option to the Somatic CNV WDL (#5967)
  - Added QC metrics to the Germline CNV workflow (#6017)
  - Enabled GC-bias correction by default in CNV workflows (#5966)
  - Added denoised coverage file concatenation output to gCNV postprocessor (#5823) Note: The addition of this feature breaks compatibility with gCNV cohort-mode models built with previous releases.
  - Changed cr.igv.seg output of ModelSegments to give log2 Segment_Mean. (#5976)
  - Fixed CNV plotting script to allow spaces in input filenames. (#5983)
- GenomicsDBImport
  - Added support for making incremental updates to existing workspaces (#5970)
    - This can be done using the new --genomicsdb-update-workspace-path argument
  - Fixed a crash in GenomicsDBImport on queries at positions inside deletions (#5899)
  - Treat AS_QUALapprox and AS_VarDP strings as array of int vectors (#5933)
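The incremental-update option described above can be illustrated with a small command builder. This is a sketch, not a documented recipe: the helper name and paths are hypothetical, and `-V`, `-L`, and `--genomicsdb-workspace-path` are assumed to be the usual GenomicsDBImport arguments. Only `--genomicsdb-update-workspace-path` is named in these notes, and whether intervals are needed when updating is left to the caller.

```python
from typing import List, Optional, Sequence


def genomicsdb_import_cmd(gvcfs: Sequence[str], workspace: str,
                          update_existing: bool = False,
                          intervals: Optional[str] = None) -> List[str]:
    """Assemble a GenomicsDBImport command for a new or existing workspace."""
    if not gvcfs:
        raise ValueError("at least one GVCF is required")
    cmd = ["gatk", "GenomicsDBImport"]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    # Choose between creating a workspace and appending to one (4.1.3.0+).
    workspace_flag = ("--genomicsdb-update-workspace-path" if update_existing
                      else "--genomicsdb-workspace-path")
    cmd += [workspace_flag, workspace]
    if intervals is not None:
        cmd += ["-L", intervals]
    return cmd


if __name__ == "__main__":
    # First import creates the workspace; a later batch updates it in place.
    print(" ".join(genomicsdb_import_cmd(["a.g.vcf.gz"], "my_db", intervals="chr20")))
    print(" ".join(genomicsdb_import_cmd(["b.g.vcf.gz"], "my_db", update_existing=True)))
```

Backing up the workspace before an incremental import is a sensible precaution, since the update modifies it in place.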
- Mitochondrial Calling Pipeline
  - Added NIO support and updated to WDL 1.0 (#6074)
- Spark Tools
  - Removed the beta label from many simple Spark tools (#5991)
  - Bug fix for reading references from GCS on Spark (#6070)
  - Eliminated an unnecessary sort step in HaplotypeCallerSpark (#5909)
  - Fixed BaseRecalibratorSpark failure on a cluster due to system classloader issue (#5979)
  - Added a WDL for ReadsPipelineSpark (#5904)
  - Added a command-line argument to toggle using NIO on reading for Spark (#6010)
  - Added advanced arguments to MarkDuplicatesSpark to allow non-queryname sorted inputs when specifying multiple input bams and to treat unsorted inputs as queryGroup-sorted (#5974)
  - Clarified the behavior of MarkDuplicatesSpark when given multiple input bams, and improved the sorting behavior if given a mix of queryname-sorted and query-grouped bams (#5901)
  - Changed spark.yarn.executor.memoryOverhead to spark.executor.memoryOverhead as promoted by Spark 2.3 (#6032)
  - Handle newly-added arguments in ApplyBQSRUniqueArgumentCollection (#5949)
- Miscellaneous Changes
  - Added a new BaseQualityHistogram variant annotation to generate base quality histograms (#5986)
  - Added a new SoftClippedReadFilter that can filter out reads where the ratio of soft-clipped bases to total bases exceeds some given value (#5995)
  - Fixed a serious bug in ValidateVariants where the tool would silently do no validation in the default case when a DBSNP file was not provided (#5984)
  - Fixed a "Record covers a position previously traversed" error in ValidateVariants for GVCFs with multiple contigs (#6028)
  - The RMSMappingQuality annotation now requires the --allow-old-rms-mapping-quality-annotation-data argument to run with GVCFs created by older versions of the GATK (#6060)
  - Added a simple TSV/CSV/XSV writer with cloud write support as an alternative to TableWriter (#5930)
  - Funcotator: added Funcotator stand-alone WDL to supported area (#5999)
  - Extracted the GenotypeGVCFs engine into publicly accessible class/function (#6004)
  - Refactored VariantEval methods to allow subclasses to override (#5998)
  - AnalyzeSaturationMutagenesis: arbitrarily choose 1 read for disjoint pairs, dump rejected reads, and various other improvements (#5926) (#6043)
  - Normalized some AssemblyRegion args in HaplotypeCallerSpark (#5977)
  - Don't redundantly delete temporary directories in RScriptExecutor (#5894)
  - Treat all source files as UTF-8 for java, javadoc (#5946)
  - Updated an out-of-date argument name in an error message for the CycleCovariate
  - Changed an error about "duplicate feature inputs" to be a UserException (#5951)
  - Got rid of ExpandingArrayList in favor of ArrayList (#6069)
  - Disabled Codecov for now on travis due to spurious errors (#6052)
  - Lowered the Xms value in the test JVM (#6087)
  - Updated the travis installed R version to 3.2.5, matching our base docker image (#6073)
  - Fixed an erroneous warning about GCS test configuration (#5987)
  - Added a code of conduct (#6036)
- Documentation
  - FilterVariantTranches documentation fix and improvement (#5837)
  - Updated FilterMutectCalls usage examples (#5890)
  - Added --max-mnp-distance 0 to usage example in CreateSomaticPanelOfNormals docs (#5972)
  - Updated the MarkDuplicatesSpark documentation to no longer contain a misleading usage example (#5938)
  - Added a clarification to the README to warn users to set their Gradle JVM properly in Intellij after setup (#6066)
  - Added links to download Java 8 to the README (#6025)
  - Remove non-ascii chars from javadoc (#5936)
- Dependencies
4.1.2.0
Download release: gatk-4.1.2.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.2.0 release:
- Two new tools, MethylationTypeCaller and AnalyzeSaturationMutagenesis (see below for descriptions)
- Significant improvements to GENOTYPE_GIVEN_ALLELES mode in Mutect2 and HaplotypeCaller
- Fixed a serious bug in Funcotator that could cause END positions to be wrong for some deletions in MAF output
- Significant updates to the mitochondrial calling pipeline
Full list of changes:
- New Tools
  - MethylationTypeCaller (#5762)
    - Identifies methylated bases from bisulfite sequencing data. Given a bisulfite sequenced, methylation-aware aligned BAM and a reference, it outputs methylation-site coverage to a specified output vcf file.
  - AnalyzeSaturationMutagenesis (#5803) (#5883)
    - Processes reads from a saturation mutagenesis experiment, an experiment that systematically perturbs a mini-gene to ascertain which amino-acid variations are tolerable at each codon of the open reading frame. Its main job is to discover variations from wild-type sequence among the reads, and to summarize the variations observed.
- Mutect2
  - Made significant improvements to GENOTYPE_GIVEN_ALLELES mode in Mutect2 and HaplotypeCaller (#5874). These improvements are described in more detail in #5857
  - CalculateContamination now works much better for very small gene panels (#5873)
  - We now correctly handle inputs with 100% contamination in Mutect2 filtering (#5853)
  - Mutect2 now uses natural logarithms internally (#5858). This does not change any outputs.
  - Minor update to the Mutect2 PON WDL (#5859)
- Funcotator
  - Fixed a serious bug that could cause END positions to be wrong for some deletions in MAF output (#5876)
  - The tool now throws a user error for an AD field with only 1 value in MAF mode (#5860)
  - Added a new filter to FilterFuncotations. For two autosomal recessive genes, MUTYH and ATP7B, homozygous variants and compound heterozygous variants will be tagged and added to the output vcf. (#5843)
- Mitochondrial Calling Pipeline
  - Updated the pipeline for the new Mutect2 filtering scheme and pulled filtering after the liftover and recombining of the VCF. (#5847)
  - Made the subsetting of the WGS bam fast by using PrintReads over just chrM instead of traversing the whole bam for NuMT mates. (#5847)
  - Moved polymorphic NuMTs based on autosomal coverage to a filter (it was an annotation before) (#5847)
  - Added an option to hard filter by VAF (#5847)
  - Bug fix for large input files to the mitochondrial pipeline (we now include the size of the input BAM/CRAM when calculating disk size, when necessary) (#5861)
- Structural Variation Calling Pipeline
  - Bug fix to QNameFinder to handle reads with negative unclipped starts (#5864)
- Miscellaneous Changes
  - Added a --min-fragment-length argument to the FragmentLengthReadFilter (#5886)
  - Added a --spark-verbosity argument to control verbosity of Spark-generated logs (#5825)
  - Added a new WalkerBase abstract class to be used for all built-in walkers (#4964)
  - Exposed transient attributes in the GATKRead API (#5664)
  - Convert more code to use GATKPathSpecifier (#5870) (#5832). This also fixes an InvalidPathException on Windows machines.
  - Fixes to the test suite related to the recent introduction of a codec for Picard interval lists (#5879)
  - Eliminated an error message during the Docker build in Travis logs by creating a directory before copying to it. (#5878)
- Documentation
4.1.1.0
Highlights of the 4.1.1.0 release:
- A substantial (~33%) speedup to the HaplotypeCaller in GVCF mode (-ERC GVCF)
- Major updates to Mutect2, including completely overhauled filtering and smarter handling of overlapping read pairs.
- A tensorflow update for CNNScoreVariants that speeds up the tool by roughly 2X when using the 2D model.
- Important updates to the mitochondrial calling pipeline, and improved memory usage in the CNV pipeline.
- Important bug fixes to Funcotator, VariantEval, GenomicsDBImport, and other tools, as well as to the --pedigree argument for annotations.
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes:
- HaplotypeCaller
  - Greatly improved the performance of the ReferenceConfidenceModel using dynamic programming and caching (#5607)
    - This speeds up whole-genome GVCF mode calling (-ERC GVCF) by ~33% in our tests!
  - Optimized some additional performance hotspots in the ReferenceConfidenceModel (#5616) (#5469) (#5652)
  - Can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
  - Don't output variants with no ALT allele if the * (spanning deletion) allele gets dropped (#5844)
  - Added a --force-active argument that marks all regions as active. Useful for debugging/diagnostics. (#5635)
  - HaplotypeCallerSpark: made performance improvements to allow the tool to run on WGS in strict mode (#5721)
  - Fixed rare infinite recursion bug in KBestHaplotypeFinder (also affects Mutect2) (#5786)
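GVCF mode and the new diagnostic flag mentioned above could be combined as in the sketch below. The helper name and file names are hypothetical, and `-R`/`-I`/`-O` are assumed to be the usual HaplotypeCaller arguments. `-ERC GVCF` and `--force-active` are the options named in these notes.

```python
from typing import List


def hc_gvcf_cmd(reference: str, bam: str, out_gvcf: str,
                force_active: bool = False) -> List[str]:
    """Assemble a HaplotypeCaller command for GVCF mode."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam,
           "-O", out_gvcf, "-ERC", "GVCF"]
    if force_active:
        # Marks every region as active; intended for debugging/diagnostics,
        # so expect a much slower run.
        cmd.append("--force-active")
    return cmd


if __name__ == "__main__":
    print(" ".join(hc_gvcf_cmd("ref.fasta", "sample.bam", "sample.g.vcf.gz")))
```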
- Mutect2
  - Overhaul of FilterMutectCalls, which now applies a single threshold to an overall error probability (#5688)
    - FilterMutectCalls automatically determines the optimal threshold.
    - The new somatic clustering model learns tumors' allele fraction spectra and overall SNV and indel mutation rates in order to improve filtering.
    - Includes a rewrite of Mutect2 documentation -- better organization and now includes command line examples in addition to math.
  - Mutect2 now modifies base and indel qualities of overlapping paired reads to account for PCR error rather than discarding reads (#5794)
    - This especially improves indel sensitivity.
  - Optimized Mutect2 read orientation filtering by collecting F1R2 counts from within Mutect2 itself, greatly reducing wall-clock and CPU time (#5840)
  - New Mutect2 panel of normals workflow using GenomicsDB for scalability (#5675)
    - Panel of normals removes germline variants in order to contain only technical artifacts, and contains information about artifact prevalence
  - Rewrote Mutect2 active region likelihood as special case of full somatic likelihoods model, which reduces runtime by 5% (#5814)
  - Funcotator updates in Mutect2 WDL (#5742) (#5735)
  - Prune assembly graph before checking for cycles (#5562)
  - Refactor Mutect2 inheritance so that it doesn't have inactive arguments (#5758)
  - Added CRAM support to the Mutect2 WDL (#5668)
  - Split MNPs in Mutect2 PON WDL, fixing a potential bug (#5706)
  - Handle negative infinity log likelihoods from PairHMM in Mutect2 (#5736)
  - Fixed overfiltering in Mutect2 in GGA alleles mode with no reads (#5743)
  - Correct some Mutect2 VCF header lines (#5792)
  - Handle unmarked duplicates with mate MQ = 0 in Mutect2 (#5734)
  - Output sample names in Mutect2 PON header (#5733)
  - Avoid error due to finite precision error in Mutect2 PON creation (#5797)
  - Update Mutect2 javadoc to reflect v4.1 changes. (#5769)
  - Renamed the OxoGReadCounts annotation to OrientationBiasReadCounts (#5840)
- CNNScoreVariants
  - We now use the latest Intel-optimized tensorflow (#5725)
    - This speeds up the 2D CNN by roughly 2X in our tests!
  - FilterVariantTranches is out of beta (#5628)
  - Fixed CNNScoreVariants hanging when the conda environment is not set up (#5819)
    - We now make sure that the GATK tool Python package is present before executing streaming Python commands.
  - Extensive updates to the CNN WDLs (#5251)
- Mitochondrial Calling Pipeline
  - Added an option to recover all dangling branches, on by default for MT calling (#5693)
    - Fixes a large number of missed calls
  - Use adaptive pruning in the mitochondria pipeline (#5669)
  - Changed defaults in mitochondria mode in response to Mutect2 filtering overhaul (#5827)
  - Allowed the MT pipeline to work on bams with a mix of single and paired-end reads (#5818)
  - Added a hard filter to M2 for polymorphic NuMTs and low VAF sites (#5842)
  - Updated the haplochecker version to 0.1.2 to fix a bug with flipping the major and minor hg headers in its output (#5760)
  - Added the rest of the mitochondria joint-calling pipeline (#5673)
    - Merging and genotyping "somatic" GVCFs from Mutect2
  - Added a read filter for unmapped reads and their mates (#5826)
  - Refactored the MT WDL to make validations easier (#5708)
  - Updated a variable name in MT WDL to match gatk-workflows version (#5694)
- GenotypeGVCFs
  - Added an option to merge intervals for better GenotypeGVCFs performance on GenomicsDB exome input (#5741)
  - Trim per-allele FORMAT annotations and optionally retain raw AS annotations (#5833)
    - GenotypeGVCFs now uses the header info to determine if FORMAT lists need to be subset when alleles are dropped
    - Fixes "F1R2 and F2R2 annotations not updated by GenotypeGvcfs" (#5704)
- Funcotator
  - Non-locatable data sources can create funcotations again (#5774)
    - Fixes a bug where Funcotator was not adding funcotations from non-locatable data sources
  - Fixed handling of symbolic alleles when determining best transcript for GencodeFuncotation creation. (#5834)
  - FilterFuncotations: support for multi-allelic variants (#5588)
  - FilterFuncotations: support for gnomAD for allele frequency in ClinVarFilter and LofFilter, with a new argument telling it which dataset of gnomAD or ExAC to use (#5691)
  - Added # as a character to be sanitized by VCFOutputRenderer (#5817)
  - Added in Markdown files for Funcotator forum posts (#5630)
  - Updated Funcotator documentation with a FAQ section to respond to user comments (#5755)
- CNV Tools
  - Improved memory usage in gCNV (#5781)
  - Improved memory requirements of CollectReadCounts (#5715)
  - Added some fixes for minor CNV issues (#5699)
  - Added io_commons.read_csv to address issues with formatting of sample names in gCNV (#5811)
  - Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation (#5732)
- Miscellaneous Changes
  - SelectVariants can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
  - VariantEval bug fix: don't require the output file to already exist (#5681)
  - Fixed the --pedigree argument in the PossibleDeNovo annotation (#5663)
  - GenomicsDBImport: fixed a core dump when querying overlapping deletions (#5799)
  - GatherPileupSummaries: a new tool that combines the output of GetPileupSummaries from disjoint scatter jobs (#5599)
  - VariantsToTable: add splitting for allele-specific annotations and ADs (#5697)
  - CalculateGenotypePosteriors: fix reported bug where no-call genotypes with no reads get genotype posterior probabilities and calls (#5667)
  - Added a new argument to Spark tools enabling the user to control whether to sort the reads on output (#4874)
  - ReadsPipelineSpark: fixed an "Interval not within the bounds of a contig" error (#5645)
  - Concordance: fixed the tool to allow for no variation alleles in the truth data. (#5718)
  - ReblockGVCF: fix sites with zero AD to actually use SITE-level DP value as intended in (#5835)
  - Change UpdateVCFSequenceDictionary to use the specified dictionary uniformly (#5093)
  - Fixed gatk-nightly Docker builds (https://hub.docker.com/r/broadinstitute/gatk-nightly/) (#5759)
  - Print the Picard/HTSJDK versions in addition to the GATK version when running with --version (#5757)
  - IndexFeatureFile: fixed a crash on VCFs with 0 records (#5795)
  - PrintBGZFBlockInformation: removed the file extension check so that we can accept bams (#5801)
  - Added a new read filter: IntervalOverlapReadFilter (#5656)
  - Add NIO Path support to TableReader and TableWriter (#5785)
  - Replaced IntervalsSkipList with OverlapDetector (#4154)
  - Removed some unused arguments in VCF merging code (#5745)
  - Kebab-case some arguments in LocusWalker and LocusWalkerSpark (#5770)
  - Removed an unnecessary IllegalArgumentException in PairHMM (#5705)
  - Removed accidental uses of log4j v1 (#5682)
  - Improvements to Spark evaluation scripts (#5815)
  - Extract tests from PrintReadsIntegrationTest to share with the Spark version. (#5689)
- Documentation
  - Improved the documentation for the StrandOddsRatio annotation (#5703)
  - Fixed the descriptions of some HaplotypeCaller arguments (#5658)
  - Update VariantRecalibrator example code to reflect new tagged argument syntax (#5710)
  - Corrected javadoc for the InbreedingCoeff annotation (#5768)
  - CalculateGenotypePosteriors: minor updates to javadoc and logger type (#5601)
  - Added and updated javadoc for SortSamSpark and MarkDuplicatesSpark (#5672)
  - Added a link to a "GitHub basics for researchers" article at top of the GATK README (#5643)
  - Updated the main GATK README to remove outdated references to the Intel conda environment (#5753)
  - Trimmed overly-long tool...
4.1.0.0
It's been a year since the GATK 4.0.0.0 release in January 2018, and we decided that it was time to package up the past year's worth of GATK improvements into a new major release, which we're calling version 4.1.0.0!
To commemorate this milestone, we'll be publishing a series of in-depth technical articles and blog posts covering the major new features in version 4.1.0.0 on the official GATK blog.
Below we've compiled the highlights of the new features added between versions 4.0.0.0 and 4.1.0.0. If you're interested in seeing only the changes between the last release (4.0.12.0) and this release (4.1.0.0), click here instead.
Official docker image is here: https://hub.docker.com/r/broadinstitute/gatk/
Major changes between versions 4.0.0.0 and 4.1.0.0 (January 2018 to January 2019):
- Next-Gen VQSR Replacement For Single-Sample
  - New suite of tools CNNScoreVariants, CNNVariantTrain, CNNVariantWriteTensors, and FilterVariantTranches
  - CNNScoreVariants is now out of beta and ready for production use
  - Performs variant training and scoring using a convolutional neural network.
  - Single-sample only
  - Produces better results than the legacy VariantRecalibrator (VQSR) and comparable or better results to third-party tools like DeepVariant
  - Sophisticated 2D model that uses the reads
- Major HaplotypeCaller Improvements
  - Now genotypes and outputs spanning deletions
  - Now outputs VCF spec-compliant phased variants
  - Can emit MNPs via a new --max-mnp-distance argument
  - Important fix to the reference confidence calculation upstream of indels
  - New HaplotypeCaller priors for variants sites and homRef blocks
    - Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
    - Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
- Major Mutect2 Improvements
  - Mutect2 is now out of beta
  - Support for multi-sample calling
  - Lots of support for high-depth calling such as cfDNA, UMIs, mitochondria, including a new active region likelihood, probabilistic assembly graph pruning that adjusts to the local depth, a new mitochondria mode, and new filters for blood biopsy and mitochondria
  - Now outputs VCF spec-compliant phased variants
  - Can emit MNPs via a new --max-mnp-distance argument
  - Added a genotype given alleles (GGA) mode
  - New STR indel error model that improves sensitivity and precision in STR (short-tandem repeat) contexts
  - Many new/improved filters to reduce false positives (e.g., FilterAlignmentArtifacts)
  - Mutect2 now automatically recognizes and removes end repair artifacts in regions with inverted tandem repeats. This is extremely important for some FFPE samples.
  - New probabilistic orientation bias tool
  - Got rid of many questionable indels showing up in bamout of Mutect2 and the HaplotypeCaller
  - Big improvements to CalculateContamination, especially when tumor has lots of CNVs
  - NIO support in Mutect2 WDL
  - Significant speed improvements
  - Improved allele fraction estimation
  - Initial GVCF output support
- Mitochondrial Calling
  - Added --mitochondria-mode to Mutect2 and FilterMutectCalls. This increases sensitivity and only applies filters that are optimized for mitochondria.
-
New allele frequency / qual score model
- Is now the default in `HaplotypeCaller` and `GenotypeGVCFs`
- Optimized for greater speed, should resolve many `GenotypeGVCFs` memory issues
- Rare numerical finite precision issues in the allele-specific qual have been resolved
-
Major Improvements to the CNV (Copy Number Variation) tools
- The CNV tools are now out of beta.
  - This includes the tools: `AnnotateIntervals`, `CallCopyRatioSegments`, `CollectAllelicCounts`, `CollectReadCounts`, `CreateReadCountPanelOfNormals`, `DenoiseReadCounts`, `DetermineGermlineContigPloidy`, `FilterIntervals`, `GermlineCNVCaller`, `ModelSegments`, `PostprocessGermlineCNVCalls`, `PreprocessIntervals`, `PlotDenoisedCopyRatios`, and `PlotModeledSegments`
- Completed the `GermlineCNVCaller` (gCNV) pipeline and made various performance/runtime improvements to both the methods and WDLs.
- Major changes include the addition of new tools (`PostprocessGermlineCNVCalls`, `FilterIntervals`, and `CollectReadCounts`, which replaces `CollectFragmentCounts`), as well as improvements to existing tools (notably, `AnnotateIntervals`).
- Improved support for various formats, namely VCF output in the gCNV pipeline, IGV-compatible .seg output in the `ModelSegments` somatic CNV pipeline, and CRAM support for all CNV WDLs.
- Developed tools and WDLs for tagging and filtering of germline events in the `ModelSegments` somatic CNV pipeline.
-
Funcotator Official Release
- Funcotator is now out of beta
- Huge number of bug fixes and accuracy improvements. Output for several fields is now more correct than other well-known functional annotation tools.
- Some new features include:
- MAF output support
- NIO support for datasources
- gnomAD support
- dbsnp support
- Support for Mitochondrial amino acid sequence/protein change strings
- 5'/3' flank support
- Major performance improvements due to added caching
- Added ALL mode for transcript selection (`--transcript-selection-mode ALL`) which will output full annotation fields for all transcripts
- Created a new `FuncotatorDataSourceDownloader` tool to download data sources
- Added an experimental `FilterFuncotations` tool
-
MarkDuplicatesSpark is now a Validated, Scalable Replacement for MarkDuplicates
- MarkDuplicatesSpark is now out of beta
- Rewritten version of the tool matches Picard `MarkDuplicates` output and has greatly improved performance and scalability
- Supports multiple BAM inputs
- Indexes BAM outputs on-the-fly in parallel on a cluster
-
Additional Tools Ported from GATK3
- Ported `VariantAnnotator`
- Ported `VariantEval`
- Ported `FastaAlternateReferenceMaker` and `FastaReferenceMaker`
- Ported `LeftAlignAndTrimVariants`
- Restored `GenotypeGVCFs` `--include-non-variant-sites` argument
-
Major Improvements to the SV (Structural Variation) Tools
- Improvements to collection and calling of events based on discordant read pair evidence.
- A new scaffolding algorithm greatly improves the contiguity of local assemblies, increasing sensitivity.
- Regions of excessive sequencing depth are excluded from evidence collection and assembly, improving runtime performance.
- A major overhaul of our algorithm for calling events based on local assemblies improves accuracy and allows for the accurate reporting of small complex SVs.
- A machine learning (xgBoost) based classifier for SV evidence improves runtime and increases accuracy by determining which regions should be fed into the local assembly workflow.
-
Spark Improvements
- New Disq Spark library allows faster and more accurate loading of formats like BAM and VCF
- `HaplotypeCallerSpark` now has a "strict mode" that closely matches the regular `HaplotypeCaller`
- Created `RevertSamSpark`, a parallelized Spark version of Picard's `RevertSam` tool
- Migrated most Spark tools that take a reference and/or VCF to use Spark's intrinsic file copying mechanism instead of broadcast to distribute the reference and VCFs to worker nodes -- a big performance win!
-
GenomicsDB Improvements
- Allele-specific annotation support
- Multi-interval support (with some performance caveats)
- Support for sites-only queries
- Support for returning the GT field in queries
- New protobuf-based API to allow configuration without editing JSON files
- Added in machinery to allow per-annotation combine operations to be specified
- Allow for HDFS and GCS URIs to be passed to GenomicsDB
- Migrated from `com.intel.genomicsdb` to `org.genomicsdb`
-
"Goodies" Worth Mentioning
- Added fasta.gz support to the `-R/--reference` argument in walker tools
- `SelectVariants` can now drop specific annotation fields from the output vcf
- `CalculateGenotypePosteriors` now supports indels
- New tool `ReblockGVCF` to merge reference blocks in single-sample GVCFs for smaller filesizes
- Improved MQ calculation accuracy, especially at sites with many uninformative reads; concomitant with new annotation tag and format
- The `-L` argument now supports GCS (Google Cloud Storage) for interval list files / bed / vcf files in walker tools
- Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new `--gcs-project-for-requester-pays` argument
- Added GCS (Google Cloud Storage) output (-O) support to more tools
- Improved Python integration (eliminated timeouts and reliance on prompt synchronization) means fewer glitches during runs of ML-based tools
- A significantly (~33%) smaller GATK docker image
- Changed argument tagging syntax from "--arg tag:value" to "--arg:tag value"
  - Affects command-line interface for `VariantRecalibrator`, `VariantEval`, `VariantFiltration`, and `VariantAnnotator`
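The change in argument tagging syntax can be illustrated with a small migration helper. This is only a sketch and not part of GATK: the function name `retag_arguments`, the file names, and the `hapmap` tag are hypothetical, and it assumes that a tag never contains a colon and that untagged values containing a colon look like URIs (`scheme://...`).

```python
def retag_arguments(argv, tagged_options):
    """Rewrite '--arg tag:value' pairs as '--arg:tag value'.

    Only options named in tagged_options are rewritten. The value is split at
    its first colon; values whose first colon starts a URI scheme ('gs://...')
    are treated as untagged and left alone. Illustrative helper, not GATK code.
    """
    out = []
    i = 0
    while i < len(argv):
        token = argv[i]
        if token in tagged_options and i + 1 < len(argv):
            tag, sep, value = argv[i + 1].partition(":")
            if sep and not value.startswith("//"):
                out.extend([f"{token}:{tag}", value])
                i += 2
                continue
        out.append(token)
        i += 1
    return out


# Hypothetical command-line fragment in the old style
old_args = ["--resource", "hapmap:hapmap.vcf", "-O", "out.recal"]
new_args = retag_arguments(old_args, {"--resource"})
print(" ".join(new_args))
```

Scripts that call the affected tools would need an equivalent rewrite by hand; the helper only shows where the tag moves.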
Changes between versions 4.0.12.0 and 4.1.0.0 only:
- Many tools are now out of beta and ready for production use!
- `CNNScor...
4.0.12.0
Highlights of this release include support for outputting phased variants in `HaplotypeCaller`/`Mutect2`, restoring the `--include-non-variant-sites` argument to `GenotypeGVCFs`, a port of the GATK3 tool `VariantEval`, a new library (Disq, https://github.com/disq-bio/disq) for working with BAM/CRAM/VCF/etc. formats on Spark, and GCS (Google Cloud Storage) support in `Funcotator`.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
-
`HaplotypeCaller`/`Mutect2`
- Output VCF spec-compliant phased variants in HaplotypeCaller and Mutect2
- Added an experimental adaptive pruning option for local assembly (#5473)
- Improved implementation of allele-specific new qual (#5460)
- Use cigar complexity to break ties in uninformative reads' best haplotypes (#5359)
- Improved handling of regions that are too short after trimming in HaplotypeCaller and in Mutect2 (Closes issue #5079)
- Optimization in `CigarUtils` to shortcut to M-only CIGAR when provably optimal (#5466)
- Changed SUPPORTED_ALLELES_TAG from SA to XA (#5418)
-
HaplotypeCaller
-
Mutect2
- Big improvements to CalculateContamination's model for determining hom alt sites (#5413)
- Reduce false negatives from mapping quality filter on long indels in Mutect2 (#5497)
- Added a mismatch ratio option in realignment filter (#5501)
- Made Mutect2 read position filter default much less stringent (#5487)
- Fixed M2 bug for germline resources with AF=. (#5442)
- Fix read position annotation bug in M2 filter (#5495)
- Cleaner Mutect2 VCF fields (#5510)
- Moved PerAlleleAnnotations to the INFO field (#5518)
- Removed unnecessary inheritance of M2 filtering arguments collection (#5498)
-
GenotypeGVCFs
- Restored the --include-non-variant-sites argument from GATK3 to GenotypeGVCFs (#5219)
-
Ported the GATK3 tool `VariantEval` to GATK4 (#5043)
-
Replaced the Hadoop-BAM library with the newly-developed Disq library (https://github.com/disq-bio/disq) for efficiently working with BAM/CRAM/VCF/etc. formats on Spark (#5138)
- Improves Spark performance across-the-board, and fixes many edge-case bugs in Hadoop-BAM
-
Funcotator
- Added GCS support to Funcotator data sources, so that data sources can now be accessed directly from GCS buckets (#5425)
- Added support for annotating 5'/3' flanks (#5403)
- Funcotator now creates default annotations for difficult variants. (#5374)
- Funcotator can now create annotations for symbolic alleles and masked alleles (#5406)
- Funcotator now can match between hg19 and b37 data sources. (#5491)
- Added in regression tests and fixes for correctness of many annotations (#5302)
- Now DE_NOVO_START_IN_FRAME and DE_NOVO_START_OUT_FRAME are correct. (#5357)
- Added cDNA Strings for Intronic Variants (#5321)
- VCF data sources create an ID field for the ID of the variant used for the annotation (#5327)
- Funcotator now computes MT protein changes. (#5361)
- Funcotator now correctly populates transcript position. (#5380)
- Added a script that can create data sources from BED files. (#5438)
- Updated testing Gencode data sources to fully exercise test data set (#5423)
- Moved validation test data out of large files area. (#5381)
- Updated top-level class documentation for Funcotator. (#4655)
- Added scripts to liftover gnomAD. Also bugfixes for Funcotator NIO. (#5514)
-
HaplotypeCallerSpark
-
`MarkDuplicatesSpark`: Added a few of the remaining unimplemented useful features from Picard (#5377)
-
CNV workflows
- Changed `FilterIntervals` to operate on the intersection of intervals in all inputs. (#5408)
- Fixed RAM usage parameter error in combine_tracks.wdl (#5358)
- Various other improvements to combine_tracks.wdl (#5384)
- Fixed gCNV WDL broken by Cromwell update on FireCloud. (#5407)
- Replaced bash script in gCNV ScatterIntervals task with updated version of IntervalListTools. (#5414)
-
CNNScoreVariants
- Check for and require hardware AVX support (#5291)
-
Changed `SelectVariants` so that it can handle multiple rsIDs separated by ';' in a VCF file (#5464)
-
Miscellaneous Changes
- Added `setIsUnplaced()` to the `GATKRead` API to distinguish reads with no mapping information (#5320)
- Fixed an integer overflow bug in the `RMSMappingQuality` annotation (#5435)
- Fixed floating-point bug in MannWhitneyU on some JVMs. (#5371)
- Standardized the output argument for `LeftAlignIndels` (#5474)
- `SplitIntervals` now produces an `.interval_list` file (#5392)
- Fixed a bug with GATK_GCS_STAGING in the GATK launcher script #1338 (#5452)
- Added ExampleReadWalkerWithVariantsSpark.java and tests (#5289)
- Add description getter and javadoc in GATKReportTable (#5443)
- Fixed message in GATKAnnotationPluginDescription (#5444)
- Replaced some uses of PrintWriter (#5461)
- Refactor GVCFWriter to allow push/pull iteration. (#5311)
- Add scripts/dataproc-cluster-ui to release bundle. (#5401)
- Marked `VariantAnnotator` as a `@DocumentedFeature` (#5480)
- Removed obsolete intel conda environment references. (#5482)
- Deleted the CountSet class (#5467)
- Test framework: disabled gcloud login on travis for non-cloud non-wdl tests (#5335)
- Updated Spark scripts to reflect changes from #5386 and #5127. (#5415)
- Fixed jexl logging and updated VariantFiltration doc. (#5422)
- Fixed some dead links in the README (#5405)
-
Dependencies
4.0.11.0
A release which includes major improvements to Mitochondrial calling in Mutect2 as well as bug fixes and improvements:
As always, a docker image is available here: https://hub.docker.com/r/broadinstitute/gatk/
Mutect2 and HaplotypeCaller changes:
-
Added `--mitochondria-mode` to `Mutect2` and `FilterMutectCalls`. This increases sensitivity and only applies filters that are optimized for mitochondria. A best practices WDL for calling mitochondrial variants on WGS data will be available in the future. (#5193)
-
Strand based annotations will use both reads in an overlapping read pair (#5286)
-
Realignment filter annotates the VCF with passing and failing read counts (#5328)
-
New filters and annotation to support blood biopsy that count and filter based on N's at variant sites (#5317)
-
Fixed bug for M2 GGA alleles with zero coverage (#5303)
-
Fixed error in genotype given alleles mode when input alleles have genotypes (#5341) #5336
-
Add new annotations to bamout to make understanding calls easier (#5215)
-
Fixed a typo.
CNV Pipeline:
- Added FilterIntervals to perform annotation-based and count-based filtering in the gCNV pipeline. (#5307) closes #2992 #4558
Spark:
- Removed WellformedReadFilter from CountReadsSpark (#5329)
- Support fasta.gz in GATKSparkTool (#5290) closes #5258
Other:
- CNN variant update models validate scores cleanup training (#5175)
- combine_tracks.wdl supports GISTIC2 conversion (and bugfix) (#5287) closes #5284 #5283
- handle normal reads in validation sample in BasicSomaticValidator (#5322)
GenomicsDB:
- Allow for HDFS and GCS URIs to be passed to GenomicsDB (#5197)
SelectVariants:
SplitNCigarReads:
- Added defensive check to OverhangFixingManager splices for non-reference spanning reads (#5298) closes #5293
- Fixed SplitNCigarReads ArrayIndexOutOfBounds error for reads with long deletions (#5285) closes #5230
Testing:
4.0.10.1
This is a small release that improves the calculation of the `MQ` (mapping quality) annotation, which provides an estimate of the overall mapping quality of reads supporting a variant call. It also introduces a number of experimental improvements to the CNV workflows, as well as a bug fix to `LocusWalkerSpark`.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
-
Improve MQ calculation accuracy (#4969)
- Change raw MQ to a tuple of (sumSquaredMQs, totalDepth) for better accuracy where there are lots of uninformative reads or called single-sample variants with homRef genotypes.
- Note that incorporating this change into a pipeline will require a concomitant update to this version for GenomicsDBImport and GenotypeGVCFs.
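To see why a (sumSquaredMQs, totalDepth) tuple helps, here is a minimal sketch of how such raw values could be combined across samples and turned into a root-mean-square MQ. It assumes the final value is the standard RMS, sqrt(sumSquaredMQs / totalDepth); the function names and the sample numbers are made up for illustration and this is not the GATK implementation.

```python
import math


def combine_raw_mq(raw_values):
    """Add per-sample (sum_squared_mqs, total_depth) tuples element-wise."""
    total_squares = sum(squares for squares, _ in raw_values)
    total_depth = sum(depth for _, depth in raw_values)
    return total_squares, total_depth


def rms_mq(sum_squared_mqs, total_depth):
    """Root-mean-square mapping quality; NaN when no reads contributed."""
    if total_depth == 0:
        return float("nan")
    return math.sqrt(sum_squared_mqs / total_depth)


# Hypothetical samples: 10 reads at MQ 60 and 5 reads at MQ 20
sample_a = (10 * 60**2, 10)
sample_b = (5 * 20**2, 5)
combined = combine_raw_mq([sample_a, sample_b])
print(combined, rms_mq(*combined))
```

Because the depth travels with the sum of squares, the denominator is the number of reads that actually contributed, rather than a depth taken from elsewhere in the record.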
-
Updated `SimpleGermlineTagger` and somatic CNV experimental post-processing workflow with several experimental changes that improve precision results, and expand possible evaluations, of GATK CNV (#5252)
- New script `combine_tracks.wdl` for post-processing somatic CNV calls. This WDL will perform two operations:
  - Increases precision by removing:
    - Germline segments. As a result, the WDL requires the matched normal segments.
    - Areas of common germline activity or error from other cancer studies.
  - Converts the tumor model seg file to the same format as AllelicCapSeg, which can be read by ABSOLUTE. This is currently done inline in the WDL.
    - This is not a trivial conversion, since each segment must be called whether it is balanced or not (MAF =? 0.5). The current algorithm relies on hard filtering and may need updating pending evaluation.
    - For more information about AllelicCapSeg and ABSOLUTE, see:
      - Carter et al. Absolute quantification of somatic DNA alterations in human cancer, Nat Biotechnol. 2012 May; 30(5): 413–421
      - https://software.broadinstitute.org/cancer/cga/absolute
      - Brastianos, P.K., Carter S.L., et al. Genomic Characterization of Brain Metastases Reveals Branched Evolution and Potential Therapeutic Targets (2015) Cancer Discovery PMID:26410082
- Changes to GATK tools to support the above:
  - `SimpleGermlineTagger` now uses reciprocal overlap in addition to breakpoint matching when determining a possible germline event. This greatly improved results in areas near centromeres.
  - Added tool `MergeAnnotatedRegionsByAnnotation`. This simple tool will merge genomic regions (specified in a tsv) when given annotations (columns) contain exact values in neighboring segments and the segments are within a specified maximum genomic distance.
- New scripts `multi_combine_tracks.wdl` and `aggregate_combine_tracks.wdl` which run `combine_tracks.wdl` on multiple pairs and combine the results into one seg file for easy consumption by IGV.
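The two tool changes above can be sketched in a few lines. This is an illustration under stated assumptions, not the GATK code: intervals are treated as closed and 1-based, "exact values" is read as simple equality of one annotation column, and the function names, dictionary keys, and example coordinates are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of (overlap / length of a) and (overlap / length of b)."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start + 1), overlap / (b_end - b_start + 1))


def merge_by_annotation(regions, key, max_distance):
    """Merge neighboring regions that share a contig and an annotation value.

    regions: list of dicts with 'contig', 'start', 'end' and annotation
    columns, already sorted by contig and start. Two neighbors are merged when
    their `key` values are equal and the gap between them is <= max_distance.
    """
    merged = []
    for region in regions:
        previous = merged[-1] if merged else None
        if (previous is not None
                and previous["contig"] == region["contig"]
                and previous[key] == region[key]
                and region["start"] - previous["end"] <= max_distance):
            previous["end"] = max(previous["end"], region["end"])
        else:
            merged.append(dict(region))  # copy so the input is not mutated
    return merged


# A tumor segment and a normal segment that share half of their length
print(reciprocal_overlap(100, 199, 150, 249))

segments = [
    {"contig": "1", "start": 1, "end": 100, "call": "+"},
    {"contig": "1", "start": 151, "end": 300, "call": "+"},
    {"contig": "1", "start": 301, "end": 400, "call": "0"},
]
print(merge_by_annotation(segments, "call", max_distance=100))
```

Reciprocal overlap tolerates breakpoints that are shifted slightly between tumor and normal, which is one plausible reason it helps in noisy regions; the threshold used by the tool is not stated in these notes.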
-
`LocusWalkerSpark`: fix issue where intervals with no reads were being dropped (#5222)
- This fixes the bug reported in #3823
-
Added `SparkTestUtils.roundTripThroughJavaSerialization()` method for better serialization testing on Spark (#5257)
-
Build system: set the same compiler flags for all gradle JavaCompile tasks (#5256)
4.0.10.0
Highlights of this release include a new tool `ReblockGVCF`, a bug fix for a crash in `Mutect2`, and a more efficient distribution mechanism for the reference and VCFs in Spark tools.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
-
Added a new experimental tool `ReblockGVCF` (#4940)
- A tool to merge reference blocks in single-sample GVCFs for smaller filesizes
-
`Mutect2`:
- Fixed a bug in the `PalindromeArtifactClipReadTransformer` (#5241)
  - This filter would crash with an out-of-bounds error for fragment lengths and/or mate start positions that went off the end of a contig.
- Changed the way the log10AlleleFractions are calculated in `SomaticLikelihoodsEngine`: now we use the mean of the posterior of the allele fractions. (#5231)
- Reword comments in Mutect2 WDL to not refer to the old orientation bias filter as deprecated. (#5196)
- Cited CGA in Mutect docs (#5228)
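As a worked illustration of "the mean of the posterior of the allele fractions", the sketch below assumes the posterior is a Dirichlet distribution with concentration parameters alpha, whose mean for allele i is alpha_i divided by the sum of all alphas. That distributional assumption, the function name, and the example numbers are ours and are not taken from the release notes.

```python
import math


def log10_mean_allele_fractions(dirichlet_alphas):
    """log10 of the mean of a Dirichlet: log10(alpha_i / sum(alpha))."""
    total = sum(dirichlet_alphas)
    return [math.log10(alpha / total) for alpha in dirichlet_alphas]


# Hypothetical posterior parameters for a reference and an alternate allele
alphas = [9.0, 1.0]
print(log10_mean_allele_fractions(alphas))
```

With these made-up parameters the mean allele fractions are 0.9 and 0.1, so the second entry is close to -1 in log10 space.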
-
`HaplotypeCaller`: Allow MNP calling in GVCF mode with stern warnings about not trying joint-genotyping from the resulting GVCFs. (#5182)
- `HaplotypeCaller` will now allow you to output MNPs in GVCF mode with a warning. However, since joint genotyping of MNPs is unsupported, `CombineGVCFs` and `GenomicsDBImport` will now refuse to process GVCFs containing MNPs.
-
`GATK Spark tools`:
- Migrated most Spark tools that take a reference and/or VCF to use Spark's intrinsic file copying mechanism instead of broadcast to distribute the reference and VCFs to worker nodes (#5127) (#5221)
  - This improves the performance of Spark tools that take a reference and/or VCF as side inputs, as the new distribution mechanism doesn't load the entire contents of the files into memory like broadcast did.
  - As a side effect of this change, support for 2bit references has been removed from tools that were migrated to the new distribution mechanism (in particular, `BaseRecalibratorSpark` and `HaplotypeCallerSpark`).
  - The CNV Spark tools have not yet been migrated, and still support 2bit references for now.
- Bug fix: ensure that intervals with no reads are not dropped by the `SparkSharder` (#5248)
-
`Funcotator`:
-
Fix a multithreaded race condition in `GenotypeLikelihoodCalculators` by synchronizing updates of shared genotype likelihood tables. (#5071)
- This bug affected `HaplotypeCallerSpark`, but not the regular `HaplotypeCaller`
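The general pattern behind this kind of fix, guarding a lazily grown shared table with a lock, can be sketched as follows. This is a toy illustration in Python rather than the actual Java change: the class name `SharedLikelihoodTables`, the placeholder table contents, and the thread count are all hypothetical.

```python
import threading


class SharedLikelihoodTables:
    """Toy stand-in for a lazily grown table shared by many threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = []  # index i holds the (placeholder) table of size i + 1

    def get(self, size):
        # Check and grow under one lock, so no thread can observe or
        # overwrite a partially extended list of tables.
        with self._lock:
            while len(self._tables) <= size:
                n = len(self._tables)
                self._tables.append([k * k for k in range(n + 1)])
            return self._tables[size]


tables = SharedLikelihoodTables()
results = []


def worker(size):
    results.append(len(tables.get(size)))


threads = [threading.Thread(target=worker, args=(s,)) for s in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(sorted(results))
```

Without the lock, two threads could both decide the table is too small and extend it at the same time, which is the kind of interleaving that tends to show up only under a multithreaded runner.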
-
`GenomicsDB`: added in machinery to allow per-annotation combine operations to be specified (#4993)
-
`GATK Engine`: Hooked up `CountingVariantFilter` to `VariantWalkers` (#4954)
-
`StreamingPythonScriptExecutor`: added a new message to the `StreamingProcessController` ack FIFO protocol to allow additional message detail to be passed as part of a negative ack. (#5170)
- This improves exception message propagation for fatal errors when running Python tools.
-
`gCNV WDLs`:
- Tar calls from all samples. (#5225)
  - This fixes an issue where the gCNV WGS cohort germline WDL was outputting vcf files with names that do not correspond to the actual samples inside the files.
- Added multi-sample functionality to gCNV case mode WDL, and added a wrapper for gCNV case mode WDL to help optimize cloud computation cost. Also optimized how data is sent to postprocessing task in gCNV WDLs. (#5176)
-
`gCNV kernel`: Enforced ViterbiSegmentationEngine to analyze single samples only (#5176)
-
Added a `dataproc-cluster-ui` script to easily open the Spark UI on dataproc clusters (#5188)
-
Fixed pom issues that prevented publishing to maven central (#5224)
-
Added `tabix` to the docker base image (#5247)