SERVE harmonizes various sources of evidence into a single unified model that can be readily used in genomic analyses:
- A model is generated which allows mapping of genomic events to clinical evidence.
- An overview of mutations that are implied to be potential cancer drivers is generated.
- Which knowledgebases are supported?
- What is generated as final SERVE output?
- How are genomic events extracted from the source knowledgebases?
- What is done in terms of curation and harmonization?
- How does SERVE deal with multiple reference genome versions?
- How does everything come together?
- Version history and download links
SERVE supports the ingestion of the following knowledgebases:
- CGI - general purpose knowledgebase that is supported through VICC
- CIViC - general purpose knowledgebase that is supported through VICC
- CKB CORE - part of CKB's knowledgebase that is supported through VICC
- CKB FLEX - The complete CKB clinical database.
- OncoKB - general purpose knowledgebase that is supported through VICC
- DoCM - database containing pathogenic mutations in cancer
- iClusion - a database with all actively recruiting clinical trials in the Netherlands
- ACTIN - a database with all actively recruiting clinical trials in the ACTIN study along with molecular inclusion criteria for these trials.
- HMF Cohort - a database of recurrent somatic mutations in cancer-related genes from the Hartwig database.
- HMF Curated - a database of known driver mutations curated by the Hartwig team.
Support for the following knowledgebases is under development:
- CBG Compassionate Use - a database of approved compassionate use programs in the Netherlands
A number of other Hartwig modules support the ingestion (and analysis) of these knowledgebases:
- VICC Importer: A module supporting the ingestion of any knowledgebase ingested into VICC.
- iClusion Importer: A client implementation of the iClusion API which transforms iClusion API output to data that can be ingested into SERVE.
- CKB Importer: A module supporting the ingestion and analysis of CKB FLEX.
Do note that SERVE does not provide the actual data that is input to the algorithm, but only provides support for its ingestion. While SERVE itself is open-source, the sources that can be ingested have their own licensing, and it is up to the users to make sure they comply with the license terms of the data itself.
SERVE generates clinical evidence in the following datamodel:
- Treatment (name of trial or drug(s))
- Cancer type (annotated with DOID) for which the treatment is considered on-label.
- Blacklist cancer types (annotated with DOID) that should be children of the main cancer type and are used for blacklisting specific types of the main cancer type.
- Tier / Evidence level of the treatment
- Direction (Responsive for the treatment or resistant to the treatment)
- A set of URLs pointing towards the source website which provide extra information about the treatment.
- A set of URLs with extra information about the evidence (e.g. publications backing up the evidence)
The following genomic events and tumor characteristics can be mapped to clinical evidence:
- Genome-wide tumor characteristics such as signatures, MSI status, TML status or viral presence
- Multi-gene events such as gene fusions
- Single gene events such as amplification or general (in)activation of a gene
- Types of mutations in ranges overlapping with specific genes such as:
- inframe insertions in EGFR exon 20
- splice site mutations in MET exon 14
- any type of missense mutation in BRAF codon 600
- Specific missense mutations such as BRAF V600E
In addition to generating a mapping from various genomic events to clinical evidence, SERVE generates the following outputs describing genomic events implied to be able to drive cancer:
- Specific known pathogenic fusion pairs
- Known pathogenic amplifications and deletions
- Known pathogenic exons (exons for which specific mutations are implied to be pathogenic)
- Known pathogenic codons (codons for which generic mutations are implied to be pathogenic)
- Known pathogenic hotspots (specific mutations on specific loci)
Evidence that is defined on a gene-level is checked to make sure that the gene exists in Hartwig's definition of the exome. If a gene does not exist in Hartwig's exome definition the evidence is ignored. For more information about Hartwig's definition of the exome, see HMF Gene Utils.
For fusions, genes are permitted that can exist in the context of a fusion pair (eg @IG genes).
In the supported knowledgebases, there can be events defined on genes that are not part of Hartwig's driver gene panel. Also, an event could be inconsistent with the driver gene panel (e.g. "Inactivation" evidence for a gene that is configured to be an oncogene). There are various ways to deal with such inconsistencies on a per-knowledgebase level:
Filter | Description |
---|---|
FILTER | Every entry is filtered out when the gene/event is not present in Hartwig's driver gene panel or is inconsistent with it |
IGNORE | Every gene/event is used regardless of mismatch/inconsistencies |
WARN_ONLY | Every gene/event is used regardless of mismatch/inconsistencies, but a warning message is shown for the inconsistencies |
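
The three modes above can be illustrated with a minimal Python sketch. This is not SERVE's implementation: the names `DriverInconsistencyMode`, `GeneEvent`, `is_consistent` and `apply_mode`, the `"ONCO"`/`"TSG"` role labels and the dictionary-based driver panel are all assumptions made for illustration. Only the oncogene/inactivation inconsistency mentioned in the text is modeled.

```python
import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger("serve_sketch")


class DriverInconsistencyMode(Enum):
    FILTER = "FILTER"
    IGNORE = "IGNORE"
    WARN_ONLY = "WARN_ONLY"


@dataclass(frozen=True)
class GeneEvent:
    gene: str
    event: str  # hypothetical labels, e.g. "ACTIVATION" or "INACTIVATION"


def is_consistent(entry, driver_panel):
    """driver_panel maps gene name to "ONCO" or "TSG" (assumed representation)."""
    role = driver_panel.get(entry.gene)
    if role is None:
        return False  # gene is not part of the driver gene panel
    # Example from the text: inactivation evidence on an oncogene is inconsistent.
    return not (role == "ONCO" and entry.event == "INACTIVATION")


def apply_mode(entries, driver_panel, mode):
    """Return the entries that survive the configured inconsistency mode."""
    kept = []
    for entry in entries:
        if mode is DriverInconsistencyMode.IGNORE:
            kept.append(entry)  # no check at all
        elif is_consistent(entry, driver_panel):
            kept.append(entry)
        elif mode is DriverInconsistencyMode.WARN_ONLY:
            logger.warning("Driver inconsistency for %s %s", entry.gene, entry.event)
            kept.append(entry)
        # FILTER: inconsistent entries are silently dropped
    return kept
```

With a panel of `{"BRAF": "ONCO"}`, an "INACTIVATION" entry on BRAF would be dropped under `FILTER`, kept under `IGNORE`, and kept with a logged warning under `WARN_ONLY`.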
Evidence on SNVs and small INDELs generally comes in its protein annotated form (e.g. BRAF V600E). SERVE uses transvar to resolve these annotations into genomic coordinates (referred to as hotspots) for the reference genome version that is used by the input knowledgebase.
The first step is to choose what ensembl transcript to use for converting protein annotation back to genomic coordinates:
- If the knowledgebase configured a transcript for a mutation, that transcript is used exclusively.
- If no transcript is configured, SERVE uses the typical transcript used by Hartwig which is generally the canonical transcript defined by ensembl.
- If a protein annotation does not exist on the canonical transcript and no transcript is configured in the knowledgebase, SERVE consistently picks one specific transcript for the protein annotation in case multiple transcripts imply the same hotspot.
If a protein annotated form does not exist on any transcript for a specific gene, the evidence is ignored (see also curation).
Assuming a suitable transcript has been found, N hotspots are derived for each protein annotation as follows:
- In case the mutation is caused by SNV or MNV every possible trinucleotide combination that codes for the new amino acid is generated.
- In case the mutation is caused by a duplication (DUP) or an inframe deletion (DEL) 1 hotspot is generated which assumes the exact reference sequence has been duplicated or deleted.
- In case a DEL can be left-aligned a hotspot is generated for every position between the left-aligned position and the actual position.
- In case the mutation is caused by an inframe insertion (INS) there are two flavors based on the length of the insertion:
- In case 1 amino acid is inserted, hotspots are generated for every trinucleotide coding for that amino acid.
- In case multiple amino acids are inserted, one of the potentially many hotspots is generated. This is just for practical reasons to put a limit on the (exponential) number of variants that can code for a multi-amino-acid insert.
- In case of a complex deletion/insertion (DELINS) the rules for hotspot generation for deletions and insertions are extrapolated. Hence, the reference sequence is assumed to be deleted, and one new nucleotide sequence is inserted unless the insertion is 1 amino acid in which case hotspots are generated for all trinucleotides coding for the inserted amino acid. Complexity of the resulting variant is reduced by removing any bases that are shared between ref and alt at start or end of the variant.
- In case of a frameshift the following hotspots are generated:
- Any of the 12 possible single base inserts inside the affected codon that does not lead to synonymous impact in the affected codon
- Any of the 3 possible single base deletes inside the affected codon that does not lead to synonymous impact in the affected codon
- Any of the 2 possible double base deletes inside the affected codon that does not lead to synonymous impact in the affected codon
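
The SNV/MNV rule above (every trinucleotide coding for the new amino acid) can be sketched as follows. This is a simplified illustration, not SERVE's or transvar's code: it works on a single codon in coding-strand orientation and ignores the mapping to genomic coordinates and strand. The function names are hypothetical. The trimming of shared bases mirrors the complexity reduction described for DELINS.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """All trinucleotides that code for the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def minimal_change(ref_codon, alt_codon):
    """Trim bases shared at the start and end; return (offset, ref, alt)."""
    start = 0
    while start < 3 and ref_codon[start] == alt_codon[start]:
        start += 1
    end = 3
    while end > start and ref_codon[end - 1] == alt_codon[end - 1]:
        end -= 1
    return start, ref_codon[start:end], alt_codon[start:end]


def hotspots_for_substitution(ref_codon, new_amino_acid):
    """One (offset in codon, ref bases, alt bases) tuple per alternative codon."""
    return [
        minimal_change(ref_codon, alt_codon)
        for alt_codon in codons_for(new_amino_acid)
        if alt_codon != ref_codon
    ]
```

For a valine codon `GTG` changing to glutamate (`E`), `hotspots_for_substitution("GTG", "E")` yields one MNV (`TG` to `AA`) and one SNV (`T` to `A`), both starting at the second base of the codon.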
Additionally, hotspot generation is ignored for any INDEL that spans multiple exons. Examples are:
- A DUP which duplicates a codon that is encoded by parts of two separate exons.
- A frameshift which shifts into the intronic space of the gene.
Finally, any INDEL longer than 50 bases is ignored since this is considered to be a structural variant rather than a small INDEL.
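The two exclusion rules for INDELs (spanning multiple exons, or being longer than 50 bases) could be checked as in the sketch below. It assumes exons are given as inclusive `(start, end)` coordinate pairs and that the caller already knows the reference span and length of the INDEL. The names are hypothetical and this is not SERVE's actual code.

```python
MAX_INDEL_LENGTH = 50  # longer events are treated as structural variants


def containing_exon(exons, start, end):
    """Index of the exon that fully contains [start, end], or None."""
    for index, (exon_start, exon_end) in enumerate(exons):
        if exon_start <= start and end <= exon_end:
            return index
    return None


def should_generate_hotspot(exons, indel_start, indel_end, indel_length):
    """False when the INDEL is too long or is not contained in a single exon."""
    if indel_length > MAX_INDEL_LENGTH:
        return False
    return containing_exon(exons, indel_start, indel_end) is not None
```

An INDEL that starts in one exon and ends in the next, or that shifts into intronic space, has no single containing exon and is therefore skipped.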
For evidence defined on a codon or exon level, no protein annotation resolving is done. Instead, genomic coordinates are resolved using the following rules.
First off, evidence on codons and exons are assumed to be defined with respect to the Hartwig canonical transcript.
- If evidence for a specific codon or exon range is defined for a different transcript, this evidence is ignored. Since all variants in Hartwig are annotated in terms of their impact on the Hartwig canonical transcript, resolving this evidence could potentially lead to wrong matching.
- If no transcript is configured in the knowledgebase, it is assumed the canonical transcript is implied.
For ranges that represent exons, the range is extended by 10 bases on both sides of the exon to be able to capture splice variants affecting the exon.
In addition to resolving coordinates, every codon and exon range is annotated with a filter indicating which type(s) of mutations are valid for this range. SERVE tries to determine this based on the information specified in the knowledgebase, but if that information is not sufficient, the Hartwig driver catalog is used to determine the filter.
Filter | Description |
---|---|
NONSENSE_OR_FRAMESHIFT | Only frameshifts or nonsense mutations are valid for this range |
SPLICE | Only splice mutations are valid for this range |
INFRAME | Any inframe INDEL (insert or delete) is valid for this range |
INFRAME_DELETION | Only inframe deletions are valid for this range |
INFRAME_INSERTION | Only inframe insertions are valid for this range |
MISSENSE | Only missense mutations are valid for this range |
ANY | Any mutation is considered valid for this range. |
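
A minimal sketch of how an exon range could be extended by 10 bases and then matched against a variant using the filters above. The class and function names, the variant type labels (such as `"splice"` or `"missense"`) and the coordinates in the usage note are hypothetical. This does not reflect SERVE's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class MutationTypeFilter(Enum):
    NONSENSE_OR_FRAMESHIFT = "NONSENSE_OR_FRAMESHIFT"
    SPLICE = "SPLICE"
    INFRAME = "INFRAME"
    INFRAME_DELETION = "INFRAME_DELETION"
    INFRAME_INSERTION = "INFRAME_INSERTION"
    MISSENSE = "MISSENSE"
    ANY = "ANY"


# Assumed variant type labels accepted by each filter (ANY is handled separately).
ALLOWED_TYPES = {
    MutationTypeFilter.NONSENSE_OR_FRAMESHIFT: {"nonsense", "frameshift"},
    MutationTypeFilter.SPLICE: {"splice"},
    MutationTypeFilter.INFRAME: {"inframe_deletion", "inframe_insertion"},
    MutationTypeFilter.INFRAME_DELETION: {"inframe_deletion"},
    MutationTypeFilter.INFRAME_INSERTION: {"inframe_insertion"},
    MutationTypeFilter.MISSENSE: {"missense"},
}


@dataclass(frozen=True)
class AnnotatedRange:
    gene: str
    chromosome: str
    start: int
    end: int
    mutation_filter: MutationTypeFilter


def exon_range(gene, chromosome, exon_start, exon_end, mutation_filter, splice_extension=10):
    """Build a range for an exon, widened on both sides to capture splice variants."""
    return AnnotatedRange(
        gene, chromosome, exon_start - splice_extension, exon_end + splice_extension, mutation_filter
    )


def matches(rng, chromosome, position, variant_type):
    """True when the variant lies in the range and its type passes the filter."""
    if chromosome != rng.chromosome or not rng.start <= position <= rng.end:
        return False
    if rng.mutation_filter is MutationTypeFilter.ANY:
        return True
    return variant_type in ALLOWED_TYPES[rng.mutation_filter]
```

For example, `exon_range("MET", "7", 1000, 1100, MutationTypeFilter.SPLICE)` (made-up coordinates) covers positions 990 to 1110, so a splice variant at position 995 matches while a missense variant at the same position does not.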
For evidence that is applicable when a gene-wide level event has happened, the type of event required to match evidence to a mutation is derived from the knowledgebase event. In case a knowledgebase provides insufficient details to make a decision, the Hartwig driver catalog is used to determine what event qualifies for the evidence.
Gene level event | Description |
---|---|
AMPLIFICATION | Evidence is applicable when the gene has been amplified. |
DELETION | Evidence is applicable when the gene has been completely deleted from the genome. |
ACTIVATION | Evidence is applicable when a gene has been activated. Downstream algorithms are expected to interpret this. |
INACTIVATION | Evidence is applicable when a gene has been inactivated. Downstream algorithms are expected to interpret this. |
ANY_MUTATION | SERVE does not restrict this evidence based on the type of mutation and considers every type of mutation applicable for this evidence. |
FUSION | Evidence is applicable in case the gene has fused with another gene (either 3' or 5'). |
WILD_TYPE | Evidence is applicable in case no genomic alteration is detected. |
For evidence on fusion pairs, SERVE can add restrictions on which exons are allowed to be involved in the fusion. This is to support evidence on fusions like EGFRvII.
Evidence on fusion pairs where these restrictions are missing can be assumed to be valid for any fusion between the two genes specified.
For evidence that is applicable when a genome-wide event has happened, the type of event required to match evidence to the event is derived from the knowledgebase event. When the knowledgebase event defines a cutoff for this evidence, that cutoff is extracted as well. When no cutoff value is present but one is expected for the characteristic, Hartwig's default cutoff value is used.
Genome wide event | Description |
---|---|
MICROSATELLITE_UNSTABLE | Evidence is applicable when the genome has a MSI status (Hartwig's cutoff >=4) |
MICROSATELLITE_STABLE | Evidence is applicable when the genome does not have a MSI status (Hartwig's cutoff <4) |
HIGH_TUMOR_MUTATIONAL_LOAD | Evidence is applicable when the genome has a high tumor mutational load status (Hartwig's cutoff >=140) |
LOW_TUMOR_MUTATIONAL_LOAD | Evidence is applicable when the genome does not have a high tumor mutational load status (Hartwig's cutoff <140) |
HOMOLOGOUS_RECOMBINATION_DEFICIENT | Evidence is applicable when the genome has a HRD status (Hartwig's cutoff >= 0.5) |
HPV_POSITIVE | Evidence is applicable when viral presence of some form of HPV has been found |
EBV_POSITIVE | Evidence is applicable when viral presence of some form of EBV has been found |
IMMUNO_HLA | Evidence is applicable in case of an HLA type match |
Every patient has a specific HLA class I type in their germline. If this type matches the HLA class I type derived from the knowledgebase, the evidence is applicable to this patient.
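The fallback from a knowledgebase cutoff to Hartwig's default cutoff can be sketched as below. The default values are taken from the table above, except that the low tumor mutational load cutoff is assumed here to be the complement of the high cutoff (<140). The function name and the dictionary layout are hypothetical.

```python
import operator

# Characteristic -> (comparison, default cutoff). The LOW_TUMOR_MUTATIONAL_LOAD
# entry assumes the complement of the high TML cutoff.
DEFAULT_CUTOFFS = {
    "MICROSATELLITE_UNSTABLE": (operator.ge, 4),
    "MICROSATELLITE_STABLE": (operator.lt, 4),
    "HIGH_TUMOR_MUTATIONAL_LOAD": (operator.ge, 140),
    "LOW_TUMOR_MUTATIONAL_LOAD": (operator.lt, 140),
    "HOMOLOGOUS_RECOMBINATION_DEFICIENT": (operator.ge, 0.5),
}


def is_applicable(characteristic, value, knowledgebase_cutoff=None):
    """Compare a measured value against the knowledgebase cutoff, else the default."""
    comparison, default_cutoff = DEFAULT_CUTOFFS[characteristic]
    cutoff = default_cutoff if knowledgebase_cutoff is None else knowledgebase_cutoff
    return comparison(value, cutoff)
```

A tumor mutational load of 150 passes `HIGH_TUMOR_MUTATIONAL_LOAD` with the default cutoff, but not when the knowledgebase specifies a cutoff of 200.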
Per-knowledgebase curation and filtering are applied to harmonize knowledge from different sources and to correct or remove mistakes and evidence that is inconsistent with the HMF driver model.
For VICC the following curation and filtering is applied prior to presenting the data to SERVE:
- General filtering of mutations that are undetectable when analyzing DNA or RNA. Examples are phosphorylation and methylation.
- Determining whether the evidence is supportive of the specified direction. Eg if evidence "does not support" sensitivity we do not generate actionable results from this evidence.
- Filtering of specific mutations:
- Mutations that remove the stop codon. These are simply not interpreted yet by the SERVE main algorithm.
- Synonymous mutations in coding regions are assumed to be benign by SERVE and ignored.
- Fusions that are not considered pathogenic by Hartwig are removed for lack of evidence of pathogenicity (regardless of their level of evidence).
- Events that contradict Hartwig driver catalog. One example is "CCND3 loss" which is assumed to be benign.
- Curation of specific mutations:
- SNVs/INDELs that are not aligned correctly according to HGVS standards are corrected to be HGVS-compliant.
- SNVs/INDELs that have correct notation but simply don't exist on the transcript specified by VICC are removed.
- Fusion pairs for which the genes are in the wrong order are flipped around.
- Genes which are synonyms of genes used in the Hartwig exome definition are renamed.
- Correction of cancer types and DOID annotation:
- Evidence for which the DOID is missing has a DOID manually assigned.
- Evidence on multiple cancer types generally gets a wrong DOID assigned by VICC, which is rectified.
- Correction of drugs for which A or B level evidence exists:
- A whole range of drugs have wrong or inconsistent names in VICC and are rectified by SERVE.
- VICC does not explicitly model the difference between "multiple different drugs" and a "combination treatment of multiple drugs". This gets rectified by SERVE.
DoCM is used exclusively for known hotspot generation. The filtering is therefore tailored for hotspots:
- Entries implying general codon mutations are removed.
- Unusual notations for inframe deletions and insertions are removed.
- Mutations that don't exist on the transcript specified by DoCM are removed.
Also, genes that do not follow HGNC model are renamed to their HGNC name.
For ACTIN, SERVE ingests the molecular inclusion criteria which are extracted from the ACTIN treatment database (see also actin). The inclusion criteria in the trials of the ACTIN database are defined in terms of specific rules.
Rule | When does a patient pass evaluation? |
---|---|
ACTIVATION_OR_AMPLIFICATION_OF_GENE_X | Activating mutation or amplification is found in gene X |
ACTIVATING_MUTATION_IN_GENE_X | Activating mutation is found in gene X |
FUSION_IN_GENE_X | Driver fusion with fusion partner gene X is found |
SPECIFIC_FUSION_OF_X_TO_Y | Driver fusion with 2 specified fusion partner genes is found |
INACTIVATION_OF_GENE_X | Inactivating mutation or deletion/disruption is found in gene X |
MUTATION_IN_GENE_X_OF_TYPE_Y | Specific mutation Y is found in gene X |
AMPLIFICATION_OF_GENE_X | Amplification is found in gene X |
DELETION_OF_GENE_X | Deletion is found in gene X |
WILDTYPE_OF_GENE_X | No driver mutation is found in gene X |
MSI_SIGNATURE | MS Status = MSI |
HRD_SIGNATURE | HR Status = HRD |
TMB_OF_AT_LEAST_X | Tumor Mutational Burden (TMB) should be >= X |
TML_OF_AT_LEAST_X | Tumor Mutational Load (TML) should be >= X |
TML_OF_AT_MOST_X | TML should be <= X |
HAS_HLA_A_TYPE_X | HLA typing should be X |
SERVE configures every trial to A-level evidence with responsive direction. The filtering is predominantly configurable rather than fixed in SERVE. The following filters can be configured in ACTIN:
Filter | Description |
---|---|
FILTER_RULE_ON_GENE | Can be used to remove evidence of a specific rule of a particular gene |
FILTER_MUTATION_ON_GENE | Can be used to remove evidence of a gene with a specific mutation |
FILTER_EVERYTHING_FOR_GENE | Can be used to remove all evidence of a gene (eg. Mutations that are inconsistent with the Hartwig driver catalog) |
FILTER_EVERYTHING_FOR_RULE | Can be used to remove all evidence of a specific rule |
iClusion contributes to actionability only. SERVE configures every trial to B-level evidence with responsive direction. SERVE only considers trials with one or more molecular inclusion criteria. The filtering is predominantly configurable rather than fixed in SERVE. The fixed curation of iClusion that is done in SERVE is mapping gene names and signatures.
The following filters can be configured in iClusion:
Filter | Description |
---|---|
FILTER_EVENT_WITH_KEYWORD | Can be used to remove evidence of a type that is not observable in DNA (eg "hypermethylation") |
FILTER_VARIANT_ON_GENE | Can be used to remove evidence of a gene (eg. Mutations that are inconsistent with the Hartwig driver catalog) |
Finally, cancer types for which no DOIDs have been specified get a DOID assigned by SERVE.
For CKB FLEX curation and filtering is predominantly configurable rather than fixed in SERVE. The only fixed curation done in SERVE is mapping evidence for tumor characteristics (such as MSI or High TMB) to actual characteristics since CKB FLEX models this as "genes".
The following filters can be configured for CKB FLEX, along with an example of how this is used by Hartwig:
Filter | Description |
---|---|
ALLOW_GENE_IN_FUSIONS_EXCLUSIVELY | CKB FLEX uses a hierarchy of events in such a way that every "fusion" is a child of "mutant". For certain genes (eg @IG) we want to ignore the abstract level and only include the fusion evidence since we only handle @IG on a fusion level in the Hartwig pipeline. |
FILTER_EVENT_WITH_KEYWORD | Can be used to remove evidence of a type that is not observable in DNA (eg "hypermethylation") |
FILTER_EXACT_VARIANT_FULLNAME | Any specific variant can be removed through this filter. This is primarily used to remove variants that have a coding impact on their configured refseq transcript in CKB but are non-coding or don't exist on Hartwig's ensembl transcript. |
FILTER_ALL_EVIDENCE_ON_GENE | Is primarily used to remove evidence on genes which are simply not modeled correctly in Hartwig's gene model and hence can't be mapped properly |
FILTER_EVIDENCE_FOR_EXONS_ON_GENE | Some genes may have evidence on specific exons which don't exist on the ensembl transcript used by Hartwig |
FILTER_SECONDARY_GENE_WHEN_FUSION_LEG | Usage of this filter is similar to the use case for removing all evidence on genes. |
External knowledgebases generally define their knowledge for one specific reference genome version (v37 or v38). SERVE merges knowledgebases defined in either v37 or v38 reference genome versions. In addition SERVE generates its output for both reference genome v37 and v38.
Knowledge is extracted from any knowledgebase with respect to the ref genome version of that knowledgebase. The following resources are used in a ref-dependent manner during knowledge extraction:
- The reference genome fasta file
- The definition of Hartwig driver genes
- The definition of Hartwig known fusions
- The Hartwig ensembl data cache
Once per-knowledgebase extraction is done, the extraction results are merged into a v37 and a v38 version. Any extraction for a v37 knowledgebase is taken over unchanged in the v37 output, and the same holds for any v38 knowledgebase into the v38 output. For the remaining cases, the following conversion algorithm is executed.
Hotspots and ranges are lifted over using HTSJDK's implementation of UCSC LiftOver. In case lift-over could not be performed a warning is raised unless the position is known to not exist in the target ref genome. In addition, any lift-over that lifts a position towards a different chromosome is raised as a warning but is accepted nonetheless.
There are a few additional checks for specific types of knowledge:
- A hotspot lift-over is accepted only in case the reference at the lifted position has remained unchanged.
- A codon lift-over is accepted only in case the lifted range has a length of 3 bases.
- Transcripts and codon/exon indices are removed from known codons and exons since they can't be trusted anymore after lift-over
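The additional acceptance checks for lifted hotspots and codons can be sketched as follows. The lift-over itself (done with HTSJDK in SERVE) is not reproduced here. The sketch assumes the lifted coordinates are already available and that the target reference can be queried through a lookup callback. `Hotspot`, `make_lookup`, `accept_lifted_hotspot` and `accept_lifted_codon` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hotspot:
    chromosome: str
    position: int  # 1-based
    ref: str
    alt: str


def make_lookup(sequences):
    """Build a reference lookup over an in-memory dict of chromosome -> bases."""
    def lookup(chromosome, position, length):
        return sequences[chromosome][position - 1:position - 1 + length]
    return lookup


def accept_lifted_hotspot(original, lifted_chromosome, lifted_position, ref_lookup):
    """Return the lifted hotspot, or None when the reference bases have changed."""
    target_ref = ref_lookup(lifted_chromosome, lifted_position, len(original.ref))
    if target_ref != original.ref:
        return None
    return Hotspot(lifted_chromosome, lifted_position, original.ref, original.alt)


def accept_lifted_codon(lifted_start, lifted_end):
    """A lifted codon range is only accepted when it still spans exactly 3 bases."""
    return lifted_end - lifted_start + 1 == 3
```

In a real setting the lookup would read from the target reference genome fasta file. The in-memory dictionary used by `make_lookup` only stands in for that.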
Genes are lifted between reference genomes using Hartwig's internal gene mapping. This impacts the following types of events:
- Known and actionable ranges
- Known and actionable copy numbers
- Actionable gene events
- Known and actionable fusion pairs
Do note that for fusion pairs, additional annotation remains unchanged assuming exonic ranges relevant for known and actionable fusions remain identical between ref genome versions.
In case the genomic region of a gene has been flipped between v37 and v38 we exclude the gene from liftover.
Every knowledgebase can be enabled or disabled through configuration. SERVE starts with reading the various knowledgebases which are enabled. Knowledge is extracted after applying filtering and curation. A knowledgebase can contribute to known and/or actionable events. The current configuration is as follows:
Knowledgebase | Ref genome version | Contributes to known events? | Contributes to actionable events? |
---|---|---|---|
ACTIN | v37 | No | Yes |
CKB FLEX | v38 | Yes | Yes |
DoCM | v37 | Yes | No |
Hartwig Cohort | v37 | Yes | No |
Hartwig Curated | v37 | Yes | No |
iClusion | v37 | No | Yes |
VICC | v37 | Yes | Yes |
Knowledge extraction is performed on a per-knowledgebase level after which all events are consolidated as follows:
- All known events are aggregated on a per-event level where every event has a set of knowledgebases in which the event has been defined as pathogenic.
- All actionable events are concatenated. Every actionable event that is present in multiple knowledgebases will be present multiple times in the actionable output.
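The difference between the two consolidation strategies can be sketched as below, assuming events are hashable values and the per-knowledgebase extraction results are held in dictionaries keyed by knowledgebase name. The function names and data layout are hypothetical.

```python
from collections import defaultdict


def consolidate_known(known_per_knowledgebase):
    """Aggregate per event: each event maps to the set of knowledgebases defining it."""
    sources_per_event = defaultdict(set)
    for knowledgebase, events in known_per_knowledgebase.items():
        for event in events:
            sources_per_event[event].add(knowledgebase)
    return dict(sources_per_event)


def consolidate_actionable(actionable_per_knowledgebase):
    """Concatenate: an event present in several knowledgebases appears several times."""
    consolidated = []
    for knowledgebase, events in actionable_per_knowledgebase.items():
        consolidated.extend((knowledgebase, event) for event in events)
    return consolidated
```

A hotspot reported by both DoCM and VICC ends up as a single known event with two sources, whereas the same actionable event reported by two knowledgebases yields two entries in the actionable output.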
Within the Hartwig pipeline, SERVE output is used in the following manner:
- The known output is used in various algorithms for various purposes. For example, the known hotspots produced by SERVE are used by SAGE as the definition of the highest tier of calling and HOTSPOT annotation.
- The actionable output is the database that PROTECT bases its clinical evidence matching on.
- Upcoming
- Support passing the raw input string of the input knowledgebases through to the actionable output files
- Support wild-type events as gene-level evidence
- Support HLA class I type as a new actionability option
- The filtering of iClusion events is moved to an input file instead of being done inside SERVE
- Create a link to the CKB evidence for CKB Boost (web based)
- Evidence on actionable signatures can be applicable with different cut-offs. The cut-off values of the different signatures are now supported (eg. TML >= 140)
- Support the possibility to blacklist specific tumor locations for particular treatments
- Add an option to filter evidence when there are driver inconsistencies
- Support curation of gene coordinates, because BRAF has the wrong coordinates in the ensembl data cache
- Support the interpretation of the new cancer types (DOIDs) of the CKB knowledgebase (JAX:10000009 and JAX:10000008)
- 1.8
- Add full support for HGNC gene model implied by HMF Gene Utils
- Removed gene name mapping from ref genome converter when mapping between 37 and 38 (since gene names are always equal)
- Remove gene name mapping from CKB gene extractor since CKB follows HGNC as well.
- The ensembl data cache is used for resolving genes and canonical transcripts
- Transvar temporarily uses a new "new to old" gene mapping
- Support for gene mapping in DoCM which does not follow HGNC
- Add range annotation to (actionable) range:
- Rename exonIndex to exonRank in KnownExons
- Rename codonIndex to codonRank in KnownCodons
- Add transcript, rangeType and rank to ActionableRange
- Support for HRD in CKB
- 1.7
- Extend config file from source iClusion
- Extend config file from source CKB
- "Advanced solid tumor" in iClusion is mapped to DOID 162 rather than 0050686 to avoid missing it for tumors with unknown tumor type
- 1.6
- Add filter in VICC extraction to ignore evidence that does not support the direction when generating actionability.
- 1.5
- "Advanced solid tumor" in CKB is mapped to DOID 162 rather than 0050686 to avoid missing it for tumors with unknown tumor type
- 1.4
- Various additional checks to ref genome lift-over (such as filtering of events on genes for which strand has flipped).
- CKB FLEX filtering framework has been added.
- Solve bug when generating hotspots from MNVs that cross exon boundaries
- 1.3
- Support for merging sources that differ in ref genome version (v37 vs v38).
- Support for generating output for both ref genome version v37 and v38.
- Driver catalog warnings are disabled for VICC.
- KnownExons and KnownCodons are sorted more explicitly to make sure files don't change upon identical input.
- An index file is generated for KnownHotspots VCF (for both v37 and v38).
- 1.2
- Consistently pick a specific transcript for hotspot annotation in case multiple transcripts imply the same hotspot.
- Extend splice sites from 5 bases to 10 bases beyond exon boundaries.
- Add support for evidence for actionable viral presence (starting with EBV and HPV presence).
- Add support for evidence of absence of high TMB and MSI.
- Renamed actionable signatures to actionable characteristics.
- 1.1
- Ability to switch every resource on/off independently.
- More predictable sorting of knowledge to ensure identical output on identical input.
- Ability to curate VICC evidence levels in case they are suspicious.
- 1.0
- Initial release.