Genome annotations for segmented CDSs etc #1280
Comments
Thanks for your work and thoughts here! Just came across this when trying to make a nice ORF1ab dataset for Nextclade v3. I noticed that |
Auspice now supports multi-part CDS annotations.
This PR makes `augur translate` output the correct annotation for complex CDSs.
Complex CDSs are detected through `type(feat.location)` being `SeqFeature.CompoundLocation`.
Currently, this only works with GenBank files, as our `load_features` utility function does not deal with compound CDSs in GFF files correctly.
I tested this with MERS's complex ORF1ab.
Partially resolves #1280
@jameshadfield I think I've managed to make Can you explain what you mean by "Implement the within-CDS features later on once Auspice supports them?"? |
Nice!
The ability for annotations to define regions within a CDS, e.g. RBD in spike, epitope sites in HA, which Auspice can then show. Here's (slack) a prototype, but the UI hasn't been iterated on much. I don't think any schema's been decided on, but we can do that now if it's helpful? |
Auspice will soon be able to parse an extended version of the genome_annotation which will allow segmented CDSs, wrapping CDSs and extra metadata. We need a way to export this information from Augur.
Schema changes
general
- The `nuc` strand cannot be `"-"` (negative-strand genomes are represented as their reverse complement). This isn't a change to Auspice's behavior but is now enforced. The strand is optional for `nuc`.
- Each key/object pair in `genome_annotations` now corresponds to a CDS rather than a gene, and our language (help / schema etc.) should be updated accordingly.
- CDS length is now verified to be a multiple of 3 (within Auspice); if it's not, the CDS is not displayed.
segmented CDS
The `start` and `end` properties may be omitted and replaced with an array of segments, each with 1-based (GFF) coordinates.
The order of segments is important and corresponds to the order the respective translations appear in the protein sequence. Note that `start` < `end` always, even if the CDS is on the negative strand.
(The `name` for an individual segment may or may not be part of the schema; I will update this once we've finalised it in Auspice.)

wrapping CDS
A wrapping CDS may be expressed via segments (as above) or by specifying an `end` coordinate beyond the length of the genome, following GFF format.

(optional) metadata
- `gene?: string`: Displayed in the on-hover tooltip in Auspice. If multiple CDSs have the same gene then they will be given the same colour by Auspice.
- `color?: string`: User-specified colour for the CDS. The value must be a CSS colour string or a colour hex.
- `display_name?: string`: A more verbose name for the CDS, shown in the on-hover info box.
- `description?: string`: Shown in the on-hover info box.

(future) within CDS annotations
This isn't yet in Auspice, but will be, so it is important context to consider when implementing the Augur side of things. The syntax may change slightly.
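Setting the future within-CDS annotations aside, the rules for segmented and wrapping CDSs described earlier can be made concrete with a small Python sketch. It is only an illustration: the key name `segments`, the helper names (`cds_segments`, `cds_length`, `check_cds`, `unwrap`), the entry names and all coordinates are assumptions made up for this example, and none of it is Auspice's actual validation code.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based and inclusive, as in GFF

GENOME_LENGTH = 100  # hypothetical genome length


def cds_segments(cds: Dict) -> List[Segment]:
    """Return the CDS as an ordered list of (start, end) pairs."""
    if "segments" in cds:  # "segments" is an assumed key name
        return [(seg["start"], seg["end"]) for seg in cds["segments"]]
    return [(cds["start"], cds["end"])]


def cds_length(cds: Dict) -> int:
    """Number of nucleotides covered by all segments of the CDS."""
    return sum(end - start + 1 for start, end in cds_segments(cds))


def check_cds(cds: Dict) -> List[str]:
    """Collect violations of the two rules stated in the issue text."""
    problems = []
    for start, end in cds_segments(cds):
        if not 1 <= start < end:  # start < end on either strand
            problems.append(f"segment {start}..{end}: need 1 <= start < end")
    if cds_length(cds) % 3 != 0:
        problems.append(f"length {cds_length(cds)} is not a multiple of 3")
    return problems


def unwrap(cds: Dict, genome_length: int) -> List[Segment]:
    """Rewrite a CDS whose end lies beyond the genome end as two segments."""
    start, end = cds["start"], cds["end"]
    if end <= genome_length:
        return [(start, end)]
    parts = [(start, genome_length), (1, end - genome_length)]
    # Segment order follows translation order, so it flips on the "-" strand.
    return parts[::-1] if cds.get("strand") == "-" else parts


# Made-up entries; only the key names mentioned in the issue are used.
annotations = {
    "nuc": {"start": 1, "end": GENOME_LENGTH, "strand": "+"},
    "segmented_example": {
        "strand": "+",
        "gene": "exampleGene",
        # The second segment starts on the last base of the first one.
        "segments": [{"start": 10, "end": 30}, {"start": 30, "end": 50}],
    },
    "wrapped_example": {"start": 91, "end": 111, "strand": "+"},
    "broken_example": {"start": 1, "end": 10, "strand": "+"},
}

for name, cds in annotations.items():
    if name == "nuc":
        continue  # "nuc" describes the genome itself, not a CDS
    print(name, check_cds(cds) or "ok")
```

In this sketch a CDS whose length is not a multiple of 3 is only reported, whereas the issue says Auspice will not display it. `unwrap` shows that the two ways of writing a wrapping CDS carry the same information.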
Current augur workflow
(Remember that augur currently can't handle complex CDSs)
- `augur ancestral` infers nucleotide sequences and adds the "nuc" annotation to the resulting node-data JSON.
- `augur translate` uses a GenBank / GFF file to translate simple CDSs and creates per-CDS (per-gene) annotations in the resulting node-data JSON.
- `augur export` simply passes this annotations block through to the final dataset JSON.

Future augur workflow
I believe the typical augur workflow, especially for complex CDSs, will be:
- `augur ancestral` to infer the translated sequences, per CDS, for internal nodes (implemented via this PR). This is probably the place to also generate the annotations block, which means that `augur ancestral` will need to parse the GFF file initially used by Nextclade.
- `augur export` simply passes this annotations block through to the final JSON, as it currently does.

I'm not sure what this means for the future of `augur translate`, but the approach we use within `augur ancestral` to parse the GFF (or GenBank?) file and create the resulting JSON annotations could also be used by `augur translate` if we wish to do so in the future.

There is still the question of how to export optional metadata for each CDS, such as "display name", "color" etc. For the time being I think it's ok to leave this as a script-based "optional extra" for workflows, but perhaps others see a nice way to implement this.
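As a rough illustration of the GFF-to-annotations step and the script-based metadata "optional extra" discussed above, here is a standalone Python sketch. It is not augur's implementation. The function names, the choice of the `Name` attribute (falling back to `ID`) as the annotation key, the `segments` key name and the example GFF are all assumptions. It also assumes that rows of a multi-part CDS appear in the file in translation order.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical GFF3 content; "cdsA" is split over two rows sharing one ID.
EXAMPLE_GFF = """\
##gff-version 3
##sequence-region ref 1 100
ref\t.\tCDS\t10\t30\t.\t+\t0\tID=cds-1;Name=cdsA
ref\t.\tCDS\t30\t50\t.\t+\t0\tID=cds-1;Name=cdsA
ref\t.\tCDS\t60\t89\t.\t+\t0\tID=cds-2;Name=cdsB
"""

# Optional per-CDS keys listed in the issue.
ALLOWED_METADATA = {"gene", "color", "display_name", "description"}


def parse_attributes(column: str) -> Dict[str, str]:
    """Split the ninth GFF column ("key=value;key=value") into a dict."""
    pairs = (item.split("=", 1) for item in column.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def annotations_from_gff(gff_text: str) -> Dict[str, Dict]:
    """Group CDS rows by name and emit start/end or a list of segments."""
    rows = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, pragmas and comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        attrs = parse_attributes(fields[8])
        name = attrs.get("Name") or attrs.get("ID")  # assumed naming rule
        if name is None:
            continue
        rows[name].append(
            {"start": int(fields[3]), "end": int(fields[4]), "strand": fields[6]}
        )

    annotations = {}
    for name, parts in rows.items():
        strand = parts[0]["strand"]
        if len(parts) == 1:
            annotations[name] = {
                "start": parts[0]["start"],
                "end": parts[0]["end"],
                "strand": strand,
            }
        else:
            # File order is assumed to match the order of translation.
            segments = [{"start": p["start"], "end": p["end"]} for p in parts]
            annotations[name] = {"strand": strand, "segments": segments}
    return annotations


def add_metadata(annotations: Dict[str, Dict], extra: Dict[str, Dict]) -> Dict[str, Dict]:
    """Return a copy of the annotations with optional per-CDS metadata merged in."""
    merged = {name: dict(cds) for name, cds in annotations.items()}
    for name, fields in extra.items():
        unknown = set(fields) - ALLOWED_METADATA
        if name not in merged or unknown:
            raise ValueError(f"cannot attach metadata to {name!r}: {sorted(unknown)}")
        merged[name].update(fields)
    return merged


annotations = annotations_from_gff(EXAMPLE_GFF)
decorated = add_metadata(
    annotations, {"cdsB": {"display_name": "Example CDS B", "color": "#1f77b4"}}
)
print(decorated)
```

Keeping the metadata merge as a separate function mirrors the idea of leaving it to a workflow script: the parser only produces coordinates, and anything cosmetic is layered on afterwards.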
Related issues / discussions:
#953 covers GFF parsing in augur
We've discussed parsing GFFs within Augur recently in this slack thread
Auspice PR related to this work is here
Which pathogens does this affect?
For the time being, not many.
Path forward
This can be implemented in steps / multiple PRs
I will try to do this shortly. This is done in "Annotations schema updates" #1281.