This repository has been archived by the owner on Jan 25, 2023. It is now read-only.

About assemblies #270

Open
teemukataja opened this issue Feb 13, 2019 · 3 comments

Comments

@teemukataja
Contributor

This is related to #222

Currently the specification states that assemblyId should be given in GRCh format. But what if a dataset that is older than GRCh is shared via Beacons, and was not sequenced using an assembly that is directly translatable to modern assemblies? In beacon-python we started to use the regex ^((GRCh|hg)[0-9]+([.]?p[0-9]+)?)$ for assemblyId validation, which allows the following formats:

GRCh37
GRCh37p13
GRCh37.p13
hg19
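
As an illustration of how that check behaves, here is a minimal Python sketch built around the regex quoted above. The function name `is_valid_assembly_id` is made up for this example and is not taken from beacon-python; `re.fullmatch` is used so that a trailing newline is rejected as well.

```python
import re

# The pattern quoted in this issue: a "GRCh" or "hg" prefix, a build number,
# and an optional patch suffix written as "p13" or ".p13".
ASSEMBLY_ID_PATTERN = re.compile(r"^((GRCh|hg)[0-9]+([.]?p[0-9]+)?)$")


def is_valid_assembly_id(assembly_id: str) -> bool:
    """Return True if assembly_id matches the GRCh/hg naming pattern.

    Hypothetical helper for illustration only.
    """
    return ASSEMBLY_ID_PATTERN.fullmatch(assembly_id) is not None


if __name__ == "__main__":
    for candidate in ["GRCh37", "GRCh37p13", "GRCh37.p13", "hg19", "NCBI36"]:
        print(candidate, is_valid_assembly_id(candidate))
```

Note that the pattern is case-sensitive and rejects names outside the GRCh/hg families, such as NCBI36, which is the limitation this issue asks about.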

I found out that the hg notation can be used to some extent, as it has a translation for both NCBI and GRC assemblies. Are there other common assemblies that are used and should be supported? Is there a reason that only GRC notation should be enforced, or should we broaden the allowed assemblies?

I believe @mbaudis might have some knowledge on this matter?

@cyenyxe

cyenyxe commented Jul 2, 2019

I would strongly recommend using sequence accessions instead of names, because they are completely unambiguous (they clearly refer to a unique version of an assembly), and at the same time can be mapped against multiple names in a GUI for user convenience.

Using sequence accessions would also make it possible to support non-human species and sequences that are not just assemblies.

@mbaudis
Member

mbaudis commented Jul 3, 2019

@cyenyxe A problem here is the support at the resource level, especially when doing federated queries. With the original Beacon being more of a "social experiment", it was easier to provide limited, fixed options.

I agree that attributes like assemblyId or chromosome ... should be specified by referencing some external standard, and then specific environments, networks ... can document which values will be supported. I guess this will be part of the current "re-thinking" for v2. Pinging @sdelatorrep @jrambla for taking note.

@cyenyxe

cyenyxe commented Jul 3, 2019

The standard for sequence accessioning would be the one defined by the INSDC consortium, which is made up of ENA, GenBank and DDBJ. For instance, GCA_000001405.14 identifies GRCh37.p13.
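
To make the accession-to-name mapping concrete, here is a rough Python sketch. The accession shape it checks ("GCA_", nine digits, a dot, a version number) is an assumption inferred from the example above, and both the lookup table and the `display_names` function are hypothetical. Only the one pairing mentioned in this comment is filled in.

```python
import re
from typing import Dict, List

# Assumed shape of an assembly accession, inferred from GCA_000001405.14:
# "GCA_", nine digits, a dot, then a version number.
ACCESSION_PATTERN = re.compile(r"GCA_[0-9]{9}\.[0-9]+")

# Hypothetical lookup table from accession to human-readable names.
# Only the pairing stated in this thread is included.
ACCESSION_TO_NAMES: Dict[str, List[str]] = {
    "GCA_000001405.14": ["GRCh37.p13"],
}


def display_names(accession: str) -> List[str]:
    """Return the known display names for an accession (possibly empty)."""
    if ACCESSION_PATTERN.fullmatch(accession) is None:
        raise ValueError(f"not a recognised accession format: {accession!r}")
    return ACCESSION_TO_NAMES.get(accession, [])


if __name__ == "__main__":
    print(display_names("GCA_000001405.14"))
```

A query would then carry the accession itself, while a GUI could show whatever names the table lists for it.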
