This repository has been archived by the owner on Jan 25, 2023. It is now read-only.
Currently the specification states that assemblyId should be given in GRCh format. But what if a dataset that is older than GRCh is shared via Beacons, and wasn't sequenced using an assembly that is directly translatable to modern assemblies? In beacon-python we started to use the regex ^((GRCh|hg)[0-9]+([.]?p[0-9]+)?)$ for assemblyId validation, which allows the following formats:
GRCh37
GRCh37p13
GRCh37.p13
hg19
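For illustration, here is a minimal sketch of how that regex behaves on the formats listed above. The pattern is the one quoted in this issue; the function name `is_valid_assembly_id` is hypothetical and is not taken from beacon-python.

```python
import re

# Pattern quoted in this issue: a GRCh or hg prefix, a release number,
# and an optional patch suffix such as "p13" or ".p13".
ASSEMBLY_ID_PATTERN = re.compile(r"^((GRCh|hg)[0-9]+([.]?p[0-9]+)?)$")


def is_valid_assembly_id(assembly_id: str) -> bool:
    """Return True if assembly_id matches the pattern (hypothetical helper)."""
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    return ASSEMBLY_ID_PATTERN.fullmatch(assembly_id) is not None


if __name__ == "__main__":
    for candidate in ["GRCh37", "GRCh37p13", "GRCh37.p13", "hg19", "NCBI36", "grch37"]:
        print(candidate, is_valid_assembly_id(candidate))
```

Note that the prefix is case-sensitive and that older names such as NCBI36 are rejected, which is the limitation raised in this issue.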
I found out that the hg notation can be used to some extent, as it has a translation for both NCBI and GRC assemblies. Are there other common assemblies that are used and should be supported? Is there a reason that only GRC notation should be enforced, or should we broaden the allowed assemblies?
I believe @mbaudis might have some knowledge on this matter?
I would strongly recommend using sequence accessions instead of names, because they are completely unambiguous (they clearly refer to a unique version of an assembly), and at the same time can be mapped against multiple names in a GUI for user convenience.
Using sequence accessions would also make it possible to support non-human species and sequences that are not just assemblies.
@cyenyxe A problem here is the support at the resource level, especially when doing federated queries. With the original Beacon being more of a "social experiment", it was easier to provide limited, fixed options.
I agree that attributes like assemblyId or chromosome ... should be specified by referencing some external standard, and then specific environments, networks ... can document which values will be supported. I guess this will be part of the current "re-thinking" for v2. Pinging @sdelatorrep @jrambla to take note.
The standard for sequence accessioning would be the one defined by the INSDC consortium, which is made up of ENA, GenBank and DDBJ. For instance, GCA_000001405.14 identifies GRCh37.p13.
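As a rough sketch of the accession-based approach, a Beacon could accept an accession as the identifier and map it to a display name for the user. The lookup table holds only the one pair mentioned in this thread. The accession pattern (the prefix GCA_, nine digits, a dot and a version number) and the names `ACCESSION_PATTERN`, `ACCESSION_TO_NAME` and `resolve_assembly_name` are assumptions made for this example, not part of the Beacon specification.

```python
import re
from typing import Optional

# Assumed shape of an INSDC assembly accession: "GCA_" + 9 digits + "." + version.
ACCESSION_PATTERN = re.compile(r"GCA_[0-9]{9}\.[0-9]+")

# Hypothetical lookup table; only the pair mentioned in this thread is listed.
ACCESSION_TO_NAME = {
    "GCA_000001405.14": "GRCh37.p13",
}


def resolve_assembly_name(accession: str) -> Optional[str]:
    """Return the display name for a known accession, or None otherwise."""
    if ACCESSION_PATTERN.fullmatch(accession) is None:
        return None  # not shaped like an assembly accession
    return ACCESSION_TO_NAME.get(accession)


if __name__ == "__main__":
    print(resolve_assembly_name("GCA_000001405.14"))
    print(resolve_assembly_name("GRCh37.p13"))
```

Because the accession carries its version, a query keyed on it stays unambiguous, while a GUI can still show a familiar name by consulting such a table.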
This is related to #222