[Review recommendation] Document the reference genome used for each sample #79
Comments
This sounds straightforward to me. We can just add another .janno column.
Yes, in principle this one is easy and definitely useful. The devil is in the details, though. In many cases, people simply don't know the exact reference, at least not by the ID used by ENA. Most people here, for example, use "hs37d5" or "hg19", and would find it hard to determine which exact assembly ID that corresponds to. I think we might have to make this a free-text field and then come up with some policy for the Archive and curate these things upon submission. @TCLamnidis what do you think?
afaik, hg19 and hs37d5 are identical in chromosomes 1-22, X, Y, with the only differences in mtDNA and added contigs.
OK. Regarding the species question, I would simply specify in the schema that if it's human, then the entry must follow a certain naming scheme. If it's not human, it can be anything? That would be validatable using a new Species field. Regarding the assembly name: @TCLamnidis the issue is that there isn't a single "GRCh38" or the like. There are patch versions, too, as you can see in the "Revision history" (bottom of this webpage): https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/ So we would then really have to be quite specific and ask for the exact patch version, so something like "GRCh38.p14". But this is what I meant: most people simply don't know which version they used, and they may never find out, because either their mapping command lines are in limbo, or, even if they still have all the Eager runs, the FASTA files they referenced may be lost. It's really a bit of a tricky question.
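To make the species-dependent rule above concrete, here is a minimal sketch of what such a check could look like. Everything in it is an assumption for illustration: the function name `validate_assembly`, the species label "Homo sapiens", and the naming pattern (a GRCh major release with an optional patch suffix such as "GRCh38.p14") are hypothetical and not part of the current .janno schema.

```python
import re

# Hypothetical naming scheme for human assemblies (an assumption, not the
# Poseidon schema): a GRCh major release plus an optional patch suffix.
HUMAN_ASSEMBLY = re.compile(r"GRCh(37|38)(\.p\d+)?")


def validate_assembly(species: str, assembly: str) -> bool:
    """Return True if the assembly name is acceptable for the given species."""
    if species.strip().lower() == "homo sapiens":
        # Human entries must follow the fixed naming scheme.
        return HUMAN_ASSEMBLY.fullmatch(assembly.strip()) is not None
    # Non-human entries: any non-empty string is accepted.
    return bool(assembly.strip())
```

Under this sketch, "GRCh38.p14" would pass for a human sample, while lab shorthand such as "hg19" or "hs37d5" would be rejected and would need curation.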
I'm against a free text field. A clear accession number adds much more value and if "in the EVA archive, every VCF dataset has a 'Genome Assembly' metadata field specifying the accession number of the reference genome used", as the reviewer writes, then I don't think we should expect less. But I also understand Stephan's practical concern of people not knowing what they used for samples published in the past. For this case we could recommend certain patch releases most representative for a given main release. Or allow a range of releases. Or add not one, but two columns:
Why are there no accession IDs for animal reference genomes? You know my ignorance on these topics, but here's an accession number of a sheep reference genome: GCA_000298735.1 by the "International Sheep Genome Consortium".
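Because assembly accessions share one format across species (a GCA or GCF prefix, nine digits, and a version number), a single format check could cover human and non-human samples alike. The following is a sketch only; the helper name `parse_accession` and the returned dictionary layout are made up for illustration, and it checks the format of the string, not whether the accession actually exists in a database.

```python
import re
from typing import Optional

# Assembly accession format: "GCA" (GenBank) or "GCF" (RefSeq), an underscore,
# nine digits, a dot, and a version number.
ACCESSION = re.compile(r"(GCA|GCF)_(\d{9})\.(\d+)")


def parse_accession(accession: str) -> Optional[dict]:
    """Split an assembly accession into its parts, or return None if malformed."""
    match = ACCESSION.fullmatch(accession.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return {"prefix": prefix, "number": number, "version": int(version)}


# The sheep accession from the comment above and the human one from the NCBI link.
print(parse_accession("GCA_000298735.1"))
print(parse_accession("GCF_000001405.26"))
print(parse_accession("hg19"))
```

Keeping the version as a separate integer would also make it straightforward to accept a range of releases, as suggested earlier in the thread.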
I like these ideas. You're right, ultimately expecting a concrete assembly ID is the right thing to do. Perhaps we can actually find out what our ominous "hg19" or "hs37d5" genomes actually are and then simply add these assemblies for past datasets and spread the word. And I agree this could simply extend towards non-human species, I think!
This recommendation was raised in the review of the Poseidon paper.