Right now, Poseidon is officially flexible with respect to the SNP set of the genotype files, but the archives are not. All public archives currently support only the 1240K format, or the HO subset of it.
It would be desirable to support more call-sets in the future, particularly as more shotgun sequencing is being done. Because our Poseidon-IDs are supposed to be unique per archive, we cannot place multiple packages with different call-sets into the same archive. That leaves two basic options to consider:
Option 1: Split archives to allow call sets other than 1240K, for example a "Community-Archive 1000G calls" or similar. This does not require any change in the schema and would be straightforward to do with the current infrastructure.
Pros: Simple, non-breaking and in principle immediately doable.
Cons: Metadata will be duplicated across archives, which causes redundancy and may require complex syncing infrastructure to update Janno files across archives.
Option 2: Extend the schema to allow for multiple genotype-datasets within one package
This is currently not possible, but is in principle not hard to implement. We would simply allow the YAML-schema to list multiple genotype datasets, each with their own snpset and separate genotype files.
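As a rough sketch of what such an extension might look like once a package definition has been parsed, the following Python models a package that holds several genotype datasets. The class names, field names, file names and the two helper methods are hypothetical illustrations, not part of the current Poseidon schema or of trident; the fields only loosely mirror the existing genotype entries (format, file paths, snpSet), and the "1000G" label is an assumed name for a second call-set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenotypeData:
    """One call-set of a package (hypothetical model)."""
    format: str
    geno_file: str
    snp_file: str
    ind_file: str
    snp_set: str


@dataclass
class Package:
    """A package that may list more than one genotype dataset."""
    title: str
    genotype_data: List[GenotypeData] = field(default_factory=list)

    def primary_genotype(self) -> GenotypeData:
        # Treat the first listed dataset as the primary call-set.
        if not self.genotype_data:
            raise ValueError(f"package {self.title!r} has no genotype data")
        return self.genotype_data[0]

    def genotype_for(self, snp_set: str) -> GenotypeData:
        # Look up a dataset by its snp_set label.
        for dataset in self.genotype_data:
            if dataset.snp_set == snp_set:
                return dataset
        raise KeyError(snp_set)


# Illustrative package with two call-sets; all file names are made up.
example = Package(
    title="Example_Package",
    genotype_data=[
        GenotypeData("EIGENSTRAT", "ex.1240K.geno", "ex.1240K.snp", "ex.1240K.ind", "1240K"),
        GenotypeData("EIGENSTRAT", "ex.1000G.geno", "ex.1000G.snp", "ex.1000G.ind", "1000G"),
    ],
)
```

Treating the first entry as the primary call-set keeps existing single-dataset packages valid under this model, since they would simply be lists of length one.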
There is one catch: the Janno file contains several columns that are specific to the call-set (Genotype_Ploidy, Data_Preparation_Pipeline_URL, Nr_SNPs, Coverage_on_Target_SNPs). These could be turned into list columns, which would be non-breaking.
Pros: A non-redundant solution with respect to package metadata and Janno files, as these would not be duplicated.
Cons: Would require additional implementation work in the server and in the trident list and forge functionality. A minimal solution would be to have trident ignore any call-set after the first, and to work out support for secondary call-sets later. However, at least with fetch, large-scale adoption of multiple call-sets would result in much larger package downloads. So perhaps the server software should be changed to create multiple zip files for download, with or without secondary call-sets.
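To illustrate the list-column idea together with the minimal "use only the first call-set" fallback, here is a small sketch. It assumes that list entries are separated by semicolons and that the i-th entry of each call-set-specific column belongs to the i-th genotype dataset. The separator, the helper names and the sample values are assumptions for illustration, not trident behaviour or real data.

```python
from typing import Dict, List, Optional

# Janno columns named in the issue as specific to the call-set.
CALLSET_COLUMNS = (
    "Genotype_Ploidy",
    "Data_Preparation_Pipeline_URL",
    "Nr_SNPs",
    "Coverage_on_Target_SNPs",
)


def split_list_column(value: Optional[str], sep: str = ";") -> List[str]:
    """Split a list-column cell into entries; an empty cell gives an empty list."""
    if value is None or value.strip() == "":
        return []
    return [entry.strip() for entry in value.split(sep)]


def values_for_callset(row: Dict[str, str], index: int = 0) -> Dict[str, Optional[str]]:
    """Pick the entry at `index` from every call-set-specific column.

    Missing columns or lists that are too short yield None.
    """
    result: Dict[str, Optional[str]] = {}
    for column in CALLSET_COLUMNS:
        entries = split_list_column(row.get(column, ""))
        result[column] = entries[index] if index < len(entries) else None
    return result


# Made-up row with two call-sets.
row = {
    "Poseidon_ID": "Sample1",
    "Genotype_Ploidy": "haploid;diploid",
    "Nr_SNPs": "600000;4500000",
    "Coverage_on_Target_SNPs": "0.8;1.2",
}

first = values_for_callset(row, 0)   # what a "first call-set only" reader would see
second = values_for_callset(row, 1)  # values for a secondary call-set
```

Under these assumptions a Janno file with single values in these columns reads as one-element lists, so index 0 returns the same value as today, which is the sense in which the change would be non-breaking.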
OK, we briefly discussed this in our Meeting on September 13. Given our limited dev-resources, Option 1 is more likely.
One compromise would be to expand Minotaur once Eager3 is out and create additional pull-downs, perhaps even with imputation, and then release a compromise archive with a snpSet of perhaps around 5 million common SNPs, taken as a subset of 1000 Genomes.