How to support multiple SNP-sets #77

stschiff · 2024-06-14T14:09:48Z

Right now, Poseidon is officially flexible with respect to the SNP set of the genotype files, but the archives are of course not. All public archives currently support only the 1240K SNP set, or the HO-subset of it.

It would be desirable to support more call-sets in the future, in particular as more shotgun sequencing is being done. Our Poseidon-IDs are supposed to be unique per archive, which excludes the option to place multiple packages with different call-sets for the same samples into the same archive. That leaves two basic options to consider:

Option 1: Split archives by call-set to allow call-sets other than 1240K, for example a "Community-Archive 1000G calls" or similar. This does not require any change in the schema and would be straightforward to do with the current infrastructure.
Pros: Simple, non-breaking and in principle immediately doable.
Cons: Meta-data will be duplicated across archives, which causes redundancy and may require complex syncing infrastructure to update Janno-files across archives.
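
To make the syncing concern concrete, here is a minimal sketch of a check that flags samples whose shared metadata has drifted between two archives. Everything in it is an assumption for illustration: the helper names, the in-memory representation of Janno rows as dictionaries keyed by Poseidon-ID, and the example values. Only the call-set-specific column names are taken from this issue.

```python
# Hypothetical sketch: detect drift in duplicated Janno metadata across archives.
# Call-set-specific columns are expected to differ between archives,
# so they are excluded from the comparison.
CALLSET_COLUMNS = {
    "Genotype_Ploidy",
    "Data_Preparation_Pipeline_URL",
    "Nr_SNPs",
    "Coverage_on_Target_SNPs",
}


def shared_metadata(row):
    """Return only the columns that should be identical in every archive."""
    return {k: v for k, v in row.items() if k not in CALLSET_COLUMNS}


def find_drift(archive_a, archive_b):
    """Return the sorted IDs present in both archives whose shared metadata differ."""
    common_ids = archive_a.keys() & archive_b.keys()
    return sorted(
        pid
        for pid in common_ids
        if shared_metadata(archive_a[pid]) != shared_metadata(archive_b[pid])
    )


# Made-up example rows, keyed by Poseidon-ID.
archive_1240k = {
    "S1": {"Country": "Germany", "Nr_SNPs": "600000"},
    "S2": {"Country": "Spain", "Nr_SNPs": "500000"},
}
archive_1000g = {
    "S1": {"Country": "Germany", "Nr_SNPs": "4500000"},
    "S2": {"Country": "Portugal", "Nr_SNPs": "4000000"},
}

print(find_drift(archive_1240k, archive_1000g))
```

A check like this only detects drift. Deciding which archive holds the correct value, and propagating it, is the part that would need the more complex infrastructure.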

Option 2: Extend the schema to allow for multiple genotype-datasets within one package.
This is currently not possible, but is in principle not hard to implement. We would simply allow the YAML-schema to list multiple genotype datasets, each with its own snpset and separate genotype files.
There is one catch: The Janno-File contains several columns that are specific to the call-set (Genotype_Ploidy, Data_Preparation_Pipeline_URL, Nr_SNPs, Coverage_on_Target_SNPs). These can easily be made list-columns, of course, which would be non-breaking.
Pros: Would be a non-redundant solution with respect to package-metadata and Janno-files, as these would not be duplicated.
Cons: Would require some additional implementation in the server and in the trident list and forge functionality. A minimal solution would be to ignore any call-set after the first in trident, and see how we can support secondary call-sets later on. However, at least with fetch, large-scale adoption of hosting multiple call-sets would result in much larger package downloads. So perhaps the server software should be changed to create multiple zip-files for download, with or without secondary call-sets.
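
As a rough illustration of the list-column idea and of the minimal "ignore any call-set after the first" behaviour, consider the sketch below. The column names are the ones listed above. The ";" separator, the helper names and the example values are assumptions for illustration and not part of the current schema or of trident.

```python
# Hypothetical sketch: call-set-specific Janno columns stored as list-columns,
# with one entry per call-set in the same order as the genotype datasets.
CALLSET_COLUMNS = [
    "Genotype_Ploidy",
    "Data_Preparation_Pipeline_URL",
    "Nr_SNPs",
    "Coverage_on_Target_SNPs",
]


def split_callsets(row, sep=";"):
    """Turn one Janno row into a list with one dict per call-set."""
    columns = {c: row[c].split(sep) for c in CALLSET_COLUMNS if row.get(c)}
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        # All filled list-columns must describe the same number of call-sets.
        raise ValueError(f"list-columns differ in length: {sorted(lengths)}")
    n = lengths.pop() if lengths else 0
    return [{c: values[i] for c, values in columns.items()} for i in range(n)]


def primary_callset(row, sep=";"):
    """Minimal behaviour: keep only the first call-set and ignore the rest."""
    callsets = split_callsets(row, sep)
    return callsets[0] if callsets else {}


# Made-up example row with two call-sets.
row = {
    "Poseidon_ID": "Sample1",
    "Genotype_Ploidy": "haploid;diploid",
    "Nr_SNPs": "600000;4500000",
}
print(primary_callset(row))
```

A row with single values, as in today's Janno files, splits into exactly one call-set, which is why this kind of change could be non-breaking.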

stschiff · 2024-09-13T08:13:43Z

OK, we briefly discussed this in our meeting on September 13. Given our limited dev-resources, Option 1 is the more likely route.

One compromise would be to expand Minotaur once Eager3 is out and create additional Pull-Downs, perhaps even with imputation, and then release a compromise archive with a snpSet of perhaps 5 million common SNPs, taken as a subset of 1000 Genomes.
