VariantSpark is currently optimised for reasonably small sample sizes (n=100-5000) and large numbers of variants (e.g. 42 million), i.e. 'wide' datasets. Working on phenotypes in UKBB, e.g. CAD, we have sample sizes of ~50K at our disposal, and VariantSpark has a long run time (~3 days) when dealing with such sample sizes. As we expect genomic cohorts to grow in size, it is worth considering how we can optimise VariantSpark for larger sample sizes (50K plus).
This work is inspiring, great method and deployment model!
You could apply the idea of summarising the samples into a reduced dimension, fitting the model in that reduced sample space, and then applying the learned parameters to predict on the original samples.
This could prove useful for the VariantSpark method that works on millions of features, yet takes longer with tens of thousands of samples.
If you like this idea, I've implemented a method that learns an encoding of the sample space, reduces the samples enough to run a faster and more efficient regression, and then unfolds the prediction so the result applies to the full sample space.
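To make the idea concrete, here is a minimal sketch of the general approach (sketched regression via a random projection along the sample axis), not my actual implementation and not VariantSpark's API. The sizes, the `k_reduced` parameter, and the toy genotype matrix are all illustrative assumptions; VariantSpark itself uses random forests rather than ridge regression, so this only demonstrates the sample-compression step.

```python
# Sketch (assumption, not the commenter's code): compress an
# n_samples x n_variants genotype matrix along the *sample* axis with a
# random sketch matrix S, fit a ridge model on the reduced data, then
# apply the learned coefficients back to the full set of samples.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-in for a genotype matrix (0/1/2 allele counts); a real
# UKBB-scale matrix would be far wider and taller.
n_samples, n_variants = 10_000, 2_000
X = rng.integers(0, 3, size=(n_samples, n_variants)).astype(np.float32)
beta_true = rng.normal(size=n_variants) * (rng.random(n_variants) < 0.01)
y = X @ beta_true + rng.normal(scale=0.5, size=n_samples)

# Encode the sample space: a k x n random projection (sketch) matrix.
k_reduced = 1_000
S = rng.normal(size=(k_reduced, n_samples)).astype(np.float32) / np.sqrt(k_reduced)

# Fit in the reduced sample space: k rows instead of n.
model = Ridge(alpha=1.0)
model.fit(S @ X, S @ y)

# "Unfold": the learned coefficients apply directly to the original samples.
y_hat_full = model.predict(X)
print("correlation with truth:", np.corrcoef(y_hat_full, y)[0, 1])
```

The training cost now scales with `k_reduced` rather than the full sample count, at the price of some approximation error that shrinks as `k_reduced` grows.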
I've tried to run it using the sample 1000 Genomes dataset, but ran into errors when installing the library on my local machine, so unfortunately I can't apply this idea with VariantSpark myself.
If you need help with translating the code to CSV / VCF files, let me know in this issue thread. If it works, let me know here too - would be great to work on this with the team!