We are excited to work with @arunkk09 and @shy218! Arun sent me a few questions over email that I thought I'd answer here.
PLINK is the most commonly used genomics toolkit. It's written in C/C++ and primarily used via its command-line interface. It's very fast on a single node but does not scale out. Hail is also quite common. It's written in Scala and Python and uses Apache Spark for execution. It's primarily used as a library and is slow for small tasks on a single node but scales up and out fairly well. scikit-allel is a Python library by @alimanfoo, one of our collaborators, and is effectively the precursor to sgkit. @eric-czech made a Toolkit Comparison sheet that includes a few additional options and attempts to characterize the data representation and methods available in each toolkit. I'd add GCTA to his list, as it's a commonly used library for some of the more complex calculations in our field. @OpenMendel is also interesting as an ambitious collection of Julia packages with a similar aim.
I would suggest we focus on statistical and population genetics first.
@eric-czech's Core operations in human GWAS workloads gives an overview of the core workloads and their performance characteristics. I will try to comment on each of those categories here, with an update on our status.
Summary statistics for QC [O(nm)]
LD estimation / pruning [O(nmd)]
Relatedness estimation / pruning [O(nm^2)]
Variant normalization [O(nm)]
Population structure estimation [O(nmk)]
Association testing [O(nm)]
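To make the first category concrete, here is a minimal NumPy sketch of one QC summary statistic, the per-variant call rate. It is an illustration only, not sgkit's implementation: the function name `variant_call_rate`, the (variants, samples) layout, and the use of -1 for a missing call are all assumptions made for this example. It does a single pass over every entry of the genotype matrix, which is the kind of linear cost the O(nm) label refers to, taking n and m as the two matrix dimensions.

```python
import numpy as np


def variant_call_rate(calls: np.ndarray) -> np.ndarray:
    """Fraction of samples with a non-missing call at each variant.

    `calls` has shape (variants, samples). A value of -1 marks a missing
    call; this encoding is an assumption made for the example.
    """
    # One boolean comparison per matrix entry, then a mean along the sample axis.
    return (calls >= 0).mean(axis=1)


# Toy data: 3 variants x 4 samples, values are alternate-allele counts.
calls = np.array(
    [
        [0, 1, 2, -1],
        [0, 0, 0, 0],
        [-1, -1, 1, 2],
    ]
)
rates = variant_call_rate(calls)
```

Variants whose call rate falls below a chosen threshold would then typically be dropped before later steps; the threshold itself is a per-study choice.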
Our two primary datasets are the UK Biobank (UKB) data, which consists of 500k people with imputed genotypes at around 90M sites and is unfortunately private, and the Ag1000G phase 3 public dataset https://github.com/pystatgen/sgkit/discussions/468. To approximate the UKB data, we've explored a variety of options.
The analysis goals are detailed in https://github.com/pystatgen/sgkit/discussions/481#discussioncomment-430892. I'm happy to provide more motivation for each of the core workloads once you've read through A tutorial on conducting genome-wide association studies: Quality control and statistical analysis (2018). I think you know the software setup fairly well by now: we have implemented a Python library that uses Xarray's data model and API to implement operations on genetic data, Dask to distribute computations, Zarr to serialize data, and occasionally Numba to accelerate computations. I guess the only other library of note is fsspec, which we use to provide a single interface to local and cloud storage. We currently run all of our computations on Google Cloud. I believe @aktech is using Coiled Computing on AWS for some workloads, and our recently implemented microbenchmarks run on GitHub Actions, for now. If you have access to dedicated resources for benchmarking consistency, that might be one way you could help us out!
We'd like to benchmark the performance and scalability of each of our core operations with realistic public data against PLINK and Hail, as a start. Then, we'd like to ensure our performance is near PLINK's and our scalability is near Hail's, and we'd like to run these benchmarks regularly to ensure we don't regress. Finally, it would be fun to pick one workload where we might improve performance beyond the state of the art using something clever from the systems or algorithms literature. Beyond benchmarking, I love that SLAB included an analysis of library ergonomics, and we'd love to hear your suggestions on how we might present an interface to users that can transparently scale out or run on GPUs when necessary.
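As a rough sketch of the "run regularly and don't regress" idea, the snippet below times a workload with the standard library's `timeit` and compares it against a stored baseline. Everything here is hypothetical and for illustration: the helper names `benchmark` and `regressed`, the 20% tolerance, and the toy workload are assumptions, not part of our benchmark suite.

```python
import statistics
import timeit


def benchmark(fn, repeat=5, number=1):
    """Time `fn` several times and return the min and median seconds per call."""
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [t / number for t in timings]
    return {"min": min(per_call), "median": statistics.median(per_call)}


def regressed(current, baseline, tolerance=0.2):
    """True if `current` is slower than `baseline` by more than `tolerance`.

    The 20% default is an arbitrary choice for this example.
    """
    return current > baseline * (1 + tolerance)


def toy_workload():
    # Stand-in for a real operation such as a QC summary statistic.
    return sum(i * i for i in range(10_000))


result = benchmark(toy_workload)
```

Comparing against the minimum rather than the mean tends to be less sensitive to noise from shared runners, which is one reason dedicated hardware would make these numbers easier to trust.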
Summary statistics for QC [O(nm)]: sgkit prototype against PLINK, Hail, and Glow at qc_call_rate_benchmarking.ipynb. sgkit has sampl…