Questions for UCSD collaboration #481

hammer · 2021-03-04T19:55:56Z

hammer
Mar 4, 2021
Maintainer

We are excited to work with @arunkk09 and @shy218! Arun sent me a few questions over email that I thought I'd answer here.

A deepdive of your datasets, analysis goals, and software/hardware setup. A demo of some sort will help us here.
Where you are today with your tool status on scalability, efficiency, etc.? What are the biggest bottlenecks?
Where do you want to be in the near term and long terms? What are your priority technical items to get there?
What are the reasonable open-source comparators to sgkit for the workloads/applications that it targets?
Do we aim for genomics first? Or look more broadly to cover more domains that use Xarray, Zarr, and Dask now itself (astronomy, etc.)?

hammer · 2021-03-04T20:24:59Z

hammer
Mar 4, 2021
Maintainer Author

What are the reasonable open-source comparators to sgkit for the workloads/applications that it targets?

PLINK is the most commonly used genomics toolkit. It's written in C/C++ and primarily used via its command-line interface. It's very fast on a single node but does not scale out.

Hail is also quite common. It's written in Scala and Python and uses Apache Spark for execution. It's primarily used as a library and is slow for small tasks on a single node but scales up and out fairly well.

scikit-allel is a Python library by @alimanfoo, one of our collaborators, and is effectively the precursor to sgkit.

@eric-czech made a Toolkit Comparison sheet that includes a few additional options and attempts to characterize the data representation and methods available in each toolkit.

I'd add GCTA to his list, as it's a commonly used library for some of the more complex calculations in our field. @OpenMendel is also interesting as an ambitious collection of Julia packages with a similar aim.

0 replies

hammer · 2021-03-04T20:29:29Z

hammer
Mar 4, 2021
Maintainer Author

Do we aim for genomics first? Or look more broadly to cover more domains that use Xarray, Zarr, and Dask now itself (astronomy, etc.)?

I would suggest we focus on statistical and population genetics first.

0 replies

hammer · 2021-03-04T21:05:58Z

hammer
Mar 4, 2021
Maintainer Author

Where you are today with your tool status on scalability, efficiency, etc.? What are the biggest bottlenecks?

@eric-czech's Core operations in human GWAS workloads gives an overview of the core workloads and their performance characteristics. I will try to comment on each of those categories here, with an update on our status.

Summary statistics for QC [O(nm)]

Historical note: poor Hail performance for this workload, as noted by @eric-czech in January 2020 at Poor performance for QC filtering on medium sized genotype data, is part of what led us to create sgkit.
@eric-czech benchmarked our sgkit prototype against PLINK, Hail, and Glow at qc_call_rate_benchmarking.ipynb
sgkit has sample_stats and variant_stats implemented in aggregation.py.
We use Numba decorators to accelerate this code.
Performance is mostly fine.
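
To make the O(nm) cost of this category concrete, here is a minimal NumPy sketch of per-variant and per-sample call rates. It is not sgkit's `sample_stats` or `variant_stats` code and it does not use Numba; the array layout (variants x samples x ploidy, with a negative allele marking a missing call) and the function name `call_rates` are assumptions made for illustration.

```python
from typing import Tuple

import numpy as np


def call_rates(call_genotype: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Return (per-variant, per-sample) call rates.

    call_genotype has shape (variants, samples, ploidy). A negative allele
    value marks a missing call (an assumption of this sketch).
    """
    # A call counts as present only if every allele in it is non-missing.
    called = (call_genotype >= 0).all(axis=-1)
    variant_call_rate = called.mean(axis=1)  # fraction of samples called, per variant
    sample_call_rate = called.mean(axis=0)  # fraction of variants called, per sample
    return variant_call_rate, sample_call_rate


# Tiny example: 3 variants, 2 samples, diploid.
gt = np.array(
    [
        [[0, 1], [1, 1]],
        [[-1, -1], [0, 0]],
        [[0, 0], [-1, 1]],
    ]
)
variant_cr, sample_cr = call_rates(gt)
```

Every genotype call is touched once, which is where the O(nm) bound comes from.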

LD estimation / pruning [O(nmd)]

We put a lot of effort into this one in our prototype: PyData prototype LD prune implementation related-sciences/gwas-analysis#26.
At that time we found GPUs made this workload faster: ld_matrix.ipynb.
Bringing this implementation over to sgkit is blocked right now on some work we'd like to do on the windowing API: https://github.com/pystatgen/sgkit/issues/31.
I think performance on the GPU is pretty good, though scalability is an issue.
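
For readers new to this workload, the O(nmd) shape can be sketched as follows: each of the n variants is compared with its next d neighbours, and each comparison is a dot product over m samples. This is a plain NumPy illustration, not the prototype's GPU implementation or a proposed windowing API; the function name `windowed_r2`, the dosage layout (variants x samples), and the reading of n as variants and m as samples are assumptions.

```python
import numpy as np


def windowed_r2(dosage: np.ndarray, window: int) -> np.ndarray:
    """Squared Pearson correlation of each variant with its next `window` neighbours.

    dosage has shape (variants, samples). Entry [i, j] of the result is r^2
    between variant i and variant i + j + 1, or NaN when that neighbour is
    past the end of the array.
    """
    n_variants = dosage.shape[0]
    # Standardise each variant so a dot product over samples is the correlation.
    centered = dosage - dosage.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered**2).sum(axis=1, keepdims=True))
    z = centered / norms  # assumes no zero-variance (monomorphic) variants
    r2 = np.full((n_variants, window), np.nan)
    for offset in range(1, min(window, n_variants - 1) + 1):
        # One dot product of length m per variant pair: O(n * m) per offset.
        r = (z[:-offset] * z[offset:]).sum(axis=1)
        r2[: n_variants - offset, offset - 1] = r**2
    return r2


dosage = np.array(
    [
        [0.0, 1.0, 2.0, 1.0],
        [0.0, 1.0, 2.0, 1.0],  # identical to variant 0
        [2.0, 1.0, 0.0, 1.0],  # perfectly anti-correlated with variant 0
        [0.0, 2.0, 0.0, 2.0],  # uncorrelated with variant 2
    ]
)
r2 = windowed_r2(dosage, window=2)
```

Pruning would then drop one variant from each pair whose r^2 exceeds a threshold; that step is omitted here.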

Relatedness estimation / pruning [O(nm^2)]

Implemented in sgkit by @ravwojdyla: https://github.com/pystatgen/sgkit/issues/24
There are many performance details in the linked issue, including a forum post which provides a high-level overview.
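
As a rough picture of why this category is quadratic in the number of samples, here is a NumPy sketch of a standard genetic relationship matrix. It is a generic textbook estimator and not necessarily the method implemented in the linked issue; the function name, the diploid dosage encoding in {0, 1, 2}, and the reading of n as variants and m as samples are assumptions.

```python
import numpy as np


def genetic_relationship_matrix(dosage: np.ndarray) -> np.ndarray:
    """Samples x samples relatedness from a (variants, samples) dosage matrix."""
    n_variants = dosage.shape[0]
    # Alternate allele frequency per variant for diploid dosages.
    freq = dosage.mean(axis=1, keepdims=True) / 2.0
    # Standardise each variant; assumes 0 < freq < 1 for every variant.
    z = (dosage - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    # (m x n) @ (n x m): this product is the O(n * m^2) step.
    return z.T @ z / n_variants


dosage = np.array(
    [
        [0.0, 1.0, 2.0],
        [2.0, 1.0, 0.0],
        [1.0, 1.0, 0.0],
    ]
)
grm = genetic_relationship_matrix(dosage)
```

The output is m x m, so memory as well as compute grows quadratically with the number of samples.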

Variant normalization [O(nm)]

I don't think we've contemplated this work yet in sgkit.

Population structure estimation [O(nmk)]

We have an umbrella issue for this workload: https://github.com/pystatgen/sgkit/issues/226.
PCA is a performance and scalability bottleneck here.
The PCA PR https://github.com/pystatgen/sgkit/pull/262 links out to the many upstream fixes we made in Dask.
@aktech has done work on pairwise distance computations in https://github.com/pystatgen/sgkit/issues/241, but those computations are O(nm^2), not O(nmk), I think.
@aktech is now working to get the pairwise distance computations running on the GPU at https://github.com/pystatgen/sgkit/issues/338.
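
For orientation, PCA on genotype data reduces to an SVD of the centred dosage matrix. The NumPy sketch below computes a full thin SVD and keeps the top k components, which is simpler but more expensive than the truncated or randomized solvers that give the O(nmk) cost. It is not the code from the linked PR; the function name and the (variants x samples) layout are assumptions.

```python
import numpy as np


def top_k_pcs(dosage: np.ndarray, k: int) -> np.ndarray:
    """Project samples onto the top-k principal components.

    dosage has shape (variants, samples); the result has shape (samples, k).
    """
    # Centre each variant across samples.
    centered = dosage - dosage.mean(axis=1, keepdims=True)
    # centered = U @ diag(s) @ Vt, so the sample scores are V * s.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T * s[:k]


# Two groups of samples (0-1 and 2-3) that differ at three of four variants.
dosage = np.array(
    [
        [0.0, 0.0, 2.0, 2.0],
        [0.0, 0.0, 2.0, 2.0],
        [2.0, 2.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
    ]
)
scores = top_k_pcs(dosage, k=2)
```

In this toy input the first component separates the two groups of samples; the sign of each component is arbitrary.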

Association testing [O(nm)]

You've seen https://github.com/pystatgen/sgkit/issues/390, which covers the performance issues we've encountered when running a simple linear model over a very large data set.
Trait-at-a-time linear mixed models are commonly used in this space but we don't yet have an implementation of a method to compare to SAIGE (2018), BOLT-LMM (2015), GEMMA (2012), or FaST-LMM (2011).
@eric-czech jumped ahead to a multi-trait linear mixed (I think?) model called REGENIE: https://github.com/pystatgen/sgkit/issues/50. For more discussion of REGENIE, see our forum post.
I think performance is now okay for the linear model but there's still room for improvement. For REGENIE I don't think we can yet run it at biobank scale, but we'll have to wait for @eric-czech to weigh in on that one.
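
The simple linear model mentioned above can be sketched in a few lines of NumPy: project the covariates out of both the trait and every variant, then run a one-variable regression per variant. This is an illustration only, not sgkit's implementation; the function name `gwas_linear`, the input layout, and the simulated data are all assumptions.

```python
import numpy as np


def gwas_linear(dosage: np.ndarray, covariates: np.ndarray, trait: np.ndarray):
    """Per-variant OLS effect sizes and t-statistics.

    dosage: (variants, samples); covariates: (samples, p) including an
    intercept column; trait: (samples,).
    """
    n_samples, n_cov = covariates.shape
    # Orthonormal basis for the covariates. Projecting it out of the trait and
    # the genotypes matches fitting them jointly (Frisch-Waugh-Lovell theorem).
    q, _ = np.linalg.qr(covariates)
    y_res = trait - q @ (q.T @ trait)
    g_res = dosage - (dosage @ q) @ q.T
    gy = g_res @ y_res  # one dot product over samples per variant: O(n * m)
    gg = (g_res**2).sum(axis=1)
    beta = gy / gg
    # Residual sum of squares once the variant is added to the model.
    rss = y_res @ y_res - beta * gy
    dof = n_samples - n_cov - 1
    t_stat = beta / np.sqrt(rss / dof / gg)
    return beta, t_stat


# Simulated data: 3 variants, 50 samples, with variant 1 affecting the trait.
rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=(3, 50)).astype(float)
covariates = np.column_stack([np.ones(50), rng.normal(size=50)])
trait = 0.5 * dosage[1] + rng.normal(size=50)
beta, t_stat = gwas_linear(dosage, covariates, trait)
```

Each `beta[i]` equals the genotype coefficient from a full least-squares fit of the trait on the covariates plus variant i, computed without solving a separate system per variant.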

3 replies

eric-czech Apr 12, 2021
Maintainer

For association testing, this is also a useful resource: https://nbviewer.jupyter.org/gist/eric-czech/bdff3402d27b5cccd7d8aacab0957a93

This notebook simulates data to test for associations and then runs a minimal Dask-only GWAS on it. It may be a clearer representation of the linear algebra operations at work than reading through the rest of the sgkit codebase.

hammer Apr 12, 2021
Maintainer Author

Might want to @-mention people; I'm not sure this comment will show up otherwise.

hammer Apr 12, 2021
Maintainer Author

@arunkk09 @shy218 FYI ☝️

hammer · 2021-03-04T21:39:27Z

hammer
Mar 4, 2021
Maintainer Author

A deepdive of your datasets, analysis goals, and software/hardware setup. A demo of some sort will help us here.

Our two primary datasets are the UK Biobank (UKB) data, which consists of 500k people with imputed genotypes at around 90M sites and is unfortunately private, and the Ag1000G phase 3 public dataset https://github.com/pystatgen/sgkit/discussions/468.

To approximate the UKB data, we've explored a variety of options.

@eric-czech put a notebook up at https://github.com/pystatgen/sgkit/issues/438 that will generate some data and run a GWAS on the generated data.
Hail's GWAS Tutorial uses a subset of the 1000 Genomes data (found at https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz) together with some fake phenotypes (found at https://storage.googleapis.com/hail-tutorial/1kg_annotations.txt).
For the sgkit prototype we used rice and dog data.

The analysis goals are detailed in https://github.com/pystatgen/sgkit/discussions/481#discussioncomment-430892. I'm happy to provide more motivation for each of the core workloads once you've read through A tutorial on conducting genome-wide association studies: Quality control and statistical analysis (2018).

I think you know the software setup fairly well by now: we have implemented a Python library that uses Xarray's data model and API to implement operations on genetic data, Dask to distribute computations, Zarr to serialize data, and occasionally Numba to accelerate computations. I guess the only other library of note is fsspec, which we use to provide a single interface to local and cloud storage.

We currently run all of our computations on Google Cloud. I believe @aktech is using Coiled Computing on AWS for some workloads, and our recently implemented microbenchmarks run on GitHub Actions, for now. If you have access to dedicated resources for benchmarking consistency, that might be one way you could help us out!

0 replies

hammer · 2021-03-04T21:44:09Z

hammer
Mar 4, 2021
Maintainer Author

Where do you want to be in the near term and long terms? What are your priority technical items to get there?

We'd like to benchmark the performance and scalability of each of our core operations with realistic public data against PLINK and Hail, as a start. Then, we'd like to ensure our performance is near PLINK's and our scalability is near Hail's, and we'd like to run these benchmarks regularly so that we ensure we don't regress. Finally, it would be fun to pick one workload where we might improve performance beyond the state of the art using something clever from the systems or algorithms literature.

Beyond benchmarking, I love that SLAB included an analysis of library ergonomics, and we'd love to hear your suggestions on how we might present an interface to users that can transparently scale out or run on GPUs when necessary.

0 replies
