New project proposal: bio2zarr #1189
Replies: 3 comments 1 reply
-
I'm a fan of this approach, which lets sgkit focus on doing interesting and useful manipulations and analysis. Separating these out has made me wonder what "name" we give to the sgkit/xarray/zarr file format that this tool would convert to. Naming things and cache invalidation, right? 😆
-
To answer my own question at least partially here, I've compared the time to compute the same operation on both backends: roughly 4 minutes for the zarr version vs ~4.5 minutes on the plink backend, despite the plink backend working with uncompressed data. Looking at the profiles briefly, it looks like the majority of the time is spent massaging the plink-formatted numpy array into the sgkit form with np.where calls. You would make up the 20 minutes needed to do the conversion pretty quickly at that rate...
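For anyone curious, the kind of remapping I mean looks roughly like this (a toy sketch, not sgkit's actual code; I'm assuming the bed-reader int8 convention where -127 marks a missing call):

```python
# Toy sketch of remapping plink-style allele counts into sgkit's call_genotype
# layout; not the real sgkit plink reader, just the shape of the work it does.
import numpy as np

rng = np.random.default_rng(42)
# (variants, samples) alt-allele counts, with -127 as the missing sentinel.
bed = rng.choice(np.array([0, 1, 2, -127], dtype=np.int8), size=(1000, 500))

# Expand into a (variants, samples, ploidy=2) array with -1 for missing alleles:
# 0 -> (0, 0), 1 -> (0, 1), 2 -> (1, 1), missing -> (-1, -1).
a0 = np.where(bed == -127, -1, np.where(bed == 2, 1, 0))
a1 = np.where(bed == -127, -1, np.where(bed >= 1, 1, 0))
call_genotype = np.stack([a0, a1], axis=-1).astype(np.int8)
```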
-
I'm +1 on this approach. Thanks for all your work on this, Jerome! I agree that we should only have one way to convert to Zarr format, and should deprecate the existing VCF to Zarr conversion code once the new code reaches feature parity.
-
What is the problem?
Currently, sgkit's approach to converting files from standard formats (VCF, plink, bgen) is to create an xarray dataset and to use the standard Dask-based methods of writing it out to file (let's not worry about permutations like cloud stores for the moment). For plink and bgen, the dataset is created by making lightweight wrappers around libraries that can read these files directly, so columns can be treated as Dask arrays. For VCF, there is a more explicit chunk-by-chunk conversion process, with the VCF being partitioned into slices, these slices converted to Zarr, and then the zarrs merged back together again.
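For concreteness, here is roughly what that current path looks like (a minimal sketch using sgkit's documented entry points; the VCF path additionally needs the cyvcf2 extra, and all paths and details here are placeholders):

```python
# Illustrative sketch of the current Dask-backed conversion path.
import sgkit
from sgkit.io.plink import read_plink
from sgkit.io.vcf import vcf_to_zarr

# plink/bgen: a lightweight wrapper exposes the files as a Dask-backed dataset,
# which is then written out with the generic Dask machinery.
ds = read_plink(path="data/cohort")    # columns are lazy Dask arrays
sgkit.save_dataset(ds, "cohort.zarr")  # Dask does the actual IO

# VCF: partitioned into slices, each slice converted to Zarr, then merged.
vcf_to_zarr("data/cohort.vcf.gz", "cohort_vcf.zarr")
```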
Unfortunately, the `x_to_zarr` methods don't work very well. I am a downstream user of sgkit, and urgently need to get VCFs for very large datasets into sgkit format. My group have tried repeatedly to do this, but for all but the shortest chromosomes we have failed to get the process to finish, with only cryptic Dask failures to go on, and we have resorted to essentially pulling tuning parameters out of a hat to try to get it to work. Some related issues:

Converting plink to zarr at scale doesn't seem to work very well either: sgkit-dev/sgkit-publication#78 (comment)
The fundamental problem here is that these large, data-bound conversion processes are a poor fit for Dask's design. Because of this, you cannot actually get them to work without really understanding Dask and being able to debug Dask problems. So, to even start using sgkit, a user must first be a Dask expert: to make any progress on their conversion they must know how to use the Dask dashboard and be able to interpret the task streams.
This is unacceptable, and will ensure that the vast majority of potential users never get over the first hurdle. There must be a gentler learning curve for starting with Dask, where users begin with the things it's actually good at, like distributed array operations.
Proposed solution
In a desperate effort to get VCF conversion working at all, I have ended up writing a concurrent.futures-based approach in #1185. This works really well, and is ingesting VCFs at the largest scale in a reliable, predictable and feedback-oriented way. Out of interest, I also put together a quick converter for plink (pystatgen/sgkit@f45089e; just the genotypes, but the other fields should be trivial compute-wise). This took very little code, and is immediately efficient (about 20 minutes for a million-sample chr21 file, vs the Dask-based method failing cryptically after an hour).
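To give a flavour of the approach (this is just a sketch of the shape of it, not the code in #1185; the partition "reader" here is simulated with numpy): partition the input up front, convert each partition in a worker process with bounded memory, and have each worker write its own chunk-aligned slice of the output Zarr directly, reporting progress as futures complete.

```python
# Minimal sketch of a concurrent.futures-style conversion loop.
import concurrent.futures as cf
import numpy as np
import zarr

ZARR_PATH = "call_genotype.zarr"
N_VARIANTS, N_SAMPLES, PLOIDY = 10_000, 1_000, 2
PARTITION = 1_000  # variants per partition, aligned with the Zarr chunking


def convert_partition(start, stop):
    # Stand-in for "read this slice of the source VCF/plink file and massage
    # it into sgkit's layout"; here we just generate random genotype calls.
    rng = np.random.default_rng(start)
    data = rng.integers(0, 2, size=(stop - start, N_SAMPLES, PLOIDY), dtype="i1")
    # Each worker opens the store itself and writes only its own chunk-aligned
    # slice, so per-worker memory is bounded and writes never overlap.
    arr = zarr.open(ZARR_PATH, mode="r+")
    arr[start:stop] = data
    return stop - start


if __name__ == "__main__":
    zarr.open(
        ZARR_PATH,
        mode="w",
        shape=(N_VARIANTS, N_SAMPLES, PLOIDY),
        chunks=(PARTITION, N_SAMPLES, PLOIDY),
        dtype="i1",
    )
    partitions = [
        (s, min(s + PARTITION, N_VARIANTS)) for s in range(0, N_VARIANTS, PARTITION)
    ]
    with cf.ProcessPoolExecutor() as executor:
        futures = [executor.submit(convert_partition, a, b) for a, b in partitions]
        done = 0
        for fut in cf.as_completed(futures):
            done += fut.result()
            print(f"converted {done}/{N_VARIANTS} variants")  # progress feedback
```

The key property is that memory usage and concurrency are explicit and predictable: each partition is a fixed-size unit of work, and failures surface as ordinary exceptions from the futures rather than as opaque scheduler behaviour.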
So, my suggestion is that we develop a sgkit-dev/bio2zarr package, focused on delivering these conversion utilities. Initially it would cover the three formats supported in sgkit, but I would like to present it as open to contributions for other formats (FASTA, for example). If we succeed in getting people to contribute, then I think we could write an application-note-type paper about it in a few years (with all contributors being authors, obviously).

Form-factor-wise, bio2zarr would have top-level functions `vcf_to_zarr`, `plink_to_zarr`, etc., and would also ship corresponding CLIs (vcf2zarr, etc.). Guiding principles are reliability, modularity, fine-grained control over the final zarr, predictability of memory usage, and fine-grained progress feedback.
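To make the form factor concrete, here is a purely hypothetical sketch of what usage might look like; none of these parameters or flags exist yet, and the details would be worked out in the new repo:

```python
# Hypothetical usage of the proposed package; function arguments and CLI flags
# below are illustrative only, not an existing API.
import bio2zarr

# Control over the output Zarr, memory usage and progress reporting would be
# explicit parameters rather than something tuned implicitly.
bio2zarr.vcf_to_zarr(
    "cohort.vcf.gz",
    "cohort.vcf.zarr",
    worker_processes=8,         # hypothetical: bound concurrency explicitly
    variants_per_chunk=10_000,  # hypothetical: control the output chunking
    show_progress=True,         # hypothetical: fine-grained progress feedback
)

bio2zarr.plink_to_zarr("cohort.bed", "cohort.plink.zarr")

# Each converter would also ship as a CLI, e.g. (hypothetical invocation):
#   vcf2zarr cohort.vcf.gz cohort.vcf.zarr --worker-processes=8 --progress
```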
What about sgkit's IO package?
That's a good question, and I think the answer should be guided by how well the current shim-over-library-plus-Dask approach really works at scale. Have we tried doing some basic operations, like `compute_variant_stats`, on non-zarr-backed datasets at a decent scale? What's the performance difference between them? Are they equally reliable?
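For concreteness, the kind of comparison I have in mind would be something like the following sketch (using `variant_stats` as a stand-in for the operation, with placeholder paths; a real benchmark would need more care):

```python
# Rough sketch of comparing a basic operation on a plink-backed vs a
# zarr-backed dataset; not a rigorous benchmark.
import time

import sgkit
from sgkit.io.plink import read_plink


def timed_variant_stats(ds, label):
    start = time.perf_counter()
    sgkit.variant_stats(ds).compute()  # force the Dask graph to execute
    print(f"{label}: {time.perf_counter() - start:.1f}s")


# Shim-over-library approach: dataset backed directly by the plink files.
timed_variant_stats(read_plink(path="data/cohort"), "plink-backed")

# Zarr-backed approach: dataset converted to Zarr up front.
timed_variant_stats(sgkit.load_dataset("data/cohort.zarr"), "zarr-backed")
```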
I think it's pretty clear that the `x_to_zarr` functions don't really work. Personally I think having a single clear message for how to get started working with sgkit would be a major improvement: first convert your data using bio2zarr. That's a much simpler message to communicate than "well, if your dataset is fairly small, you can probably use `read_x`, which means you don't have to convert it, but if you're getting cryptic errors or things are taking a long time you should probably just go ahead and convert to zarr before posting any questions on our discussion group".