New project proposal: bio2zarr #1189
Replies: 3 comments 1 reply
-
I'm a fan of this approach, which lets sgkit focus on doing interesting and useful manipulations and analysis. Separating these out has made me wonder what "name" we give to the sgkit/xarray/zarr file format that this tool would convert to. Naming things and cache invalidation, right? 😆
-
To answer my own question at least partially here, I've compared the time to compute the same operation on both backends: roughly 4 minutes for the zarr version vs ~4.5 minutes on the plink backend, despite the plink backend working with uncompressed data. Looking at the profiles briefly, it looks like the majority of the time is spent massaging the plink-formatted numpy array into the sgkit form with np.where calls. You would make up the 20 minutes needed to do the conversion pretty quickly at that rate...
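For anyone curious, the kind of remapping I mean looks roughly like this (a toy sketch, not sgkit's actual code; I'm assuming the bed-reader int8 convention where -127 marks a missing call):

```python
# Toy sketch of remapping plink-style allele counts into sgkit's call_genotype
# layout; not the real sgkit plink reader, just the shape of the work it does.
import numpy as np

rng = np.random.default_rng(42)
# (variants, samples) alt-allele counts, with -127 as the missing sentinel.
bed = rng.choice(np.array([0, 1, 2, -127], dtype=np.int8), size=(1000, 500))

# Expand into a (variants, samples, ploidy=2) array with -1 for missing alleles:
# 0 -> (0, 0), 1 -> (0, 1), 2 -> (1, 1), missing -> (-1, -1).
a0 = np.where(bed == -127, -1, np.where(bed == 2, 1, 0))
a1 = np.where(bed == -127, -1, np.where(bed >= 1, 1, 0))
call_genotype = np.stack([a0, a1], axis=-1).astype(np.int8)
```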
-
I'm +1 on this approach. Thanks for all your work on this, Jerome! I agree that we should only have one way to convert to Zarr format, and should deprecate the existing VCF to Zarr conversion code once the new code reaches feature parity.
-
What is the problem?
Currently, sgkit's approach to converting files from standard formats (VCF, plink, bgen) is to create an xarray dataset and to use the standard Dask-based methods of writing it out to file (let's not worry about permutations like cloud stores for the moment). For plink and bgen, the dataset is created by making lightweight wrappers around libraries that can read these files directly, so columns can be treated as Dask arrays. For VCF, there is a more explicit chunk-by-chunk conversion process, with the VCF being partitioned into slices, these slices converted to Zarr, and then the zarrs merged back together again.
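For concreteness, here is roughly what that current path looks like (a minimal sketch using sgkit's documented entry points; the VCF path additionally needs the cyvcf2 extra, and all paths and details here are placeholders):

```python
# Illustrative sketch of the current Dask-backed conversion path.
import sgkit
from sgkit.io.plink import read_plink
from sgkit.io.vcf import vcf_to_zarr

# plink/bgen: a lightweight wrapper exposes the files as a Dask-backed dataset,
# which is then written out with the generic Dask machinery.
ds = read_plink(path="data/cohort")    # columns are lazy Dask arrays
sgkit.save_dataset(ds, "cohort.zarr")  # Dask does the actual IO

# VCF: partitioned into slices, each slice converted to Zarr, then merged.
vcf_to_zarr("data/cohort.vcf.gz", "cohort_vcf.zarr")
```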
Unfortunately, the `x_to_zarr` methods don't work very well. I am a downstream user of sgkit, and urgently need to get VCFs for very large datasets into sgkit format. My group have tried repeatedly to do this, but for all but the shortest chromosomes we have failed to get the process to finish, with only cryptic Dask failures to go on, and we have resorted to essentially pulling tuning parameters out of a hat to try to get it to work. Some related issues:

Converting plink to zarr at scale doesn't seem to work very well either: sgkit-dev/sgkit-publication#78 (comment)
The fundamental problem here is that these large, data-bound conversion processes are a poor fit for Dask's design. Because of this, you cannot actually get them to work without really understanding Dask and being able to debug Dask problems. So, to even start using sgkit, a user must first be a Dask expert: to make any progress on their conversion they must know how to use the Dask dashboard and be able to interpret the task streams.
This is unacceptable, and will ensure that the vast majority of potential users never get over the first hurdle. There must be a gentler learning curve for starting with Dask, where users begin with the things it's actually good at, like distributed array operations.
Proposed solution
In a desperate effort to get VCF conversion working at all, I have ended up writing a concurrent.futures-based approach in #1185. This works really well, and is ingesting VCFs at the largest scale in a reliable, predictable and feedback-oriented way. Out of interest, I also put together a quick converter for plink (pystatgen/sgkit@f45089e; just the genotypes, but the other fields should be trivial compute-wise). This took very little code, and is immediately efficient (about 20 minutes for a million-sample chr21 file, vs the Dask-based method failing cryptically after an hour).
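To give a flavour of the approach (this is just a sketch of the shape of it, not the code in #1185; the partition "reader" here is simulated with numpy): partition the input up front, convert each partition in a worker process with bounded memory, and have each worker write its own chunk-aligned slice of the output Zarr directly, reporting progress as futures complete.

```python
# Minimal sketch of a concurrent.futures-style conversion loop.
import concurrent.futures as cf
import numpy as np
import zarr

ZARR_PATH = "call_genotype.zarr"
N_VARIANTS, N_SAMPLES, PLOIDY = 10_000, 1_000, 2
PARTITION = 1_000  # variants per partition, aligned with the Zarr chunking


def convert_partition(start, stop):
    # Stand-in for "read this slice of the source VCF/plink file and massage
    # it into sgkit's layout"; here we just generate random genotype calls.
    rng = np.random.default_rng(start)
    data = rng.integers(0, 2, size=(stop - start, N_SAMPLES, PLOIDY), dtype="i1")
    # Each worker opens the store itself and writes only its own chunk-aligned
    # slice, so per-worker memory is bounded and writes never overlap.
    arr = zarr.open(ZARR_PATH, mode="r+")
    arr[start:stop] = data
    return stop - start


if __name__ == "__main__":
    zarr.open(
        ZARR_PATH,
        mode="w",
        shape=(N_VARIANTS, N_SAMPLES, PLOIDY),
        chunks=(PARTITION, N_SAMPLES, PLOIDY),
        dtype="i1",
    )
    partitions = [
        (s, min(s + PARTITION, N_VARIANTS)) for s in range(0, N_VARIANTS, PARTITION)
    ]
    with cf.ProcessPoolExecutor() as executor:
        futures = [executor.submit(convert_partition, a, b) for a, b in partitions]
        done = 0
        for fut in cf.as_completed(futures):
            done += fut.result()
            print(f"converted {done}/{N_VARIANTS} variants")  # progress feedback
```

The key property is that memory usage and concurrency are explicit and predictable: each partition is a fixed-size unit of work, and failures surface as ordinary exceptions from the futures rather than as opaque scheduler behaviour.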
So, my suggestion is that we develop a sgkit-dev/bio2zarr package, focused on delivering these conversion utilities. Initially it would cover the three formats supported in sgkit, but I would like to present it as open to contributions for other formats (FASTA, for example). If we succeed in getting people to contribute, then I think we could write an application-note-type paper about it in a few years (with all contributors being authors, obviously).

Form-factor-wise, bio2zarr would have top-level functions `vcf_to_zarr`, `plink_to_zarr`, etc., and would also ship corresponding CLIs (vcf2zarr, etc.). Guiding principles are reliability, modularity, fine-grained control over the final zarr, predictability of memory usage, and fine-grained progress feedback.
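To make the form factor concrete, here is a purely hypothetical sketch of what usage might look like; none of these parameters or flags exist yet, and the details would be worked out in the new repo:

```python
# Hypothetical usage of the proposed package; function arguments and CLI flags
# below are illustrative only, not an existing API.
import bio2zarr

# Control over the output Zarr, memory usage and progress reporting would be
# explicit parameters rather than something tuned implicitly.
bio2zarr.vcf_to_zarr(
    "cohort.vcf.gz",
    "cohort.vcf.zarr",
    worker_processes=8,         # hypothetical: bound concurrency explicitly
    variants_per_chunk=10_000,  # hypothetical: control the output chunking
    show_progress=True,         # hypothetical: fine-grained progress feedback
)

bio2zarr.plink_to_zarr("cohort.bed", "cohort.plink.zarr")

# Each converter would also ship as a CLI, e.g. (hypothetical invocation):
#   vcf2zarr cohort.vcf.gz cohort.vcf.zarr --worker-processes=8 --progress
```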
What about sgkit's IO package?
That's a good question, and I think the answer should be guided by how well the current shim-over-library-plus-Dask approach really works at scale. Have we tried doing some basic operations, like `compute_variant_stats`, on non-zarr-backed datasets at a decent scale? What's the performance difference between them? Are they equally reliable?
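For concreteness, the kind of comparison I have in mind would be something like the following sketch (using `variant_stats` as a stand-in for the operation, with placeholder paths; a real benchmark would need more care):

```python
# Rough sketch of comparing a basic operation on a plink-backed vs a
# zarr-backed dataset; not a rigorous benchmark.
import time

import sgkit
from sgkit.io.plink import read_plink


def timed_variant_stats(ds, label):
    start = time.perf_counter()
    sgkit.variant_stats(ds).compute()  # force the Dask graph to execute
    print(f"{label}: {time.perf_counter() - start:.1f}s")


# Shim-over-library approach: dataset backed directly by the plink files.
timed_variant_stats(read_plink(path="data/cohort"), "plink-backed")

# Zarr-backed approach: dataset converted to Zarr up front.
timed_variant_stats(sgkit.load_dataset("data/cohort.zarr"), "zarr-backed")
```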
I think it's pretty clear that the `x_to_zarr` functions don't really work. Personally I think having a single clear message for how to get started working with sgkit would be a major improvement: first convert your data using bio2zarr. That's a much simpler message to communicate than "well, if your dataset is fairly small, you can probably use `read_x`, which means you don't have to convert it, but if you're getting cryptic errors or things are taking a long time you should probably just go ahead and convert to zarr before posting any questions on our discussion group".