Hi. I first tried vcf_to_zarr directly, but it fills my 512 GB of memory and gets killed.
I imagine this is due to the rechunking step being memory-intensive. So I thought I understood that the way to go was to use
This is running right now but is already using more than 200 GB and growing. What am I missing here? Is there a way to reduce the memory requirement of the conversion? What is the best way to obtain an sgkit-formatted dataset from VCF for large data? Thanks a lot!
Hi @alxsimon - thanks for raising this issue.
The vcf_to_zarr function partitions the VCF (or BCF in this case) into a set of contiguous regions that cover the whole file, and then writes an intermediate Zarr file for each partition, in parallel (using Dask). The intermediate Zarr files are concatenated, rechunked, then written to the final output (again using Dask). We have had problems in the past with this rechunking step running out of memory, which is what seems to be happening in this case.
The underlying Dask issue has never been fixed (see dask/dask#6745), but we do have some (internal) code in sgkit that avoids this memory problem: https://github.com/pystatgen/sgkit/blob/718fe3b58b3da2d231bbfa6330a88bf209068c92/sgkit/io/utils.py#L111-L148
Unfortunately that code is not hooked up yet; I'm going to take a look at hooking it up. In the meantime, consider this call:
vcf_to_zarr(in_vcf_wgs, out_zarr, target_part_size=None, ...)
This will avoid writing the intermediate Zarr files, and hence shouldn't hit any memory issues due to rechunking, but it will not be able to take advantage of multiple CPUs on your machine.
Hope all of that makes sense! Please let us know how you get on or if you have any more questions.
BTW the reason that your second snippet of code didn't help is that…
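To make the partition, concatenate and rechunk steps described above a bit more concrete, here is a toy, pure-Python sketch. It is not sgkit's or Dask's implementation: the names partition_regions, read_partition and rechunk_streaming are hypothetical, and plain lists of integers stand in for VCF records and Zarr chunks. It only illustrates why a streaming rechunk can keep memory bounded, in contrast to materialising the whole concatenated array first.

```python
from typing import Iterable, Iterator, List


def partition_regions(n_variants: int, part_size: int) -> List[range]:
    """Split [0, n_variants) into contiguous regions that cover everything."""
    return [
        range(start, min(start + part_size, n_variants))
        for start in range(0, n_variants, part_size)
    ]


def read_partition(region: range) -> List[int]:
    """Hypothetical stand-in for parsing one VCF region into an intermediate store."""
    return list(region)


def rechunk_streaming(
    parts: Iterable[List[int]], chunk_length: int
) -> Iterator[List[int]]:
    """Yield fixed-size chunks from a stream of variable-size partitions.

    The buffer never holds more than one partition plus one partial chunk,
    so peak memory does not grow with the total number of variants.
    """
    buffer: List[int] = []
    for part in parts:
        buffer.extend(part)
        # Emit every complete output chunk currently available.
        while len(buffer) >= chunk_length:
            yield buffer[:chunk_length]
            buffer = buffer[chunk_length:]
    if buffer:
        # The final chunk may be shorter than chunk_length.
        yield buffer


regions = partition_regions(n_variants=10, part_size=4)
# A generator expression keeps partitions lazy: one is read at a time.
chunks = list(rechunk_streaming((read_partition(r) for r in regions), chunk_length=3))
print(regions)
print(chunks)
```

In this toy version the partitions are consumed one after another, which is the same trade-off mentioned above: bounded memory, but no parallelism across partitions.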