Xarray conventions for variation data #695
Replies: 12 comments
-
(Posted by @eric-czech) My two cents is that two good examples of approaches at the extremes are Hail and what you did with scikit-allel. In Hail, they basically pass around a data structure that looks a lot like that Zarr group to every single function (though partitioning by chromosome isn't part of it). I liked what you did in scikit-allel better, though, because it separates data validation and projection/extraction from the method implementations -- i.e., you don't have to worry about what you're given in a method; you can just assume whatever should be true for the structure. One possible sequence we could go through in workflows would be:
Distilling that a bit:
I know I kept using the variable name "data" as the sort of primary array in my prototype examples. As far as grouping by contig goes, I've been using something like this for any per-chromosome computations:

# Get the number of variants per contig
chunks = np.unique(ds.contig, return_counts=True)[1]
# Rechunk all variables in the Dataset at contig boundaries
dsc = ds.chunk(dict(variant=tuple(chunks), sample=ds.dims['sample']))

so then all the variables end up chunked like that for 24 contigs. What I haven't figured out yet is whether it's possible to get the dataset representing a single contig without doing a slice on global indexes. I'm also not sure whether open_zarr would automatically load into chunks like that if stored as one group per contig. Have you ever tried that? Thank you as usual for all the details!
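For illustration, here is that rechunking pattern on a toy dataset, plus one possible answer to the single-contig question: select by a boolean condition on the contig variable rather than by global index offsets. This is only a sketch (variable and dimension names follow the prototype; the dask-backed chunking call is shown commented out since it needs dask installed).

```python
import numpy as np
import xarray as xr

# Toy dataset shaped like the prototype: a per-variant contig index.
ds = xr.Dataset(
    {
        "data": (("variant", "sample"), np.zeros((6, 3), dtype="i1")),
        "contig": ("variant", np.array([0, 0, 1, 1, 1, 2])),
    }
)

# Number of variants per contig, as in the snippet above.
chunks = np.unique(ds.contig, return_counts=True)[1]
# dsc = ds.chunk(dict(variant=tuple(chunks), sample=ds.dims["sample"]))  # needs dask

# One way to get a single contig's dataset without global index arithmetic:
ds_contig1 = ds.isel(variant=np.flatnonzero(ds.contig.values == 1))
print(ds_contig1.sizes["variant"])  # 3
```

The boolean selection works regardless of how the dataset is chunked; whether it preserves contig-aligned chunk boundaries is a separate question.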
-
(Posted by @alimanfoo) Hi folks, I did a bit of hacking to see if I could write a function to build an xarray dataset holding some existing Zarr data generated via the scikit-allel vcf_to_zarr function. Here's a notebook with code and examples. TL;DR: it is very doable and might be useful as a way to experiment with using xarray on some existing data, without having to modify the data to use the special xarray attributes.
-
(Posted by @eric-czech) Hey @alimanfoo, looking at that example some more I had a few thoughts on what might make for a good convergence in Xarray representations for our use cases and yours. I'm trying to envision a format that would live just downstream of VCF, PLINK, and BGEN readers (like you suggested in the first post) but adhere to just enough conventions that it would contain everything we need for QC and association analysis. I think a dataset like this could contain a lot of extra "entry" fields from VCF files as well but I wanted to propose some structures for the things I mentioned on our call today that are mostly unrelated to those. Here's a layout I was imagining:
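A toy sketch of a dataset along these lines, using the field names discussed in the replies below (variant/contig, variant/pos, variant/alleles, sample/id, entry/GT, entry/GT/mask); the values, dtypes, and the contig-name attribute are illustrative only, not a settled spec:

```python
import numpy as np
import xarray as xr

n_variant, n_sample, ploidy = 4, 2, 2
ds = xr.Dataset(
    {
        "variant/contig": ("variant", np.array([0, 0, 1, 1], dtype="i2")),
        "variant/pos": ("variant", np.array([100, 200, 50, 75], dtype="i4")),
        "variant/alleles": (("variant", "allele"), np.array([[b"A", b"T"]] * 4)),
        "sample/id": ("sample", np.array(["S0", "S1"])),
        "entry/GT": (("variant", "sample", "ploidy"),
                     np.zeros((n_variant, n_sample, ploidy), dtype="i1")),
        "entry/GT/mask": (("variant", "sample", "ploidy"),
                          np.zeros((n_variant, n_sample, ploidy), dtype=bool)),
    },
    # Contig index -> name lookup stored as metadata (an assumption).
    attrs={"contigs": ["1", "2"]},
)
```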
Chunk boundaries at contig boundaries would be a nice standard too, I think, since that would be natural coming from zarr as well as from datasets like UKB where the bgen data is already broken up that way. A few questions for you then:
-
(Posted by @tomwhite) A quick comment on the missingness mask:
-
(Posted by @tomwhite) Overall, the xarray representation seems very promising. @eczech, is it worth trying to translate it into code so we can get a simple end-to-end analysis working? I was thinking it would be a thin wrapper around xarray (the arguments are arrays):
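Something along these lines, perhaps; the function name, argument names, and field names here are illustrative, not the final API:

```python
import numpy as np
import xarray as xr

def create_genotype_call_dataset(variant_contig, variant_pos, variant_alleles,
                                 sample_id, call_gt):
    """Assemble raw arrays into an xarray.Dataset (a sketch, not the real API)."""
    return xr.Dataset({
        "variant/contig": ("variant", variant_contig),
        "variant/pos": ("variant", variant_pos),
        "variant/alleles": (("variant", "allele"), variant_alleles),
        "sample/id": ("sample", sample_id),
        "entry/GT": (("variant", "sample", "ploidy"), call_gt),
    })

# Example usage with toy arrays.
ds = create_genotype_call_dataset(
    np.array([0, 0], dtype="i2"),
    np.array([100, 200], dtype="i4"),
    np.array([[b"A", b"T"], [b"C", b"G"]]),
    np.array(["S0", "S1"]),
    np.zeros((2, 2, 2), dtype="i1"),
)
```

The wrapper itself stays thin: all it does is pin down dimension names and variable names, which is exactly where a convention is needed.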
Attributes can be added in a later iteration. GitHub is currently down, but perhaps we should move to a PR/discussion there?
-
(Posted by @eric-czech) Sounds great @tomwhite; it makes sense to me to start there and move any further discussion to the repo.
-
(Posted by @alimanfoo) Hi Eric, sorry for the slow reply on this one. I like the direction; some potted thoughts...

Re prefixes: very happy to go with singular "variant/" and "sample/"; I wish I had done that in scikit-allel originally. Happy to drop "calldata/", although I'd have a mild preference for "call/" over "entry/" just because "entry/" sounds a bit bland. But it may help to have naming similarity with Hail, so not a strong objection.

Re "variant/contig": happy with this. The name "CHROM" in VCF was always badly chosen for organisms without a chromosome-level assembly. Would "variant/contig" values always be an index into a list of contig names stored in metadata? Most users will want to make data selections using contig name rather than contig index, so we'd need to make sure that is handled in the API somehow.

Re "variant/pos" and "variant/id": happy with these, although (mundane point) capitalisation? In VCF these are "POS" and "ID". What's our rule on when we diverge from VCF capitalisation versus when we preserve it? E.g., you currently have "entry/GT" (capitalised). Side point: what about data coming from the VCF INFO field? E.g., if a VCF has an "MQ" field in its INFO field, where would that end up? I think I'd vote for "variant/MQ", but just wanted to check.

Re "variant/alleles": happy with this name and with storing a single array rather than separate "REF" and "ALT". We might want to leave some flexibility on the shape and dtype; e.g., I currently work with SNPs and it's easy and convenient to have this as a (VARIANTS, 4) shape array with dtype "S1". Btw, it would be good to also explicitly discuss standard dimension names. Your post implicitly suggested "VARIANTS", "SAMPLES", "PLOIDY", "GENOTYPES", all of which sound good to me; we might need a few others like "ALLELES".

Re "sample/id": happy with that, the only question being capitalisation (we should match whatever happens with "variant/id").

Re "entry/GT": happy with that, although see above re capitalisation.

Re "entry/GP": what about "entry/GL"? Generally these are talked about as genotype likelihoods. And what about phred-scaled likelihoods (PL)?

Re "entry/GT/mask": a couple of questions. If entry/GT uses -1 as the missing signal, is entry/GT/mask a calculated field? Is it True where GT is called, or where it's missing?

Re "entry/GT/phased": happy with that.

Re coordinates: I don't fully understand how xarray deals with them, so happy to go with the flow for now. I guess this will all come out when we figure out the API and indexing implementation for selecting data in a given genome region.

Re metadata: we don't currently have a need for reference genome metadata; we always "know" what reference genome was used and load it via pyfasta or pyfaidx when we need it. But that's mainly because we're working with data we made ourselves; if consuming someone else's data, it would be valuable to have some metadata. It would also be needed if we store the contig names.

Cheers.
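On the mask question, one reading of the convention being discussed (an assumption here, not a settled spec) is that GT uses -1 as the missing sentinel and the mask is a derived boolean array that is True where a call is missing. A sketch of why that polarity is convenient:

```python
import numpy as np

# Toy GT array: (variants, samples, ploidy), -1 = missing call.
gt = np.array(
    [[[0, 1], [-1, -1]],
     [[1, 1], [0, -1]]], dtype="i1")

gt_mask = gt < 0  # True = missing

# With True=missing, a masked-array backend falls out directly:
gt_masked = np.ma.masked_array(gt, mask=gt_mask)
print(float(gt_masked.mean()))  # mean over called alleles only
```

Routines that understand the sentinel (e.g. numba/cuda kernels) can ignore the mask entirely, while generic numpy code can use the masked view.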
-
(Posted by @eric-czech) Thanks @alimanfoo! A few more thoughts/responses:
👍 I think there are arguments that could be made for row/col/entry monikers rather than variant/sample/call, or for just omitting the prefix and using search functions based on the dimensions present in arrays (kind of like pandas.DataFrame.select_dtypes), but I've got no objections to variant/sample/call if that sits well with you.
I think going either all-lower or all-upper within each group of variables makes sense (to avoid filesystem case-sensitivity issues). I'm partial to [a-z_]+ since it has high portability with naming rules in other systems and involves a few fewer keystrokes, but we're already breaking it with '/' and upper-case INFO fields, so I don't have a strong opinion. I'm on board with upper case for all variables after the initial lower-case prefix if it means our VCF-to-xarray converter would need to change less.
Yep, that's what I was thinking too. I wanted to standardize these fields because I think we'll all care about them as inputs to LD-pruning, PCA, and QC operations but I'm hoping these conventions for missingness, naming, phasing and genome reference will apply to any other INFO fields that are necessary.
Nice, I think it's good progress to agree that we'll represent alleles in a single field (i.e. not separate "a1", "a2", etc. fields). A (VARIANTS, 4) array will break down for some of our use cases but they'll be less common and more specific to working across studies. If it's at least in one field, I think it won't be too hard to automatically wrap and unwrap from an indexable form later on.
Let's do sample/ID and variant/ID?
My assumption is that many of the VCF fields are used in ad-hoc QC and eventually stop being relevant once you get to this point just upstream from LD-pruning, PCA, kinship estimation, association tests, etc. I don't see any reason they couldn't be included though, and they probably don't need to adhere to conventions for missingness or something like representations of phased genotype likelihoods if there aren't going to be a bunch of downstream methods implementations that rely on a standard for them.
I think enforcing the presence of a mask that is True where values are missing is the simplest solution. I say that because, as @tomwhite suggested, we could then use masked backends where appropriate (following the True=missing convention there) or embed knowledge of the sentinel into numba/cuda routines (ignoring the mask). I had originally thought it would make more sense to lazily compute the masks on demand and never serialize them (e.g. as a cached property on dataset subclasses), but that may be an unnecessary level of optimization if we want to get moving faster. Let us know if you have any thoughts/preferences there based on your past experience with it. Thanks again for the feedback! This will make it much easier to know what to target with our I/O work.
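For reference, the lazily-computed-and-cached option could look something like the sketch below, using an xarray accessor instead of a Dataset subclass; the accessor name and field name are assumptions for illustration:

```python
import numpy as np
import xarray as xr

@xr.register_dataset_accessor("gt")
class GTAccessor:
    """Compute the GT missingness mask on demand and cache it (a sketch)."""

    def __init__(self, ds):
        self._ds = ds
        self._mask = None

    @property
    def mask(self):
        # Derived lazily from the -1 sentinel, then cached on the accessor.
        if self._mask is None:
            self._mask = self._ds["entry/GT"] < 0
        return self._mask

ds = xr.Dataset({"entry/GT": (("variant", "sample", "ploidy"),
                              np.array([[[0, -1]], [[1, 1]]], dtype="i1"))})
print(int(ds.gt.mask.sum()))  # 1 missing call
```

The trade-off is exactly the one above: this avoids serializing the mask, at the cost of every consumer needing to know the sentinel convention.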
-
(Posted by @alimanfoo) Replying to @eczech's point above that many VCF fields are used in ad-hoc QC and stop being relevant by this stage: showing my ignorance here, I was thinking of genotype likelihoods and genotype probabilities as being the same thing. Rereading the VCF spec, I see there is a distinction made between genotype likelihoods (GL) and genotype probabilities from imputation (GP). So do we expect …? And do we expect …?
-
(Posted by @eric-czech) Ah, good point. Hail loads probabilities from bgen into GP without scaling, so I think they're taking a little liberty with the field definition. GT and/or dosage are the core fields we would care about for most methods, so perhaps all that is really left to standardize is a dosage field. Are you aware of a convention for that in VCF? Otherwise, I think our bgen reader should create a dataset with GT, dosage, and GP, with parameterizations that control how GT and dosage are created (if at all); GP can simply be another field that we don't have to make any guarantees about until some downstream method implementations force us to. tl;dr I should have made …
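For the diploid biallelic case, the usual dosage derivation from genotype probabilities is the expected alt-allele count. A sketch, assuming GP is ordered [hom-ref, het, hom-alt] as in typical bgen/VCF conventions (an assumption here, not something the thread has settled):

```python
import numpy as np

# GP: (variants, samples, genotypes), ordered [hom-ref, het, hom-alt].
gp = np.array([
    [[1.0, 0.0, 0.0],   # certainly 0 alt alleles
     [0.1, 0.8, 0.1]],  # probably 1 alt allele
], dtype="f8")

# Expected alt-allele count per call.
dosage = gp[..., 1] + 2 * gp[..., 2]
print(dosage)  # [[0. 1.]]
```

A bgen reader parameterized as described above could apply exactly this reduction when asked to emit a dosage field, and skip it otherwise.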
-
(Posted by @alimanfoo) Re dosage, I found some info on how plink handles dosage when importing data from VCF here. Also noting that the VCF 4.3 spec seems to have clarified the meaning of the reserved FORMAT fields GL, GP, PL, and PP.
-
(Posted by @tomwhite) Thanks for all the discussion and feedback. I have started turning this into code here: https://github.com/pystatgen/sgkit/pull/10
This is how I have currently coded it, but there are some subtleties around how you get the index: is it provided (this depends on the I/O format)? How are names ordered? Hopefully this will become clearer once we start using it.
-
(Posted by @alimanfoo)
We'd like to use xarray to organise and work with variation data. It would be good to define some conventions for how we do this, i.e., what we call dimensions, typical names for certain arrays, how we deal with data from multiple contigs, etc. These conventions could then be a common target for data coming from different sources, e.g., plink files, VCF, zarr, etc.
In @eczech's xarray prototype, reading a plink file returns this dataset:
There are two dimensions, "sample" and "variant". The data variables include "data" which provides the genotype calls themselves as counts of alternate alleles. Variables like "sample_id", "is_female", etc., provide data about the samples. Variables like "contig", "pos", "a1", "a2" provide data about the variants.
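A toy stand-in for that dataset, built from the description above (the real repr from the prototype is not reproduced here; values are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    # Genotype calls as counts of alternate alleles.
    "data": (("variant", "sample"), np.array([[0, 1], [2, 0], [1, 1]], dtype="i1")),
    # Per-sample variables.
    "sample_id": ("sample", np.array(["S0", "S1"])),
    "is_female": ("sample", np.array([True, False])),
    # Per-variant variables.
    "contig": ("variant", np.array([0, 0, 1], dtype="i2")),
    "pos": ("variant", np.array([100, 200, 50], dtype="i4")),
    "a1": ("variant", np.array([b"A", b"C", b"G"])),
    "a2": ("variant", np.array([b"T", b"G", b"A"])),
})
print(ds.sizes["variant"], ds.sizes["sample"])  # 3 2
```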
This approach seems very natural, but it would be good to explore how to generalise to data originating from VCF. We (MalariaGEN) generate VCF files as output from NGS variant calling, but then convert the data to zarr and run all our subsequent analyses from there. For convenience, we use the ability of zarr to organise arrays into a hierarchy of groups. E.g., a typical zarr hierarchy will look something like this:
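Sketched out, the layout being described is roughly the following (contig and array names are illustrative):

```text
/
├── samples            # array of sample IDs
├── contig1
│   ├── variants       # per-variant arrays, e.g. POS, REF, ALT
│   └── calldata       # per-call arrays, e.g. GT, GQ
└── contig2
    ├── variants
    └── calldata
```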
Here's an example of some real zarr data from Ag1000G following this layout.
In the above, "samples" is an array of sample IDs. We group the data by contig. Within each contig group we then have a sub-group called "variants" containing arrays that provide data about variants, and a sub-group called "calldata" containing arrays providing data about genotype calls.
To perform the VCF to zarr translation we use the current version of scikit-allel, there's a worked example on my blog.
We usually group by contig because this is convenient when working with a genome with a few large contigs (chromosomes). We don't always do this, however; when a genome has loads of small contigs, it can be more convenient to have data from all contigs concatenated together.
We have separate "variants" and "calldata" sub-groups because, coming from VCF, you sometimes have INFO and FORMAT fields with the same name (e.g., "DP"); we wanted to stick close to the VCF names where possible but avoid a name clash.
So I'm wondering, (1) what would be the best way to organise data like this as an xarray dataset, and (2) can we come up with a common convention so that data like this and data coming from plink can end up looking the same in xarray?