Xarray conventions for variation data #695
Replies: 12 comments
-
(Posted by @eric-czech) My two cents is that two good examples of approaches at the extremes are Hail and what you did with scikit-allel. In Hail, they basically pass around a data structure that looks a lot like that Zarr group to every single function (though partitioning by chromosome isn't part of it). I liked what you did in scikit-allel better, though, because it separates data validation and projection/extraction from the method implementations -- i.e., you don't have to worry about what you're given in a method; you can just assume whatever should be true for the structure. One possible sequence we could go through in workflows would be:
Distilling that a bit:
I know I kept using the variable name "data" as the sort of primary array in my prototype examples. As far as grouping by contig goes, I've been using something like this for any per-chromosome computations:

# Get the number of variants per contig
chunks = np.unique(ds.contig, return_counts=True)[1]
# Rechunk all variables in the Dataset at contig boundaries
dsc = ds.chunk(dict(variant=tuple(chunks), sample=ds.dims['sample']))

so then all the variables end up chunked like that for 24 contigs. What I haven't figured out yet is whether it's possible to get the dataset representing a single contig without doing a slice on global indexes. I'm also not sure whether open_zarr would automatically load into chunks like that if stored as one group per contig. Have you ever tried that? Thank you as usual for all the details!
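For illustration, here is that rechunking pattern on a toy dataset, plus one possible answer to the single-contig question: select by a boolean condition on the contig variable rather than by global index offsets. This is only a sketch (variable and dimension names follow the prototype; the dask-backed chunking call is shown commented out since it needs dask installed).

```python
import numpy as np
import xarray as xr

# Toy dataset shaped like the prototype: a per-variant contig index.
ds = xr.Dataset(
    {
        "data": (("variant", "sample"), np.zeros((6, 3), dtype="i1")),
        "contig": ("variant", np.array([0, 0, 1, 1, 1, 2])),
    }
)

# Number of variants per contig, as in the snippet above.
chunks = np.unique(ds.contig, return_counts=True)[1]
# dsc = ds.chunk(dict(variant=tuple(chunks), sample=ds.dims["sample"]))  # needs dask

# One way to get a single contig's dataset without global index arithmetic:
ds_contig1 = ds.isel(variant=np.flatnonzero(ds.contig.values == 1))
print(ds_contig1.sizes["variant"])  # 3
```

The boolean selection works regardless of how the dataset is chunked; whether it preserves contig-aligned chunk boundaries is a separate question.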
-
(Posted by @alimanfoo) Hi folks, I did a bit of hacking to see if I could write a function to build an xarray dataset holding some existing Zarr data generated via the scikit-allel vcf_to_zarr function. Here's a notebook with code and examples. TL;DR: it is very doable and might be useful as a way to experiment with using xarray on some existing data, without having to modify the data to use the special xarray attributes.
-
(Posted by @eric-czech) Hey @alimanfoo, looking at that example some more I had a few thoughts on what might make for a good convergence in Xarray representations for our use cases and yours. I'm trying to envision a format that would live just downstream of VCF, PLINK, and BGEN readers (like you suggested in the first post) but adhere to just enough conventions that it would contain everything we need for QC and association analysis. I think a dataset like this could contain a lot of extra "entry" fields from VCF files as well but I wanted to propose some structures for the things I mentioned on our call today that are mostly unrelated to those. Here's a layout I was imagining:
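A toy sketch of a dataset along these lines, using the field names discussed in the replies below (variant/contig, variant/pos, variant/alleles, sample/id, entry/GT, entry/GT/mask); the values, dtypes, and the contig-name attribute are illustrative only, not a settled spec:

```python
import numpy as np
import xarray as xr

n_variant, n_sample, ploidy = 4, 2, 2
ds = xr.Dataset(
    {
        "variant/contig": ("variant", np.array([0, 0, 1, 1], dtype="i2")),
        "variant/pos": ("variant", np.array([100, 200, 50, 75], dtype="i4")),
        "variant/alleles": (("variant", "allele"), np.array([[b"A", b"T"]] * 4)),
        "sample/id": ("sample", np.array(["S0", "S1"])),
        "entry/GT": (("variant", "sample", "ploidy"),
                     np.zeros((n_variant, n_sample, ploidy), dtype="i1")),
        "entry/GT/mask": (("variant", "sample", "ploidy"),
                          np.zeros((n_variant, n_sample, ploidy), dtype=bool)),
    },
    # Contig index -> name lookup stored as metadata (an assumption).
    attrs={"contigs": ["1", "2"]},
)
```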
Chunk boundaries at contig boundaries would be a nice standard too, I think, since that would be natural coming from zarr as well as from datasets like UKB where the bgen data is already broken up that way. A few questions for you then:
-
(Posted by @tomwhite) A quick comment on the missingness mask:
-
(Posted by @tomwhite) Overall, the xarray representation seems very promising. @eczech, is it worth trying to translate it into code so we can get a simple end-to-end analysis working? I was thinking it would be a thin wrapper around xarray (the arguments are arrays):
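Something along these lines, perhaps; the function name, argument names, and field names here are illustrative, not the final API:

```python
import numpy as np
import xarray as xr

def create_genotype_call_dataset(variant_contig, variant_pos, variant_alleles,
                                 sample_id, call_gt):
    """Assemble raw arrays into an xarray.Dataset (a sketch, not the real API)."""
    return xr.Dataset({
        "variant/contig": ("variant", variant_contig),
        "variant/pos": ("variant", variant_pos),
        "variant/alleles": (("variant", "allele"), variant_alleles),
        "sample/id": ("sample", sample_id),
        "entry/GT": (("variant", "sample", "ploidy"), call_gt),
    })

# Example usage with toy arrays.
ds = create_genotype_call_dataset(
    np.array([0, 0], dtype="i2"),
    np.array([100, 200], dtype="i4"),
    np.array([[b"A", b"T"], [b"C", b"G"]]),
    np.array(["S0", "S1"]),
    np.zeros((2, 2, 2), dtype="i1"),
)
```

The wrapper itself stays thin: all it does is pin down dimension names and variable names, which is exactly where a convention is needed.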
Attributes can be added in a later iteration. GitHub is currently down, but perhaps we should move to a PR/discussion there?
-
(Posted by @eric-czech) Sounds great @tomwhite; it makes sense to me to start there and move any further discussion to the repo.
-
(Posted by @alimanfoo) Hi Eric, sorry for the slow reply on this one. I like the direction; some potted thoughts...

Re prefixes: very happy to go with singular "variant/" and "sample/"; I wish I had done that in scikit-allel originally. Happy to drop "calldata/", although I'd have a mild preference for "call/" over "entry/" just because "entry/" sounds a bit bland. But it may help to have naming similarity with Hail, so not a strong objection.

Re "variant/contig": happy with this. The name "CHROM" in VCF was always badly chosen for organisms without a chromosome-level assembly. Would "variant/contig" values always be an index into a list of contig names stored in metadata? Most users will want to make data selections using contig name rather than contig index, so we'd need to make sure that is handled in the API somehow.

Re "variant/pos" and "variant/id": happy with these, although (mundane point) capitalisation? In VCF these are "POS" and "ID". What's our rule on when we diverge from VCF capitalisation versus when we preserve it? E.g., you currently have "entry/GT" (capitalised). Side point: what about data coming from the VCF INFO field? E.g., if a VCF has an "MQ" field in its INFO field, where would that end up? I think I'd vote for "variant/MQ", but just wanted to check.

Re "variant/alleles": happy with this name and with storing a single array rather than separate "REF" and "ALT". We might want to leave some flexibility on the shape and dtype; e.g., I currently work with SNPs and it's easy and convenient to have this as a (VARIANTS, 4) shape array with dtype "S1". Btw, it would be good to also explicitly discuss standard dimension names. Your post implicitly suggested "VARIANTS", "SAMPLES", "PLOIDY", "GENOTYPES", all of which sound good to me; we might need a few others like "ALLELES".

Re "sample/id": happy with that, the only question being capitalisation (we should match whatever happens with "variant/id").

Re "entry/GT": happy with that, although see above re capitalisation.

Re "entry/GP": what about "entry/GL"? Generally these are talked about as genotype likelihoods. And what about phred-scaled likelihoods (PL)?

Re "entry/GT/mask": a couple of questions. If entry/GT uses -1 as the missing signal, is entry/GT/mask a calculated field? Is it True where GT is called, or where it's missing?

Re "entry/GT/phased": happy with that.

Re coordinates: I don't fully understand how xarray deals with them, so happy to go with the flow for now. I guess this will all come out when we figure out the API and indexing implementation for selecting data in a given genome region.

Re metadata: we don't currently have a need for reference genome metadata; we always "know" what reference genome was used and load it via pyfasta or pyfaidx when we need it. But that's mainly because we're working with data we made ourselves; if consuming someone else's data, it would be valuable to have some metadata. It would also be needed if we store the contig names.

Cheers.
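On the mask question, one reading of the convention being discussed (an assumption here, not a settled spec) is that GT uses -1 as the missing sentinel and the mask is a derived boolean array that is True where a call is missing. A sketch of why that polarity is convenient:

```python
import numpy as np

# Toy GT array: (variants, samples, ploidy), -1 = missing call.
gt = np.array(
    [[[0, 1], [-1, -1]],
     [[1, 1], [0, -1]]], dtype="i1")

gt_mask = gt < 0  # True = missing

# With True=missing, a masked-array backend falls out directly:
gt_masked = np.ma.masked_array(gt, mask=gt_mask)
print(float(gt_masked.mean()))  # mean over called alleles only
```

Routines that understand the sentinel (e.g. numba/cuda kernels) can ignore the mask entirely, while generic numpy code can use the masked view.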
-
(Posted by @eric-czech) Thanks @alimanfoo! A few more thoughts/responses:
👍 I think there are arguments that could be made for row/col/entry monikers rather than variant/sample/call, or for just omitting the prefix and using search functions based on the dimensions present in arrays (kind of like pandas.DataFrame.select_dtypes), but I've got no objections to variant/sample/call if that sits well with you.
I think going either all-lower or all-upper within each group of variables makes sense (to avoid filesystem case-sensitivity issues). I'm partial to [a-z_]+ since it has high portability with naming rules in other systems and involves a few fewer keystrokes, but we're already breaking it with '/' and upper-case INFO fields, so I don't have a strong opinion. I'm on board with upper case for all variables after the initial lower-case prefix if it means our VCF-to-xarray converter would need to change less.
Yep, that's what I was thinking too. I wanted to standardize these fields because I think we'll all care about them as inputs to LD-pruning, PCA, and QC operations but I'm hoping these conventions for missingness, naming, phasing and genome reference will apply to any other INFO fields that are necessary.
Nice, I think it's good progress to agree that we'll represent alleles in a single field (i.e. not separate "a1", "a2", etc. fields). A (VARIANTS, 4) array will break down for some of our use cases but they'll be less common and more specific to working across studies. If it's at least in one field, I think it won't be too hard to automatically wrap and unwrap from an indexable form later on.
Let's do sample/ID and variant/ID?
My assumption is that many of the VCF fields are used in ad-hoc QC and eventually stop being relevant once you get to this point just upstream from LD-pruning, PCA, kinship estimation, association tests, etc. I don't see any reason they couldn't be included though, and they probably don't need to adhere to conventions for missingness or something like representations of phased genotype likelihoods if there aren't going to be a bunch of downstream methods implementations that rely on a standard for them.
I think enforcing the presence of a mask that is True where values are missing is the simplest solution. I say that because, as @tomwhite suggested, we could then use masked backends where appropriate (following the True=missing convention there) or embed knowledge of the sentinel into numba/cuda routines (ignoring the mask). I had originally thought it would make more sense to lazily compute the masks on demand and never serialize them (e.g. as a cached property on dataset subclasses), but that may be an unnecessary level of optimization if we want to get moving faster. Let us know if you have any thoughts/preferences there based on your past experience with it. Thanks again for the feedback! This will make it much easier to know what to target with our I/O work.
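For reference, the lazily-computed-and-cached option could look something like the sketch below, using an xarray accessor instead of a Dataset subclass; the accessor name and field name are assumptions for illustration:

```python
import numpy as np
import xarray as xr

@xr.register_dataset_accessor("gt")
class GTAccessor:
    """Compute the GT missingness mask on demand and cache it (a sketch)."""

    def __init__(self, ds):
        self._ds = ds
        self._mask = None

    @property
    def mask(self):
        # Derived lazily from the -1 sentinel, then cached on the accessor.
        if self._mask is None:
            self._mask = self._ds["entry/GT"] < 0
        return self._mask

ds = xr.Dataset({"entry/GT": (("variant", "sample", "ploidy"),
                              np.array([[[0, -1]], [[1, 1]]], dtype="i1"))})
print(int(ds.gt.mask.sum()))  # 1 missing call
```

The trade-off is exactly the one above: this avoids serializing the mask, at the cost of every consumer needing to know the sentinel convention.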
-
(Posted by @alimanfoo) Replying to @eczech's point above that many VCF fields are used in ad-hoc QC and stop being relevant by this stage: showing my ignorance here, I was thinking of genotype likelihoods and genotype probabilities as being the same thing. Rereading the VCF spec, I see there is a distinction made between genotype likelihoods (GL) and genotype probabilities from imputation (GP). So do we expect …? And do we expect …?
-
(Posted by @eric-czech) Ah, good point. Hail loads probabilities from bgen into GP without scaling, so I think they're taking a little liberty with the field definition. GT and/or dosage are the core fields we would care about for most methods, so perhaps all that is really left to standardize is a dosage field. Are you aware of a convention for that in VCF? Otherwise, I think our bgen reader should create a dataset with GT, dosage, and GP, with parameterizations that control how GT and dosage are created (if at all); GP can simply be another field that we don't have to make any guarantees about until some downstream method implementations force us to. tl;dr I should have made …
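For the diploid biallelic case, the usual dosage derivation from genotype probabilities is the expected alt-allele count. A sketch, assuming GP is ordered [hom-ref, het, hom-alt] as in typical bgen/VCF conventions (an assumption here, not something the thread has settled):

```python
import numpy as np

# GP: (variants, samples, genotypes), ordered [hom-ref, het, hom-alt].
gp = np.array([
    [[1.0, 0.0, 0.0],   # certainly 0 alt alleles
     [0.1, 0.8, 0.1]],  # probably 1 alt allele
], dtype="f8")

# Expected alt-allele count per call.
dosage = gp[..., 1] + 2 * gp[..., 2]
print(dosage)  # [[0. 1.]]
```

A bgen reader parameterized as described above could apply exactly this reduction when asked to emit a dosage field, and skip it otherwise.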
-
(Posted by @alimanfoo) Re dosage, I found some info on how plink handles dosage when importing data from VCF here. Also noting that the VCF 4.3 spec seems to have clarified the meaning of the reserved FORMAT fields GL, GP, PL, and PP.
-
(Posted by @tomwhite) Thanks for all the discussion and feedback. I have started turning this into code here: https://github.com/pystatgen/sgkit/pull/10
This is how I have currently coded it, but there are some subtleties around how you get the index: is it provided (this depends on the I/O format)? How are names ordered? Hopefully this will become clearer once we start using it.
-
(Posted by @alimanfoo)
We'd like to use xarray to organise and work with variation data. It would be good to define some conventions for how we do this, i.e., what we call dimensions, typical names for certain arrays, how we deal with data from multiple contigs, etc. These conventions could then be a common target for data coming from different sources, e.g., plink files, VCF, zarr, etc.
In @eczech's xarray prototype, reading a plink file returns this dataset:
There are two dimensions, "sample" and "variant". The data variables include "data" which provides the genotype calls themselves as counts of alternate alleles. Variables like "sample_id", "is_female", etc., provide data about the samples. Variables like "contig", "pos", "a1", "a2" provide data about the variants.
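A toy stand-in for that dataset, built from the description above (the real repr from the prototype is not reproduced here; values are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    # Genotype calls as counts of alternate alleles.
    "data": (("variant", "sample"), np.array([[0, 1], [2, 0], [1, 1]], dtype="i1")),
    # Per-sample variables.
    "sample_id": ("sample", np.array(["S0", "S1"])),
    "is_female": ("sample", np.array([True, False])),
    # Per-variant variables.
    "contig": ("variant", np.array([0, 0, 1], dtype="i2")),
    "pos": ("variant", np.array([100, 200, 50], dtype="i4")),
    "a1": ("variant", np.array([b"A", b"C", b"G"])),
    "a2": ("variant", np.array([b"T", b"G", b"A"])),
})
print(ds.sizes["variant"], ds.sizes["sample"])  # 3 2
```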
This approach seems very natural, but it would be good to explore how to generalise to data originating from VCF. We (MalariaGEN) generate VCF files as output from NGS variant calling, but then convert the data to zarr and run all our subsequent analyses from there. For convenience, we use the ability of zarr to organise arrays into a hierarchy of groups. E.g., a typical zarr hierarchy will look something like this:
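Sketched out, the layout being described is roughly the following (contig and array names are illustrative):

```text
/
├── samples            # array of sample IDs
├── contig1
│   ├── variants       # per-variant arrays, e.g. POS, REF, ALT
│   └── calldata       # per-call arrays, e.g. GT, GQ
└── contig2
    ├── variants
    └── calldata
```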
Here's an example of some real zarr data from Ag1000G following this layout.
In the above, "samples" is an array of sample IDs. We group the data by contig. Within each contig group we then have a sub-group called "variants" containing arrays that provide data about variants, and a sub-group called "calldata" containing arrays providing data about genotype calls.
To perform the VCF to zarr translation we use the current version of scikit-allel, there's a worked example on my blog.
We usually group by contig because this is convenient when working with a genome with a few large contigs (chromosomes). We don't always do this, however; when a genome has loads of small contigs, it can be more convenient to have data from all contigs concatenated together.
We have separate "variants" and "calldata" sub-groups because, coming from VCF, you sometimes have INFO and FORMAT fields with the same name (e.g., "DP"); we wanted to stick close to the VCF names where possible but avoid a name clash.
So I'm wondering, (1) what would be the best way to organise data like this as an xarray dataset, and (2) can we come up with a common convention so that data like this and data coming from plink can end up looking the same in xarray?