01_gds_intro.Rmd

# 1. Introduction to GDS Format

This tutorial introduces Genomic Data Structure (GDS), which is a storage format that can efficiently store genomic data and provide fast  access to subsets of the data. For more information on GDS for sequence data, see the [SeqArray package vignette](https://github.com/zhengxwen/SeqArray/blob/master/vignettes/SeqArrayTutorial.Rmd).

## Convert a VCF to GDS

To use the R packages developed at the University of Washington Genetic Analysis Center for analyzing sequence data, we first need to convert a VCF file to GDS. (If the file is BCF, use [https://samtools.github.io/bcftools/bcftools.html](bcftools) to convert to VCF). 

For these tutorials, we use a subset of data from the 1000 Genomes Project phase 3 callset from 2015 (PMID: 26432245) that has been lifted-over from the GRCh37 to GRCh38 using CrossMap, as described in Byrska-Bishop et al., 2022 (PMID: 36055201). The data is available [here](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/phase3_liftover_nygc_dir/).

```{r vcf2gds, message = FALSE}
library(SeqArray)

repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
if (!dir.exists("data")) dir.create("data")

# file path to the VCF file to *read* data from 
vcffile <- "data/1KG_phase3_GRCh38_subset_chr1.vcf.gz"
if (!file.exists(vcffile)) download.file(file.path(repo_path, vcffile), vcffile)

# file path to *write* the output GDS file to 
gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"

# convert the VCF to GDS
seqVCF2GDS(vcffile, gdsfile, fmt.import="GT")
```

## Exploring a GDS File

### Open a GDS

We can interact with the GDS file using the [SeqArray R package](https://bioconductor.org/packages/release/bioc/html/SeqArray.html). The first thing we need to do is open a connection to a GDS file on disk using the `seqOpen` function. Note that opening a GDS file does _not_ load all of the data into memory.

```{r seqarray}
# open a connection to the GDS file
gds <- seqOpen(gdsfile)
gds
```

### Reading Data

The `seqGetData` function is the basic function for reading in data from a GDS file
```{r seqGetData}
# the unique sample identifier comes from the VCF header
sample.id <- seqGetData(gds, "sample.id")
length(sample.id)
head(sample.id)

# a unique integer ID is assigned to each variant
variant.id <- seqGetData(gds, "variant.id")
length(variant.id)
head(variant.id)

chr <- seqGetData(gds, "chromosome")
head(chr)

pos <- seqGetData(gds, "position")
head(pos)

id <- seqGetData(gds, "annotation/id")
head(id)
```

There are additional useful functions for summary level data, such as calculating allele frequencies.

```{r minor_freq}
# minor allele frequency of each variant
maf <- seqAlleleFreq(gds, minor = TRUE)
head(maf)
summary(maf)
hist(maf, breaks=50)
```

#### Data Filters

We can define a filter on the `gds` object. After using the `seqSetFilter` command, all subsequent reads from the `gds` object are restricted to the selected subset of data, until a new filter is defined or `seqResetFilter` is called to clear the filter.

```{r filter}
seqSetFilter(gds, variant.id=variant.id[1:10], sample.id=sample.id[1:5])
```

```{r samp.id}
# only returns data for the filtered variants
seqGetData(gds, "sample.id")
```

```{r var.id}
# only returns data for the filtered variants
seqGetData(gds, "variant.id")
seqGetData(gds, "position")
```

### Genotype Data

Genotype data is stored in a 3-dimensional array, where the first dimension is always length 2 for diploid genotypes. The second and third dimensions are samples and variants, respectively. The values of the array denote alleles: `0` is the reference allele and `1` is the alternate allele. For multiallelic variants, other alternate alleles are represented as integers `> 1`.

```{r genotypes}
geno <- seqGetData(gds, "genotype")
dim(geno)
# print the first two variants
geno[,,1:2]
```

The [SeqVarTools R package](http://bioconductor.org/packages/SeqVarTools) has some additional functions for interacting with SeqArray-format GDS files. There are functions providing more intuitive ways to read in genotypes. What does each of the following functions return?

```{r seqvartools_geno}
library(SeqVarTools)

# return genotypes in matrix format
getGenotype(gds)
getGenotypeAlleles(gds)
refDosage(gds)
altDosage(gds)
```

### Variant Information 

There are functions to extract variant-level information. 

```{r seqvartools_varinfo}
# look at reference and alternate alleles
refChar(gds)
altChar(gds)

# data.frame of variant information
variantInfo(gds)
```

We can also return variant information as a `GRanges` object from the [GenomicRanges package](https://bioconductor.org/packages/release/bioc/manuals/GenomicRanges/man/GenomicRanges.pdf). This format for representing sequence data is common across many Bioconductor packages. Chromosome is stored in the `seqnames` column. The `ranges` column has variant position, which can be a single base pair or a range. We will use `GRanges` objects when we analyze sets of variants (e.g. in genes).

```{r granges}
# reset the filter to all variants and samples
seqResetFilter(gds)

gr <- granges(gds)
gr
```

### Close a GDS

Always use the `seqClose` command to close your connection to a GDS file when you are done working with it. Trying to open an already opened GDS will result in an error. 

```{r intro_close}
seqClose(gds)
```


## Exercise 1.1 (Application)

The Apps on the BioData Catalyst powered by Seven Bridges platform allow you to easily scale up cloud computing to running analyses on all chromosomes genome-wide and with larger samples. Use the `VCF to GDS Converter` app to convert the example 1000 Genomes files into GDS files. The steps to perform this analysis are as follows:

- Copy the app to your project if it is not already there:
  - Click: Public Resources > Workflows and Tools > Browse
  - Search for `VCF to GDS Converter`
  - Click: Copy > Select your project > Copy
- Run the analysis in your project:
  - Click: Apps > `VCF to GDS Converter` > Run
  - Specify the Inputs:
    - Variants Files: `1KG_phase3_GRCh38_subset_chr<CHR>.vcf.gz` (select all 22 chromosomes)
  - Specify the App Settings:
    - check GDS: No
  - Click: Run
  
The analysis will take several minutes to run. You can find your analysis in the Tasks menu of your Project. Use the "View stats & logs" button to check on the status of your tasks. Click on the bar that says "vcf2gds", click "View Logs", and click one of the "vcf2gds" folders (there's one per chromosome). In here you can see detailed logs of the job; take a look at the `job.out.log` and `job.err.log` -- these can be useful for debugging issues. 

The output of this analysis will be a set of 22 GDS files, one per chromosome. 

You can find the expected output of this analysis by looking at the existing task `01 Convert VCF to GDS` in the Tasks menu of your Project.