-
Notifications
You must be signed in to change notification settings - Fork 2
/
01_gds_intro.Rmd
171 lines (119 loc) · 6.74 KB
/
01_gds_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
# 1. Introduction to GDS Format
This tutorial introduces Genomic Data Structure (GDS), which is a storage format that can efficiently store genomic data and provide fast access to subsets of the data. For more information on GDS for sequence data, see the [SeqArray package vignette](https://github.com/zhengxwen/SeqArray/blob/master/vignettes/SeqArrayTutorial.Rmd).
## Convert a VCF to GDS
To use the R packages developed at the University of Washington Genetic Analysis Center for analyzing sequence data, we first need to convert a VCF file to GDS. (If the file is BCF, use [https://samtools.github.io/bcftools/bcftools.html](bcftools) to convert to VCF).
For these tutorials, we use a subset of data from the 1000 Genomes Project phase 3 callset from 2015 (PMID: 26432245) that has been lifted-over from the GRCh37 to GRCh38 using CrossMap, as described in Byrska-Bishop et al., 2022 (PMID: 36055201). The data is available [here](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/phase3_liftover_nygc_dir/).
```{r vcf2gds, message = FALSE}
library(SeqArray)
repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
if (!dir.exists("data")) dir.create("data")
# file path to the VCF file to *read* data from
vcffile <- "data/1KG_phase3_GRCh38_subset_chr1.vcf.gz"
if (!file.exists(vcffile)) download.file(file.path(repo_path, vcffile), vcffile)
# file path to *write* the output GDS file to
gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"
# convert the VCF to GDS
seqVCF2GDS(vcffile, gdsfile, fmt.import="GT")
```
## Exploring a GDS File
### Open a GDS
We can interact with the GDS file using the [SeqArray R package](https://bioconductor.org/packages/release/bioc/html/SeqArray.html). The first thing we need to do is open a connection to a GDS file on disk using the `seqOpen` function. Note that opening a GDS file does _not_ load all of the data into memory.
```{r seqarray}
# open a connection to the GDS file
gds <- seqOpen(gdsfile)
gds
```
### Reading Data
The `seqGetData` function is the basic function for reading in data from a GDS file
```{r seqGetData}
# the unique sample identifier comes from the VCF header
sample.id <- seqGetData(gds, "sample.id")
length(sample.id)
head(sample.id)
# a unique integer ID is assigned to each variant
variant.id <- seqGetData(gds, "variant.id")
length(variant.id)
head(variant.id)
chr <- seqGetData(gds, "chromosome")
head(chr)
pos <- seqGetData(gds, "position")
head(pos)
id <- seqGetData(gds, "annotation/id")
head(id)
```
There are additional useful functions for summary level data, such as calculating allele frequencies.
```{r minor_freq}
# minor allele frequency of each variant
maf <- seqAlleleFreq(gds, minor = TRUE)
head(maf)
summary(maf)
hist(maf, breaks=50)
```
#### Data Filters
We can define a filter on the `gds` object. After using the `seqSetFilter` command, all subsequent reads from the `gds` object are restricted to the selected subset of data, until a new filter is defined or `seqResetFilter` is called to clear the filter.
```{r filter}
seqSetFilter(gds, variant.id=variant.id[1:10], sample.id=sample.id[1:5])
```
```{r samp.id}
# only returns data for the filtered variants
seqGetData(gds, "sample.id")
```
```{r var.id}
# only returns data for the filtered variants
seqGetData(gds, "variant.id")
seqGetData(gds, "position")
```
### Genotype Data
Genotype data is stored in a 3-dimensional array, where the first dimension is always length 2 for diploid genotypes. The second and third dimensions are samples and variants, respectively. The values of the array denote alleles: `0` is the reference allele and `1` is the alternate allele. For multiallelic variants, other alternate alleles are represented as integers `> 1`.
```{r genotypes}
geno <- seqGetData(gds, "genotype")
dim(geno)
# print the first two variants
geno[,,1:2]
```
The [SeqVarTools R package](http://bioconductor.org/packages/SeqVarTools) has some additional functions for interacting with SeqArray-format GDS files. There are functions providing more intuitive ways to read in genotypes. What does each of the following functions return?
```{r seqvartools_geno}
library(SeqVarTools)
# return genotypes in matrix format
getGenotype(gds)
getGenotypeAlleles(gds)
refDosage(gds)
altDosage(gds)
```
### Variant Information
There are functions to extract variant-level information.
```{r seqvartools_varinfo}
# look at reference and alternate alleles
refChar(gds)
altChar(gds)
# data.frame of variant information
variantInfo(gds)
```
We can also return variant information as a `GRanges` object from the [GenomicRanges package](https://bioconductor.org/packages/release/bioc/manuals/GenomicRanges/man/GenomicRanges.pdf). This format for representing sequence data is common across many Bioconductor packages. Chromosome is stored in the `seqnames` column. The `ranges` column has variant position, which can be a single base pair or a range. We will use `GRanges` objects when we analyze sets of variants (e.g. in genes).
```{r granges}
# reset the filter to all variants and samples
seqResetFilter(gds)
gr <- granges(gds)
gr
```
### Close a GDS
Always use the `seqClose` command to close your connection to a GDS file when you are done working with it. Trying to open an already opened GDS will result in an error.
```{r intro_close}
seqClose(gds)
```
## Exercise 1.1 (Application)
The Apps on the BioData Catalyst powered by Seven Bridges platform allow you to easily scale up cloud computing to running analyses on all chromosomes genome-wide and with larger samples. Use the `VCF to GDS Converter` app to convert the example 1000 Genomes files into GDS files. The steps to perform this analysis are as follows:
- Copy the app to your project if it is not already there:
- Click: Public Resources > Workflows and Tools > Browse
- Search for `VCF to GDS Converter`
- Click: Copy > Select your project > Copy
- Run the analysis in your project:
- Click: Apps > `VCF to GDS Converter` > Run
- Specify the Inputs:
- Variants Files: `1KG_phase3_GRCh38_subset_chr<CHR>.vcf.gz` (select all 22 chromosomes)
- Specify the App Settings:
- check GDS: No
- Click: Run
The analysis will take several minutes to run. You can find your analysis in the Tasks menu of your Project. Use the "View stats & logs" button to check on the status of your tasks. Click on the bar that says "vcf2gds", click "View Logs", and click one of the "vcf2gds" folders (there's one per chromosome). In here you can see detailed logs of the job; take a look at the `job.out.log` and `job.err.log` -- these can be useful for debugging issues.
The output of this analysis will be a set of 22 GDS files, one per chromosome.
You can find the expected output of this analysis by looking at the existing task `01 Convert VCF to GDS` in the Tasks menu of your Project.