Skip to content

Latest commit

 

History

History
106 lines (78 loc) · 5.69 KB

README.md

File metadata and controls

106 lines (78 loc) · 5.69 KB

Non-reference concordance

Nagraj, V. P., Scholz, M., Jessa, S., Ge, J., Woerner, A. E., Huang, M., Budowle, B., & Turner, S. D. (2022). vcferr: Development, validation, and application of a single nucleotide polymorphism genotyping error simulation framework. F1000Research, 11, 775. https://doi.org/10.12688/f1000research.122840.1

Build

Cloning this repo and building locally:

git clone [email protected]:signaturescience/nrc.git
cd nrc
docker build --no-cache -t nrc .

Or pull directly from the GitHub Container Registry:

docker pull ghcr.io/signaturescience/nrc
docker tag ghcr.io/signaturescience/nrc nrc

Usage

Usage: docker run nrc [-s sample] [-T targets.vcf.gz] <1.vcf.gz> <2.vcf.gz>

Example:

docker run --rm -v $(pwd):$(pwd) -w $(pwd) nrc exampledata/a.vcf.gz exampledata/b.vcf.gz
rrrr	rrra	rraa	rarr	rara	raaa	aarr	aara	aaaa	xrr	xra	xaa	x	m	mm	xm	total	nrd	nrc	disc	conc	rr	ra	aa	p_rrra	p_rraa	p_rarr	p_raaa	p_aarr	p_aara
21	1	2	3	17	4	5	6	13	3	7	11	21	30	51	51	72	0.412	0.588	0.292	0.708	24	24	24	0.0417	0.0833	0.125	0.167	0.208	0.25

Both VCF files must be bgzip-compressed and indexed (either tabix or bcftools index). By default with no options, nrc is using bcftools to compare all samples against all other samples. If you have two VCFs each with one sample (the same sample), this will work as written. If you have two multisample VCFs, you must specify the name of the sample from both VCFs you want to compare, using the -s argument, which is passed to bcftools stats -s/--sample. E.g.:

docker run --rm -v $(pwd):$(pwd) -w $(pwd) nrc -s sample1 exampledata/a.vcf.gz exampledata/b.vcf.gz

Additionally, the -T argument can accept a site VCF or TSV which is further passed to bcftools stats -T/--targets-file, limiting the comparison only to these targets. This is useful to restrict analysis to, for example, sites that have at least X% minor allele frequency from GnomAD, given that filtered site VCF. Example usage:

docker run --rm -v $(pwd):$(pwd) -w $(pwd) nrc -s sample1 -T sites.vcf.gz exampledata/a.vcf.gz exampledata/b.vcf.gz

Running with -v $(pwd):$(pwd) -w $(pwd) assumes vcf1 and vcf2 live in the current working directory. Alternatively, the default workdir in the container is /data, so you could mount a path to your data to /data and run as such:

docker run --rm -v /path/to/host/nrc/exampledata:/data nrc /data/a.vcf.gz /data/b.vcf.gz

Example data from this repo is copied to the container in /exampledata. To debug or experiment, jump into the container interactively with:

docker run --rm -it --entrypoint /bin/ash -w /exampledata nrc

Details

Nonreference concordance (nrc in the output) is calculated as $\text{NRC} = 1 - (x_{RR} + x_{RA} + x_{AA}) / (x_{RR} + x_{RA} + x_{AA} + m_{RA} + m_{AA})$. $x_{RR}$, $x_{RA}$, and $x_{AA}$ are the counts of the mismatches for the homozygous reference, heterozygous and homozygous alternative genotypes, while $m_{RA}$ and $m_{AA}$ are the counts of the matches at the heterozygous and homozygous alternative genotypes. Discordance and concordance (disc and conc in the output, respectively) are calculated conventionally, also using the number of homozygous reference matches $m_{RR}$, which are omitted from the NRC calculation.

The container script is running bcftools stats followed by post-processing in R to pull out the relevant info.

NOTE: The order in which the user inputs VCFs to nrc will not affect the overall concordance, non-reference concordance, discordance, or non-reference discordance metrics. However, the probabilities for switches will be oriented to the specific order in which the VCFs are specified. For example, docker run --rm -v $(pwd):$(pwd) -w $(pwd) nrc exampledata/a.vcf.gz exampledata/b.vcf.gz will produce different p_rraa, p_rrra, etc. compared to docker run --rm -v $(pwd):$(pwd) -w $(pwd) nrc exampledata/b.vcf.gz exampledata/a.vcf.gz. Users should be aware of this effect, especially when using nrc to evaluated batches of samples and conditions.

Metric Value Definition
rrrr 21 Count of RR to RR matches
rrra 1 Count of RR to RA mismatches
rraa 2 Count of RR to AA mismatches
rarr 3 Count of RA to RR mismatches
rara 17 Count of RA to RA matches
raaa 4 Count of RA to AA mismatches
aarr 5 Count of AA to RR mismatches
aara 6 Count of AA to RA mismatches
aaaa 13 Count of AA to AA matches
xrr 3 Count of RR mismatches
xra 7 Count of RA mismatches
xaa 11 Count of AA mismatches
x 21 xrr + xra + xaa
m 30 mra + maa
mm 51 mra + maa + rrrr
xm 51 x + m
total 72 x + m + mrr
nrd 0.4120 Nonreference discordance (1-NRC)
nrc 0.5880 Nonreference concordance (1-NRD)
disc 0.292 Overall discordance (counting RRRR)
conc 0.708 Overall concordance (counting RRRR)
rr 24 Total count of RR homozygotes
ra 24 Total count of RA heterozygotes
aa 24 Total count of AA homozygotes
p_rrra 0.0417 rrra / rr
p_rraa 0.0833 rraa / rr
p_rarr 0.1250 rarr / ra
p_raaa 0.1670 raaa / ra
p_aarr 0.2080 aarr / aa
p_aara 0.2500 aara / aa