Efficiently identifying shared genetic segments in large-scale data.
Reference: Saada et al. Identity-by-descent detection across 487,409 British samples reveals fine scale population structure and ultra-rare variant associations, 2020, Nature Communications
The boost/1.57.0 library is required.
make
g2 [options] <haps file> <sample file> <genetic map file> <output file>
Required inputs:
- haps file: SHAPEIT/IMPUTE format input phased haplotypes, alleles can be any string as long as haplotype entries are 0/1
- sample file: SHAPEIT/IMPUTE format input sample identifiers, only the second ID column is currently used for output
- genetic map file: Each row has three fields: [physical position] [cM/Mb] [cM], and the 2nd field is ignored
- output file: Pointer to where the outputs will go, will generate an $OUT.match file
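To make the genetic map layout concrete, here is a minimal Python sketch that reads the three-field rows and drops the second field, as described above. It assumes whitespace-separated fields and integer physical positions, neither of which is stated here. The helper names (`parse_genetic_map`, `interpolate_cm`) are hypothetical, and the linear interpolation is only an illustration of how such a map is commonly used, not a description of what g2 does internally.

```python
from bisect import bisect_left


def parse_genetic_map(lines):
    """Return (positions, cms) from rows of: position, cM/Mb rate, cM."""
    positions, cms = [], []
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated fields
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        try:
            pos, cm = int(fields[0]), float(fields[2])  # 2nd field is ignored
        except ValueError:
            continue  # skip a possible header row
        positions.append(pos)
        cms.append(cm)
    return positions, cms


def interpolate_cm(positions, cms, pos):
    """Linearly interpolate the cM coordinate at a physical position."""
    if pos <= positions[0]:
        return cms[0]
    if pos >= positions[-1]:
        return cms[-1]
    i = bisect_left(positions, pos)
    if positions[i] == pos:
        return cms[i]
    left_pos, right_pos = positions[i - 1], positions[i]
    frac = (pos - left_pos) / (right_pos - left_pos)
    return cms[i - 1] + frac * (cms[i] - cms[i - 1])


# Made-up rows for illustration only.
rows = ["1000 0.5 0.0", "2000 1.0 0.001", "4000 2.0 0.005"]
positions, cms = parse_genetic_map(rows)
midpoint_cm = interpolate_cm(positions, cms, 3000)
```

Positions outside the map are clamped to the first or last cM value in this sketch; a real pipeline may prefer to raise an error instead.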
Optional switches:
switch | description |
---|---|
-b | Binary output for large files, see parse_bmatch [default off] |
-d | Dynamic hash seed cutoff (for big N) [default = 0/off] |
-f | Minimum minor allele frequency [default = 0.0] |
-g | Allowed gaps between seeds [default = 1] |
-h | Haploid mode, do not allow switches between haplotypes [default off] |
-m | Minimum match length [default = 1.0] |
-s | Skip words with (seeds/samples) less than this value (for big N) [default = 0.0] |
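As an illustration of what the -f threshold refers to, the following Python sketch computes a minor allele frequency from a single row of a SHAPEIT-format haps file. It assumes the usual SHAPEIT layout of five leading columns (chromosome, variant ID, position, and two alleles) followed by 0/1 haplotype entries. The function name is hypothetical, and whether g2 applies the cutoff as a strict or inclusive comparison is not stated here, so the comparison at the end is an assumption.

```python
def minor_allele_frequency(haps_line):
    """Minor allele frequency of one haps row (0/1 entries after 5 columns)."""
    fields = haps_line.split()
    entries = fields[5:]  # assumption: 5 leading metadata columns
    ones = sum(1 for entry in entries if entry == "1")
    freq = ones / len(entries)
    return min(freq, 1.0 - freq)


# Made-up row: 4 samples, i.e. 8 haplotypes, 5 of which carry allele "1".
row = "1 rs1 1000 A G 0 1 0 0 1 1 1 1"
maf = minor_allele_frequency(row)

min_maf = 0.05  # hypothetical value that would be passed with -f
keep_variant = maf >= min_maf  # assumption: inclusive threshold
```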
Output goes into a $OUT.match file with each row containing the following entries:
ID1 | ID2 | P0 | P1 | cM | # words | # gaps |
---|---|---|---|---|---|---|
If haploid mode is on (-h) then ".0" or ".1" is appended to the IDs to indicate a match along the first or second haplotype.
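The row layout above can be read with a short Python sketch like the one below. It assumes whitespace-separated columns and that P0 and P1 are integer coordinates (presumably the segment start and end, which is not stated here). The `Match` class and both helper functions are hypothetical names introduced for illustration. The haplotype suffix is only split off when the caller says the file was produced with -h, since sample IDs may themselves contain dots.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    id1: str
    id2: str
    p0: int
    p1: int
    cm: float
    words: int
    gaps: int
    hap1: Optional[int] = None  # set only in haploid mode
    hap2: Optional[int] = None


def split_haplotype(sample_id):
    """Split a trailing ".0" or ".1" off an ID, returning (id, haplotype)."""
    base, _, suffix = sample_id.rpartition(".")
    if base and suffix in ("0", "1"):
        return base, int(suffix)
    return sample_id, None


def parse_match_line(line, haploid=False):
    """Parse one row: ID1 ID2 P0 P1 cM #words #gaps."""
    id1, id2, p0, p1, cm, words, gaps = line.split()[:7]
    hap1 = hap2 = None
    if haploid:
        id1, hap1 = split_haplotype(id1)
        id2, hap2 = split_haplotype(id2)
    return Match(id1, id2, int(p0), int(p1), float(cm),
                 int(words), int(gaps), hap1, hap2)


# Made-up rows for illustration only.
diploid = parse_match_line("A B 10 20 2.5 3 1")
haploid = parse_match_line("S1.0 S2.1 1000 5000 1.25 10 0", haploid=True)
```

A row with fewer than seven fields raises a `ValueError` in this sketch, which makes truncated output easy to spot.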
For large data, you can enable binary outputs by adding the -b switch, which will generate three files ($OUT.bmatch/bmid/bsid) that can be parsed using the provided parse_bmatch program (~3x reduction in file size).
make test runs sample data in the example/ directory using the following command:
./g2 -m 0.9 \
example/SIM.NE_20000.MATCH_FREQ.SHAPEIT.haps \
example/SIM.NE_20000.MATCH_FREQ.SHAPEIT.sample \
example/genMap.1KG.b37.chr1.map \
example/SIM.NE_20000.MATCH_FREQ.INFERRED.match
The output segments are then evaluated for accuracy using the example/accuracy.sh script.
This data was simulated using the ARGON software as shown in example/sim.sh, down-sampled to a HapMap3 allele frequency distribution, and phased with SHAPEIT2.