Commit

Merge branch 'AlphaGenes:devel' into devel
AprilYUZhang authored Oct 9, 2023
2 parents 10143af + 054cbff commit 7586308
Showing 1 changed file with 75 additions and 75 deletions.
150 changes: 75 additions & 75 deletions docs/source/usage.rst
@@ -6,66 +6,52 @@ Usage
Program options
===============

|Software| takes in a number of command line arguments to control the program's behavior. To view a list of arguments, run |Software| without any command line arguments, i.e. ``AlphaPeel`` or ``AlphaPeel -h``.


Core Arguments
--------------

::
Core arguments
-out prefix The output file prefix.

The ``-out`` argument gives the output file prefix for where the outputs of |Software| should be stored. By default, |Software| outputs a file with imputed genotypes, ``prefix.genotypes``, phased haplotypes ``prefix.phase``, and genotype dosages ``prefix.dosages``. For more information on which files are created, see "Output Arguments", below.

|Software| takes in several command line arguments to control the program's behaviour. To view a list of arguments, run |Software| without any command line arguments, i.e. ``AlphaPeel`` or ``AlphaPeel -h``.

Input Arguments
----------------
---------------

::

Input Options:
-bfile [BFILE [BFILE ...]]
File(s) in plink (binary) format. Only stable on
Linux).
-pedigree [PEDIGREE [PEDIGREE ...]]
Pedigree file(s) (see format below).
-genotypes [GENOTYPES [GENOTYPES ...]]
File(s) in AlphaGenes format.
Genotype File(s) (see format below).
-seqfile [SEQFILE [SEQFILE ...]]
Sequence data file(s).
-pedigree [PEDIGREE [PEDIGREE ...]]
Pedigree file(s) in AlphaGenes format.
-startsnp STARTSNP The first marker to consider. The first marker in the
file is marker "1".
-stopsnp STOPSNP The last marker to consider.

|Software| requires a pedigree file and one or more genotype files to run the analysis.

|Software| supports binary plink files, ``-bfile``, genotype files in the AlphaGenesFormat, ``-genotypes``, and sequence data read counts in the AlphaGenes format, ``-seqfile``. A pedigree file must be supplied using the ``-pedigree`` option.
Sequence allele read count file(s) (see format below).
-bfile [BFILE [BFILE ...]]
Plink (binary) file(s).
-startsnp STARTSNP The first marker to consider. The first marker is "1".
-stopsnp STOPSNP The last marker to consider.

Use the ``-startsnp`` and ``-stopsnp`` comands to run the analysis only on a subset of markers.
|Software| requires a pedigree file (``-pedigree``) and one or more genomic data files to run the analysis.

The input options in the form of ``[xxx [xxx ...]]`` can take in more than one input file that are seperated by space.
|Software| supports the following genomic data files: genotype files in the AlphaGenes format (``-genotypes``), sequence allele read counts in the AlphaGenes format (``-seqfile``), and binary Plink files (``-bfile``). Use of binary Plink files requires the package ``alphaplinkpython``, which can be installed via ``pip``, but is only stable on Linux. There are known issues with this package, so we do not advocate its use at the moment.

Binary plink files require the package ``alphaplinkpython``. This can be installed via ``pip`` but is only stable for Linux.
Use the ``-startsnp`` and ``-stopsnp`` options to run the analysis only on a subset of markers.
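
As a rough illustration of how a 1-based marker window maps onto a zero-based list, the sketch below subsets one row of genotypes. The helper ``subset_markers`` is hypothetical and is not part of |Software|; it assumes that ``-stopsnp`` is inclusive, as the wording "the last marker to consider" suggests.

```python
def subset_markers(values, startsnp=None, stopsnp=None):
    """Return markers startsnp..stopsnp (1-based, inclusive) from a list.

    Hypothetical helper for illustration only; not AlphaPeel's own code.
    """
    start = 0 if startsnp is None else startsnp - 1     # marker "1" is index 0
    stop = len(values) if stopsnp is None else stopsnp  # inclusive end marker
    if start < 0 or stop > len(values) or start >= stop:
        raise ValueError("marker window is outside the available markers")
    return values[start:stop]


genotypes = [0, 2, 9, 0]
# Keep only the second and third markers.
window = subset_markers(genotypes, startsnp=2, stopsnp=3)
```

Leaving both arguments unset returns every marker, which mirrors running the analysis without the two options.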

The input options in the form of ``[xxx [xxx ...]]`` can take more than one input file, separated by spaces.

Output Arguments
----------------

::

Output options:
-out PREFIX The output file prefix. All file outputs will be stored
as "PREFIX.dosage" and so on.
-writekey WRITEKEY Determines the order in which individuals are ordered
in the output file based on their order in the
corresponding input file. Animals not in the input
corresponding input file. Individuals not in the input
file are placed at the end of the file and sorted in
alphanumeric order. These animals can be suppressed
alphanumeric order. These individuals can be suppressed
with the "-onlykeyed" option. Options: id, pedigree,
genotypes, sequence, segregation. Defualt: id.
-onlykeyed Flag to suppress the animals who are not present in
the file used with -outputkey. Also suppresses "dummy"
animals.
-iothreads IOTHREADS Number of threads to use for io. Default: 1.
genotypes, sequence, segregation. Default: id.
-onlykeyed Flag to suppress the individuals not present in
the file used with "-outputkey". It also suppresses "dummy"
individuals.
-iothreads IOTHREADS Number of threads to use for input/output. Default: 1.


Peeling output options:
@@ -76,22 +62,22 @@ Output Arguments
-haps Flag to enable writing out the genotype probabilities.
-calling_threshold [CALLING_THRESHOLD [CALLING_THRESHOLD ...]]
Genotype calling threshold(s). Multiple space
separated values allowed. Use. .3 for best guess
separated values allowed. Use .3 for best guess
genotype.
-binary_call_files Flag to write out the called genotype files as a
binary plink output [Not yet implemented].

By default |Software| produces a dosages file, a segregation files and two parameter files (genotyping error and recombination rate). Creation of each of these files can be suppressed with the ``-no_dosages``, ``-no_seg``, and ``-no_params`` options. |Software| can also write out the genotype probability file (.haps) with the `-haps` argument.
By default |Software| produces a dosages file, a segregation file and two parameter files (genotyping error and recombination rate). Creation of these files can be suppressed with the ``-no_dosages``, ``-no_seg``, and ``-no_params`` options. |Software| can also write out the genotype probability file (.haps) with the ``-haps`` argument.

The ``-calling_threshold`` argument controls which genotypes (and phased haplotypes) are called as part of the algorithm. A calling threshold of 0.9 indicates that genotypes are only called if more than 90% of the final probability mass is on that genotype. Using a higher value will increase the accuracy of called genotypes, but will result in fewer genotypes being called. Since there are three genotype states, "best-guess" genotypes are produced with a calling threshold less than ``0.33``. The ``-binary_call_files`` option can be used to change the output to a plink binary format.
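
The thresholding rule can be sketched as follows. This is a minimal illustration and not |Software|'s implementation: the function name ``call_genotype`` is hypothetical, the four inputs follow the (aa, aA, Aa, AA) ordering of the genotype probability file, and the use of ``9`` for an uncalled genotype is an assumption borrowed from the input genotype format.

```python
MISSING = 9  # assumed code for an uncalled genotype (as in the genotype file)


def call_genotype(p_aa, p_aA, p_Aa, p_AA, threshold):
    """Call genotype 0, 1 or 2 if its probability exceeds the threshold.

    Hypothetical helper for illustration only.
    """
    # Collapse the two heterozygous phases into a single genotype state.
    probs = [p_aa, p_aA + p_Aa, p_AA]
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] > threshold else MISSING


confident = call_genotype(0.95, 0.02, 0.02, 0.01, threshold=0.9)   # called
uncertain = call_genotype(0.50, 0.20, 0.20, 0.10, threshold=0.9)   # not called
best_guess = call_genotype(0.50, 0.20, 0.20, 0.10, threshold=0.3)  # called
```

With three genotype states the most likely state always carries at least one third of the probability mass, which is why a threshold below ``0.33`` yields a call at every locus.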

The order in which individuals are output can be changed with the ``-writekey`` option, which writes individuals in the order in which they appear in the corresponding input file. The ``-onlykeyed`` option suppresses the output of dummy individuals (not recommended for hybrid peeling).

The parameter ``-iothreads`` controls the number of threads/processes used by |Software|. |Software| uses additional threads to parse and format input and output files. Setting this option to a value greater than 1 is only recommended for very large files (i.e. >10,000 individuals).
The argument ``-iothreads`` controls the number of threads/processes used by |Software|. |Software| uses additional threads to parse and format input and output files. Setting this option to a value greater than 1 is only recommended for very large files (i.e. >10,000 individuals).

Peeling arguments
------------------

Peeling arguments:
------------------------
::

Mandatory peeling arguments:
@@ -121,15 +107,15 @@ Peeling arguments:

``-runtype`` controls whether the program is run in "single-locus" or "multi-locus" mode. Single-locus mode does not use linkage information to perform imputation. It is fast, but not very accurate. Multi-locus mode runs multi-locus iterative peeling, which uses linkage information to increase accuracy and calculate segregation values.

For hybrid peeling, where a large amount (millions of segregating sites) of sequence data needs to be imputed, first run the program in multi-locus mode to generate a segregation file, and then run the program in single-locus mode with a known segregation file.
For hybrid peeling, where a large amount (millions of segregating sites) of sequence allele read counts needs to be imputed, first run the program in multi-locus mode to generate a segregation file, and then run the program in single-locus mode with a known segregation file.

The ``-error``, ``-seqerror`` and ``-length`` arguments control some of the parameters used in the model. ``-seqerror`` must not be zero. |Software| is robust to deviations in genotyping error rate and sequencing error rate so it is not recommended to use these options unless large deviations from the default are known. Changing the ``-length`` argument to match the genetic map length can increase accuracy in some situations.

The ``-esterrors`` option estimates the genotyping error rate based on observed information. This option is generally not necessary and can increase runtime. ``-estmaf`` estimates the minor allele frequency after each peeling cycle. This option can be useful if there are a large number of non-genotyped founders.


Hybrid peeling arguments
-----------------------------
------------------------

::

Single locus arguments:
@@ -139,8 +125,7 @@ Hybrid peeling arguments
peeling.
-segfile SEGFILE A segregation file for hybrid peeling.

In order to run hybrid peeling the user needs to supply a ``-mapfile`` which gives the genetic positions for the SNPs in the sequence data supplied, a ``-segmapfile`` which gives the genetic position for the SNPs in the segregation file, and a ``-segfile`` which gives the segregation values generated via multi-locus iterative peeling. These arguments are not required for running in multi-locus mode.

In order to run hybrid peeling the user needs to supply a ``-mapfile`` which gives the genetic positions for the SNPs in the sequence allele read counts data supplied, a ``-segmapfile`` which gives the genetic position for the SNPs in the segregation file, and a ``-segfile`` which gives the segregation values generated via multi-locus iterative peeling. These arguments are not required for running in multi-locus mode.

============
File formats
@@ -149,58 +134,65 @@ File formats
Input file formats
------------------

Pedigree file
=============

Each line of a pedigree file has three values, the individual's id, their father's id, and their mother's id. "0" represents an unknown id.

Example:

::

id1 0 0
id2 0 0
id3 id1 id2
id4 id1 id2
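
A minimal sketch of reading this format is shown below. The function ``read_pedigree`` is a hypothetical helper, not part of |Software|; it assumes whitespace-separated fields and maps the unknown-parent code "0" to ``None``.

```python
def read_pedigree(lines):
    """Parse pedigree lines into {id: (father_id, mother_id)}.

    Hypothetical helper for illustration; "0" (unknown parent) becomes None.
    """
    pedigree = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 values per line, got {len(fields)}")
        ind, father, mother = fields
        pedigree[ind] = (
            None if father == "0" else father,
            None if mother == "0" else mother,
        )
    return pedigree


example = ["id1 0 0", "id2 0 0", "id3 id1 id2", "id4 id1 id2"]
pedigree = read_pedigree(example)
# Founders are individuals with both parents unknown.
founders = [i for i, (f, m) in pedigree.items() if f is None and m is None]
```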

Genotype file
=============

Genotype files contain the input genotypes for each individual. The first value in each line is the individual's id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing). The following examples gives the genotypes for four individuals genotyped on four markers each.

Example: ::
Example:

::

id1 0 2 9 0
id2 1 1 1 1
id3 2 0 2 0
id4 0 2 1 0
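
The layout above can be read with a few lines of Python. This is a sketch under the assumption of whitespace-separated values; ``read_genotypes`` is a hypothetical name and not a function provided by |Software|.

```python
VALID_CODES = {0, 1, 2, 9}  # 9 marks a missing genotype


def read_genotypes(lines):
    """Parse genotype lines into {id: [genotype codes]} (illustrative only)."""
    genotypes = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ind, values = fields[0], [int(v) for v in fields[1:]]
        if any(v not in VALID_CODES for v in values):
            raise ValueError(f"unexpected genotype code for {ind}")
        genotypes[ind] = values
    return genotypes


example = ["id1 0 2 9 0", "id2 1 1 1 1", "id3 2 0 2 0", "id4 0 2 1 0"]
genotypes = read_genotypes(example)
# Count missing genotypes per individual.
missing = {ind: values.count(9) for ind, values in genotypes.items()}
```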

Sequence file
=============
Sequence allele read counts file
================================

The sequence allele read counts file has two lines for each individual. The first line gives the individual's id and read counts for the reference allele. The second line gives the individual's id and allele read counts for the alternative allele.

The sequence data file is in a similar Sequence data is given in a similar format to the genotype data. For each individual there are two lines. The first line gives the individual's id and the read counts for the reference allele. The second line gives the individual's id and the read counts for the alternative allele.
Example:

Example: ::
::

id1 4 0 0 7 # Reference allele for id1
id1 0 3 0 0 # Alternative allele for id2
id1 0 3 0 0 # Alternative allele for id1
id2 1 3 4 3
id2 1 1 6 2
id3 0 3 0 1
id3 5 0 2 0
id4 2 0 6 7
id4 0 7 7 0
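
Because each individual occupies two consecutive lines, a reader has to pair them up. The sketch below shows one way to do this; ``read_read_counts`` is a hypothetical helper, and it assumes plain whitespace-separated lines without the explanatory ``#`` annotations shown in the example.

```python
def read_read_counts(lines):
    """Parse paired lines into {id: [(ref_count, alt_count), ...]}.

    Hypothetical helper for illustration only.
    """
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) % 2 != 0:
        raise ValueError("expected two lines per individual")
    counts = {}
    # Even rows hold reference counts, odd rows hold alternative counts.
    for ref_row, alt_row in zip(rows[0::2], rows[1::2]):
        if ref_row[0] != alt_row[0]:
            raise ValueError(f"id mismatch: {ref_row[0]} vs {alt_row[0]}")
        ref = [int(v) for v in ref_row[1:]]
        alt = [int(v) for v in alt_row[1:]]
        if len(ref) != len(alt):
            raise ValueError(f"unequal number of markers for {ref_row[0]}")
        counts[ref_row[0]] = list(zip(ref, alt))
    return counts


example = ["id1 4 0 0 7", "id1 0 3 0 0", "id2 1 3 4 3", "id2 1 1 6 2"]
counts = read_read_counts(example)
```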

Pedigree file
=============

Each line of a pedigree file has three values, the individual's id, their father's id, and their mother's id. "0" represents an unknown id.

Example: ::

id1 0 0
id2 0 0
id3 id1 id2
id4 id1 id2

Binary plink file
=================

|Software| supports the use of binary plink files using the package ``AlphaPlinkPython``. |Software| will use the pedigree supplied by the ``.fam`` file if a pedigree file is not supplied. Otherwise the pedigree file will be used and the ``.fam`` file will be ignored.

Binary Plink files are supported using the package ``AlphaPlinkPython``. The pedigree supplied by the ``.fam`` file will be used if a pedigree file is not supplied. Otherwise, the pedigree file will be used and the ``.fam`` file will be ignored.

Map file
========

The map file gives the chromosome number and the marker name and the base pair position for each marker in two columns. |Software| needs to be run with all of the markers on the same chromosome.
The map file gives the chromosome number, the marker name, and the base pair position for each marker in three columns. Only markers on one chromosome should be provided!

Example:

Example: ::
::

1 snp_a 12483939
1 snp_b 192152913
@@ -216,7 +208,9 @@ Phase file

The phase file gives the phased haplotypes (either 0 or 1) for each individual in two lines. For individuals where we can determine the haplotype of origin, the first line will provide information on the paternal haplotype, and the second line will provide information on the maternal haplotype.

Example: ::
Example:

::

id1 0 1 9 0 # Paternal haplotype
id1 0 1 9 0 # Maternal haplotype
@@ -232,7 +226,9 @@ Genotype probability file

The haplotype file (*.haps*) provides the (phased) allele probabilities for each locus. There are four lines per individual containing the allele probability for the (aa, aA, Aa, AA) alleles where the paternal allele is listed first, and where *a* is the reference (or major) allele and *A* is the alternative (or minor) allele.

Example: ::
Example:

::

id1 0.9998 0.0001 0.0001 1.0000
id1 0.0000 0.4999 0.4999 0.0000
@@ -256,14 +252,15 @@ Dosage file

The dosage file gives the expected allele dosage for the alternative (or minor) allele for each individual. The first value in each line is the individual ID. The remaining values are the allele dosages at each locus. These values will be between 0 and 2.

Example: ::
Example:

::

1 0.0003 1.0000 1.0000 0.0001
2 1.0000 0.0000 1.0000 0.0000
3 0.0003 1.0000 1.0000 0.0001
4 0.0000 0.0000 2.0000 0.0000
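
An expected dosage is the probability-weighted count of alternative alleles. The sketch below computes it from the four phased genotype probabilities (aa, aA, Aa, AA) used in the genotype probability file; ``allele_dosage`` is a hypothetical helper and the input values are made up for illustration, not taken from |Software| output.

```python
def allele_dosage(p_aa, p_aA, p_Aa, p_AA):
    """Expected number of alternative (A) alleles at one locus.

    Hypothetical helper: aa carries 0 copies, aA and Aa carry 1, AA carries 2.
    """
    return 0.0 * p_aa + 1.0 * (p_aA + p_Aa) + 2.0 * p_AA


homozygous_ref = allele_dosage(1.0, 0.0, 0.0, 0.0)
heterozygous = allele_dosage(0.0, 0.5, 0.5, 0.0)
homozygous_alt = allele_dosage(0.0, 0.0, 0.0, 1.0)
```

Because the probabilities sum to one, the result always lies between 0 and 2.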


Segregation file
================

@@ -274,7 +271,9 @@ The segregation file gives the joint probability of each pattern of inheritance.
3. the grand **maternal** allele from the father and the grand **paternal** allele from the mother
4. the grand **maternal** allele from the father and the grand **maternal** allele from the mother

Example: ::
Example:

::

id1 1.0000 0.9288 0.9583 0.9834
id1 0.0000 0.0149 0.0000 0.0000
@@ -299,6 +298,7 @@ Model parameter files
|Software| outputs three parameter files, ``.maf``, ``.seqError``, ``.genoError``. These give the minor allele frequency, sequencing error rates, and genotyping error rates used. All three files contain a single column with an entry for each marker.

Example ``.maf`` file for four loci:

::

0.468005
