GBLUP in sgkit #824
Unanswered
timothymillar
asked this question in
Ideas
Replies: 1 comment
-
Related to https://github.com/pystatgen/sgkit/issues/279. While doing some reading on this topic I encountered Evaluation of GBLUP, BayesB and elastic net for genomic prediction in Chinese Simmental beef cattle (2019) which has PLINK files (bed/bim/fam) on Dryad: https://datadryad.org/stash/dataset/doi:10.5061/dryad.4qc06. Unfortunately when I looked into the |
-
Hi all,
I want to start a discussion on genomic prediction, particularly on genomic best linear unbiased prediction (GBLUP). GBLUP is used widely in plant/animal breeding and could attract users from those fields. I'm new to these methods so any corrections are welcome!
Background
Genomic prediction is used to speed up breeding programs by enabling the selection/removal of genotypes without needing to assess their phenotype. This is particularly valuable in slower growing species where phenotyping requires growing an animal or plant to maturity. BLUP methods use breeding values (BV) for a subset of organisms to calculate estimated breeding values (EBV) of other individuals based upon relatedness. A breeding value is essentially a phenotype score (e.g. fruit size, disease tolerance etc.) assigned to a genotype.
The original BLUP method was based on relationship matrices estimated from pedigree structure (often referred to as the ‘A’ matrix) (see Henderson 1984). The issue with pedigree BLUP methods is that pedigree estimation of relatedness gives the expected relatedness rather than the realized relatedness. Realized relatedness varies around the expectation due to Mendelian sampling. Furthermore, pedigree based estimation does not account for cryptic relatedness among founders (i.e. relatedness predating the recorded pedigree).
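To make the pedigree-based A matrix a bit more concrete, here is a minimal sketch of the classic tabular method in NumPy. The function name `pedigree_a_matrix` and the input encoding (integer parent indices, `-1` for an unknown parent, parents listed before their offspring) are my own assumptions for illustration and not an existing sgkit API.

```python
import numpy as np


def pedigree_a_matrix(sire, dam):
    """Additive relationship (A) matrix via the tabular method.

    Assumptions (hypothetical encoding): sire[i] and dam[i] are integer
    indices of the parents of individual i, -1 means unknown, and every
    parent appears before its offspring.
    """
    n = len(sire)
    A = np.zeros((n, n))
    for i in range(n):
        s, d = sire[i], dam[i]
        # Diagonal is 1 + inbreeding coefficient, i.e. 1 + 0.5 * A[sire, dam].
        A[i, i] = 1.0 + (0.5 * A[s, d] if s >= 0 and d >= 0 else 0.0)
        for j in range(i):
            # Relationship with an earlier individual is the average of
            # that individual's relationship with each known parent.
            a = 0.0
            if s >= 0:
                a += 0.5 * A[j, s]
            if d >= 0:
                a += 0.5 * A[j, d]
            A[i, j] = A[j, i] = a
    return A


# Two unrelated founders (0, 1), two full sibs (2, 3) and their offspring (4).
sire = [-1, -1, 0, 0, 2]
dam = [-1, -1, 1, 1, 3]
A = pedigree_a_matrix(sire, dam)
print(A[2, 3])  # expected relatedness of the full sibs: 0.5
print(A[4, 4])  # 1 + inbreeding of the sib-mating offspring: 1.25
```

Note that the full sibs always get exactly 0.5 here, which is the "expected relatedness" limitation described above: a pedigree cannot tell us that one pair of sibs happens to share more of their genome than another.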
GBLUP methods replace the A matrix with a genomic relationship matrix (GRM or G matrix) which is estimated from marker data (VanRaden 2008). One of the issues that arises with GBLUP is that genomic data is often not available for all phenotyped individuals. Likewise, it may not be possible to directly measure breeding values for some genotyped individuals (e.g. male plants don’t produce fruit to phenotype). These limitations led to the development of ‘two step’ (tsGBLUP) methods which involve de-regressed EBVs (I’m not familiar with the details of this process).
An alternative to tsGBLUP is the single step GBLUP (ssGBLUP) (Aguilar et al 2010).
In ssGBLUP an ‘H’ matrix is calculated from the A matrix across all individuals and the G matrix for the subset of individuals with marker data. The H matrix essentially combines the available pedigree and marker data into a single relationship matrix. The R package AGHmatrix provides a variety of options for calculating both the G and H matrix.
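As I understand it, the usual textbook form of H keeps G for the genotyped individuals and propagates the difference between G and the matching block of A to the ungenotyped individuals through the pedigree. Below is a minimal NumPy sketch of that construction. The function name `h_matrix` and its arguments are hypothetical, it assumes G is symmetric and on the same scale as A, and it is not necessarily how AGHmatrix implements any of its options.

```python
import numpy as np


def h_matrix(A, G, genotyped):
    """Combine pedigree (A) and genomic (G) relationships into H.

    Assumptions (hypothetical interface): A covers all individuals, G covers
    only the individuals whose indices into A are listed in `genotyped`
    (in the same order as the rows of G), and G is symmetric.
    """
    A = np.asarray(A, dtype=float)
    G = np.asarray(G, dtype=float)
    g = np.asarray(genotyped)
    u = np.setdiff1d(np.arange(A.shape[0]), g)  # ungenotyped individuals
    A22 = A[np.ix_(g, g)]
    # Regression of ungenotyped on genotyped individuals: A12 @ inv(A22).
    P = A[np.ix_(u, g)] @ np.linalg.inv(A22)
    H = np.empty_like(A)
    H[np.ix_(g, g)] = G
    H[np.ix_(u, g)] = P @ G
    H[np.ix_(g, u)] = (P @ G).T
    H[np.ix_(u, u)] = A[np.ix_(u, u)] + P @ (G - A22) @ P.T
    return H
```

A useful sanity check is that `h_matrix` returns A unchanged when G equals the genotyped block of A. The inverse also has a convenient form: it equals the inverse of A plus `inv(G) - inv(A22)` added to the genotyped block, which is what makes ssGBLUP practical in mixed model software.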
After calculating a suitable relationship matrix (A, G or H), breeding values are estimated using a mixed model which can account for fixed effects such as sex. There’s a nice GBLUP tutorial on Rpubs which goes over the matrices in detail.
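For a sense of what the mixed model step involves, here is a minimal sketch that solves Henderson's mixed model equations with NumPy for the model y = Xb + Zu + e. It assumes the ratio of residual to additive genetic variance (`lam`) is already known, whereas in practice the variance components would be estimated (e.g. by REML). It also assumes the relationship matrix `K` (A, G or H) is invertible. The function name `gblup_mme` is hypothetical.

```python
import numpy as np


def gblup_mme(y, X, Z, K, lam):
    """Solve Henderson's mixed model equations for y = Xb + Zu + e.

    Assumptions: u ~ N(0, K * var_u), e ~ N(0, I * var_e) and
    lam = var_e / var_u is known. K must be invertible.

    y: (n_obs,) phenotypes
    X: (n_obs, n_fixed) fixed effect design matrix (e.g. intercept, sex)
    Z: (n_obs, n_ind) incidence matrix linking observations to individuals
    K: (n_ind, n_ind) relationship matrix
    """
    K_inv = np.linalg.inv(K)
    XtZ = X.T @ Z
    lhs = np.block([
        [X.T @ X, XtZ],
        [XtZ.T, Z.T @ Z + lam * K_inv],
    ])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    solution = np.linalg.solve(lhs, rhs)
    n_fixed = X.shape[1]
    # Fixed effect estimates, then EBVs for every individual in K.
    return solution[:n_fixed], solution[n_fixed:]
```

Individuals without a phenotype simply have an all-zero column in `Z`. They still receive an EBV because `K` ties them to their phenotyped relatives, which is the whole point of the method. One caveat: a G matrix centered with sample allele frequencies is singular, so it would need some adjustment (e.g. adding a small value to the diagonal) before it can be inverted like this.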
There are a variety of alternatives to GBLUP in the literature (see Tan 2017 for a comparison). One of these methods is rrBLUP (sometimes called SNP BLUP) which uses ridge regression to estimate breeding values from marker data directly. rrBLUP is equivalent to GBLUP under normal circumstances (i.e. when the GRM is computed from the same markers) and doesn’t require computation (or inversion) of the GRM. See the rrBLUP package documentation and related publication for more detail (Endelman 2011).
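The equivalence can be checked numerically. The sketch below uses simulated data and treats the phenotypes as already corrected for fixed effects (they are simply mean centered), with a made-up variance ratio `lam_g`. All of the names are illustrative. The key detail is that the ridge penalty on marker effects is the GBLUP variance ratio multiplied by the scaling constant used in the GRM.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_markers = 20, 50

# Simulated diploid dosages (0/1/2), centered by twice the allele frequency.
dosage = rng.integers(0, 3, size=(n_samples, n_markers)).astype(float)
p = dosage.mean(axis=0) / 2.0
Z = dosage - 2.0 * p
c = 2.0 * np.sum(p * (1.0 - p))  # scaling constant of the GRM

# Simulated phenotypes, assumed to be pre-corrected for fixed effects.
y = rng.normal(size=n_samples)
y -= y.mean()

lam_g = 1.5  # assumed ratio var_e / var_g for GBLUP

# rrBLUP: ridge regression on markers, penalty rescaled by c.
lam_m = c * lam_g
marker_effects = np.linalg.solve(Z.T @ Z + lam_m * np.eye(n_markers), Z.T @ y)
gebv_rr = Z @ marker_effects

# GBLUP: same predictions from the GRM, without ever inverting G itself.
G = Z @ Z.T / c
gebv_g = G @ np.linalg.solve(G + lam_g * np.eye(n_samples), y)

print(np.allclose(gebv_rr, gebv_g))
```

Which form is cheaper depends on the shape of the data: the ridge system is markers by markers while the GBLUP system is samples by samples.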
Possible features
Calculation of the A matrix from pedigree data: This ties into Pedigree data in sgkit #786.
Calculation of the G matrix via the VanRaden method: We already have two methods that could be adapted to calculate the GRM (pc-relate and Weir-Goudet beta), but the VanRaden method is widely used in the literature and has also been generalized to autopolyploids.
Calculation of the H matrix: There are a couple of options for this and I would look to AGHmatrix for proven methods.
GBLUP mixed model: This is normally solved by residual maximum likelihood (REML).
rrBLUP: This may be a better fit than GBLUP for sgkit's architecture(?) and can possibly reuse the regenie code or dask-ml.
I think it would be best to look into the VanRaden GRM and GBLUP or rrBLUP as a starting point.
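As a rough illustration of that starting point, here is a minimal NumPy sketch of what I understand to be the first VanRaden (2008) estimator for diploids. It assumes a complete dosage matrix with no missing calls and estimates allele frequencies from the sample itself. The function name `vanraden_grm` is hypothetical and this is not an existing sgkit function.

```python
import numpy as np


def vanraden_grm(dosage):
    """Genomic relationship matrix, first VanRaden (2008) estimator.

    Assumptions: diploid data, dosage is an (n_samples, n_variants) array of
    alternate allele counts (0/1/2) with no missing values, and allele
    frequencies are estimated from these same samples.
    """
    dosage = np.asarray(dosage, dtype=float)
    p = dosage.mean(axis=0) / 2.0          # alternate allele frequencies
    Z = dosage - 2.0 * p                   # center each variant
    denom = 2.0 * np.sum(p * (1.0 - p))    # expected variance summed over variants
    return Z @ Z.T / denom
```

A real implementation would need to handle missing genotypes, externally supplied allele frequencies and chunked (dask) arrays, and the autopolyploid generalization would need a ploidy-aware centering and scaling.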