Introduction
Genetic connectedness quantifies the extent of linkage between individuals across management units. The concept of genetic connectedness can be extended to measure the connectedness level between training and testing sets in whole-genome prediction. We developed the genetic connectedness analysis R package, GCA, which utilizes pedigree or genomic data to measure the connectedness between individuals across units.
Connectedness Statistics
The connectedness statistics implemented in this R package can be classified into two core functions: prediction error variance (PEV) and variance of unit effect estimates (VE). The PEV-derived metrics include prediction error variance of differences (PEVD), coefficient of determination (CD), and prediction error correlation (r). These PEV-derived metrics can be summarized at the unit level as the average PEV within and across units (GrpAve), average PEV of all pairwise differences between individuals across units (IdAve), or using a contrast vector (Contrast). VE-derived metrics include variance of differences in management unit effects (VED), coefficient of determination of VED (CDVED), and connectedness rating (CR). Three correction factors accounting for the number of fixed effects can be applied for each VE-derived metric. These include no correction (0), correcting for one fixed effect (1), and correcting for two or more fixed effects (2). The R code is integrated with C++ to improve computational speed using the Rcpp package [1]. We expect this R package provides a comprehensive and effective tool for genetic connectedness analysis and whole-genome prediction. A comprehensive list of all connectedness statistics implemented in this R package can be summarized as following. The details of these connectedness statistics can be found in the paper [2].
Application of the GCA package
Installing the GCA package from GitHub
install.packages("devtools")
library(devtools)
install_github('uf-aiaos/GCA')
Load all packages used in this vignette.
library(BGLR)
library(corrplot)
library(GCA)
library(ggplot2)
library(gridExtra)
Data preparation
The simulated dataset GCcattle
contains two objects
cattle.pheno
and cattle.W
, which include
phenotypic and marker information, respectively. The details can be
obtained by typing ?GCcattle
.
data(GCcattle)
dim(cattle.pheno) # phenotypes + fixed effects
## [1] 2500 6
dim(cattle.W) # marker matrix
## [1] 2500 10000
The heritability of simulated phenotype was set to 0.6 with \(\sigma^2_a\) = 0.6 and \(\sigma^2_e\) = 0.4.
<- 0.6 # additive genetic variance
sigma2a <- 0.4 # residual variance sigma2e
Below we construct the incidence matrix of fixed effects and a genomic relationship matrix.
<- model.matrix(~ -1 + factor(cattle.pheno$Unit) + factor(cattle.pheno$Sex)) # incidence matrix of unit effect and sex
X2 <- computeG(cattle.W, maf = 0.05) # genomic relationship matrix; markers with minor allele frequency (maf) G
## Genomic relationship matrix has been computed. Number of SNPs removed: 99
# less than 0.05 are removed
Available connectedness statistics in the GCA package
The following section lists all available connectedness statistics in
the GCA package, which are available by setting the argument of
statistic
.
- PEVD_IdAve : Pairwise individual average PEVD, the optional argument of ‘scale’ is available.
- PEVD_GrpAve : Group average PEVD, the optional arguments of ‘scale’ and ‘diag’ are available.
- PEVD_contrast: Contrast PEVD, the optional argument of ‘scale’ is available.
- CD_IdAve : Pairwise individual CD.
- CD_GrpAve : Group average CD, the optional argument of ‘diag’ is available.
- CD_contrast : Contrast CD.
- r_IdAve : Pairwise individual r.
- r_GrpAve : Group average r, the optional argument of ‘diag’ is available.
- r_contrast : Contrast r.
- VED0 : Variance of estimate of unit effects differences. The optional argument of ‘scale’ is available.
- VED1 : Variance of estimate of unit effects differences with the correction of unit effect. The optional argument of ‘scale’ is available.
- VED2 : Variance of estimate of unit effects differences with the correction of unit effect and additional fixed effects. The additional argument of ‘Uidx’ is required to inform the last column number of units in the incidence matrix corresponding to fixed effects and the optional argument of ‘scale’ is available.
- CDVED0 : Coefficient of determination of VED, the optional argument of ‘diag’ is available.
- CDVED1 : Coefficient of determination of VED with the correction of unit effect. The optional argument of ‘diag’ is available.
- CDVED2 : Coefficient of determination of VED with the correction of unit effect and additional fixed effects. The additional argument of ‘Uidx’ is required to inform the last column number of units in the incidence matrix corresponding to fixed effects and the optional argument of ‘diag’ is available.
- CR0 : Connectedness rating.
- CR1 : Connectedness rating with the correction of unit effect.
- CR2 : Connectedness rating with the correction of unit effect and additional fixed effects. The additional argument of ‘Uidx’ is required to inform the last column number of units in the incidence matrix corresponding to fixed effects.
Examples of connectedness measures across units
Below we list some examples of connectedness statistics available in
the GCA package. The gca
function is the main engine in the
GCA package. The usage and detailed arguments of the gca function are
explained below.
gca(Kmatrix, Xmatrix, sigma2a, sigma2e, MUScenario, statistic, NumofMU,
Uidx = NULL, scale = TRUE, diag = TRUE)
- Kmatrix Genetic relationship matrix constructed from either pedigree or genomics.
- Xmatrix: Fixed effects incidence matrix excluding intercept. The first column of the Xmatrix should start with unit effects followed by other fixed effects if applicable.
- sigma2a and sigma2e: Estimates of additive genetic and residual variances, respectively.
- MUScenario: A vector of fixed factor units.
- statistic: Choice of connectedness statistic. Available options
include
- PEV-derived functions: PEVD_IdAve, PEVD_GrpAve, PEVD_contrast, CD_IdAve, CD_GrpAve, CD_contrast, r_IdAve, r_GrpAve, and r_contrast.
- VE-derived functions: VED0, VED1, VED2, CDVED0, CDVED1, CDVED2, CR0, CR1, and CR2.
- NumofMU: Return either pairwise unit connectedness (Pairwise) or
overall connectedness across all units (Overall).
- Uidx: An integer indicating the last column number of units in the Xmatrix. This Uidx is required for VED2, CDVED2, and CR2 statistics. The default is NULL.
- scale (logical): Should sigma2a be used to scale statistic (i.e., PEVD_IdAve, PEVD_GrpAve, PEVD_contrast, VED0, VED1, and VED2) so that connectedness is independent of measurement unit? The default is TRUE.
- diag (logical): Should the diagonal elements of the PEV matrix (i.e., PEVD_GrpAve, CD_GrpAve, and r_GrpAve) or the K matrix (CDVED0, CDVED1, and CDVED2) be included? The default is TRUE.
PEV-derived statistic: PEVD_GrpAve (pairwise vs. overall)
The following example illustrates the group average PEVD between all units. Based on the results, units 1 and 8 are the most connected (PEVD_GrpAve = 0.0156) and the least connected units are units 4 and 6 (PEVD_GrpAve = 0.0571). The PEVD statistic ranges from 0 to 1, with the smaller value indicates more connectedness.
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
PEVD_GrpAve MUScenario = as.factor(cattle.pheno$Unit), statistic = 'PEVD_GrpAve',
NumofMU = 'Pairwise')
# remove NAs in diagnol to make plot
diag(PEVD_GrpAve) <- 0
corrplot(PEVD_GrpAve, is.corr = FALSE, method ="circle", type = "upper",
diag = F, number.cex = 7 / ncol(PEVD_GrpAve), col = cm.colors(10), cl.lim = c(0, 0.08),
number.digits = 4, mar = c(0,1,1,1), addCoef.col = "black",
tl.col = "black", tl.srt = 90, tl.cex = 1.2)
Alternatively, the gca
function can return an overall
connectedness, which averages all pairwise PEVD_GrpAve between units by
setting NumofMU
to 'Overall'
.
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a,
PEVD_GrpAve_Overall sigma2e = sigma2e, MUScenario = as.factor(cattle.pheno$Unit),
statistic = 'PEVD_GrpAve', NumofMU = 'Overall')
PEVD_GrpAve_Overall
## [1] 0.03752627
Following the above example, the pairwise individual average PEVD and
contrast PEVD can be easily calculated by changing the argument
statistic
to 'PEVD_IdAve'
and
'PEVD_contrast'
, respectively.
PEV-derived statistic: CD_GrpAve
The group average CD statistic between units is shown in the
following example. The most connected units was found between units 1
and 7 (CD_GrpAve = 0.7096). On the other hand, the least connected
design was found between units 1 and 6 (CD_GrpAve = 0.6494). The larger
CD statistics indicates the greater connectedness. These results are
different from the findings according to PEVD_GrpAve, because CD
penalizes connectedness measures for reduced genetic variability [3, 4]. Similarly, the pairwise individual
average CD and contrast CD can be calculated by changing the argument
statistic
to 'CD_IdAve'
and
'CD_contrast'
, respectively.
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
CD_GrpAve MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CD_GrpAve',
NumofMU = 'Pairwise')
# replace NAs in diagnol to make plot
diag(CD_GrpAve) <- min(CD_GrpAve, na.rm = T)
corrplot(CD_GrpAve, is.corr = FALSE, method ="circle", type = "upper",
diag = F, number.cex = 7 / ncol(CD_GrpAve), col = cm.colors(10),
number.digits = 4, mar = c(0,1,1,1), addCoef.col = "black",
tl.col = "black", tl.srt = 90, tl.cex = 1.2, cl.lim = c(0.5, 0.8))
VE-derived statistic: VED
The following example illustrates three VED statistics of VED0, VED1, and VED2 when sex and unit effects are both included in the model. Here, the smaller value indicates the greater connectedness. We can see that the most connected units across three VED statistics are found between units 1 and 8 (VED0 = 0.0205, VED1 = 0.0156, VED2 = 0.0156). On the other hand, units 4 and 6 (VED0 = 0.0728, VED1 = 0.0571, VED2 = 0.0571) shows the least connectedness.
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
VED0 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'VED0',
NumofMU = 'Pairwise')
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
VED1 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'VED1',
NumofMU = 'Pairwise')
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
VED2 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'VED2',
NumofMU = 'Pairwise', Uidx = 8)
# replace NAs in diagnol to make plot
diag(VED0) <- diag(VED1) <- diag(VED2) <- floor(min(VED0, na.rm = T))
corrplot(VED0, is.corr =FALSE, method="circle", type = "upper",
diag = F, number.cex = 7 / ncol(VED0), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0, 0.08))
corrplot(VED1, is.corr =FALSE, method="circle", type = "upper",
diag = F, number.cex = 7 / ncol(VED1), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0, 0.08))
corrplot(VED2, is.corr =FALSE, method="circle", type = "upper",
diag = F, number.cex= 7 / ncol(VED2), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0, 0.08))
VE-derived statistic: CDVED
An example of CDVED statistics (CDVED0, CDVED1, and CDVED2) is shown below. The most connected units are found between units 1 and 7 (CDVED0 = 0.6284, CDVED1 = 0.7098, CDVED2 = 0.7096), whereas units 1 and 6 (CDVED0 = 0.5501, CDVED1 = 0.6494, CDVED2 = 0.6494) have the least connectedness.
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
CDVED0 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CDVED0',
NumofMU = 'Pairwise')
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
CDVED1 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CDVED1',
NumofMU = 'Pairwise')
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
CDVED2 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CDVED2',
NumofMU = 'Pairwise', Uidx = 8)
# replace NAs in diagnol to make plot
diag(CDVED0) <- diag(CDVED1) <- diag(CDVED2) <- 0.5
corrplot(CDVED0, is.corr = FALSE, method ="circle", type = "upper",
diag = F, number.cex = 7 / ncol(CDVED0), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0.5, 0.8))
corrplot(CDVED1, is.corr = FALSE, method ="circle", type = "upper",
diag = F, number.cex = 7 / ncol(CDVED1), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0.5, 0.8))
corrplot(CDVED2, is.corr = FALSE, method ="circle", type = "upper",
diag = F, number.cex = 7 / ncol(CDVED2), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0.5, 0.8))
Relationship between connectedness statistics
In this section, we calculate the correlation between PEV-based and VE-based connectedness statistics to compared with the results reported in Holmes et al. [5].
PEVD vs. VED0, VED1 and VED2
Using the statistics shown in the previous section, the relationships between PEVD_GrPAve and VED0, VED1, and VED2 are visualized via a correlation plot. The statistic PEVD showed a strong relationship with three VED statistics. The strongest relationship (a perfect positive correlation) showed between PEVD and VED2 when two fixed effects of unit effect and sex are corrected. All of these results are consistent with the results claimed in Holmes et al. [5].
<- data.frame(PEVD_GrpAve = PEVD_GrpAve[upper.tri(PEVD_GrpAve)],
df_PEVD_VED VED0 = VED0[upper.tri(VED0)],
VED1 = VED1[upper.tri(VED1)],
VED2 = VED2[upper.tri(VED2)])
corrplot(cor(df_PEVD_VED), method="circle", type = "upper",
diag = F, number.cex= 4 / ncol(df_PEVD_VED), col = cm.colors(10),
number.digits = 6, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0.8, 1))
CD vs. CDVED0, CDVED1, and CDVED2
The CD_GrpAve statistic is compared with statistics of CDVED0, CDVED1, and CDVED2 in the following example when unit effect and sex are included in the model.
<- data.frame(CD_GrpAve = CD_GrpAve[upper.tri(CD_GrpAve)],
df_CD_CDVED CDVED0 = CDVED0[upper.tri(CDVED0)],
CDVED1 = CDVED1[upper.tri(CDVED1)],
CDVED2 = CDVED2[upper.tri(CDVED2)])
corrplot(cor(df_CD_CDVED), method="circle", type = "upper",
diag = F, number.cex= 4 / ncol(df_CD_CDVED), col = cm.colors(10),
number.digits = 6, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0.8, 1))
The correlation plot above showed the relationships between CD_GrPAve and CDVED0, CDVED1, and CDVED2. As we expected, analogous strong relationships are identified between CD_GrpAve and three CDVED statistics. An increased correlation is found when a correction factor is used to account for the sex and unit effects. A perfect positive correlation is identified between CD_GrpAve and CDVED2, where unit effect and sex are corrected in the model.
r vs. CR0, CR1, and CR2
The following example compared r_GrpAve with statistics of CR0, CR1, and CR2 when unit effect and sex are included in the model.The correlation plot below showed the relationships between r_GrPAve and CR0, CR1, and CR2. Unlike the stronger relationships between PEVD and VED0 and VED1 statistics above, r showed a weaker correlation with CR0 and CR1 statistics as reported in Holmes et al. [5]. An increased correlation between r_GrpAve and CR1 is found when the unit effect is accounted for using the correction factor. The statistic r_GrpAve is found to have a perfect positive correlation with CR2 when unit effect and sex are corrected in the model.
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
r_GrpAve MUScenario = as.factor(cattle.pheno$Unit),
statistic = 'r_GrpAve', NumofMU = 'Pairwise')
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
CR0 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CR0',
NumofMU = 'Pairwise')
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
CR1 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CR1',
NumofMU = 'Pairwise')
<- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
CR2 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CR2',
NumofMU = 'Pairwise', Uidx = 8)
<- data.frame(r_GrpAve = r_GrpAve[upper.tri(r_GrpAve)],
df_r_CR CR0 = CR0[upper.tri(CR0)], CR1 = CR1[upper.tri(CR1)],
CR2 = CR2[upper.tri(CR2)])
corrplot(cor(df_r_CR), method = "circle", type = "upper",
diag = F, number.cex= 4 / ncol(df_r_CR), col = cm.colors(10),
number.digits = 6, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0.8, 1))
The connectedness between training and testing sets in the whole-genome prediction paradigm
In whole-genome prediction, the relatedness between training and testing sets is known to impact the prediction accuracy. In this section, the connectedness and prediction accuracy is estimated for each of 5 simulation scenarios in which we gradually increased the relatedness level between the training and testing sets by exchanging 0, 140, 210, 280, and 350 individuals. This was followed by investigating the relationship between the connectedness measures and prediction accuracies using correlation plots.
Estimate of connectedness
In this section, the connectedness levels of 5 simulated unit
scenarios are estimated using gca
function. Two statistics
of PEVD_GrpAve and CD_GrpAve are used to quantify the connectedness
measures.
# Define a function using gca function in GCA
<- function(MU, Kmatrix, sigma2a, sigma2e, stat){
GC <- model.matrix(~ -1 + factor(MU))
X <- gca(Kmatrix, Xmatrix = X, sigma2a, sigma2e,
GCstat MUScenario = as.factor(MU),
statistic = stat, NumofMU = 'Overall')
return(GCstat)
}# 5 simulated unit scenarios
head(CattleSIM[, 7: 11])
## MUSC1 MUSC2 MUSC3 MUSC4 MUSC5
## 1 1 1 1 2 2
## 2 1 2 2 2 2
## 3 1 1 1 1 1
## 4 1 1 1 1 1
## 5 1 1 1 1 1
## 6 1 1 1 1 1
# PEVD_GrpAve
<- apply(CattleSIM[, 7: 11], 2, function(x) GC(x,
PEVD_GrpAve Kmatrix = G, sigma2a, sigma2e, stat = 'PEVD_GrpAve'))
print(PEVD_GrpAve)
## MUSC1 MUSC2 MUSC3 MUSC4 MUSC5
## 0.0039430647 0.0018066760 0.0013236613 0.0010374891 0.0009192205
# CD_GrpAve
<- apply(CattleSIM[, 7: 11], 2, function(x) GC(x,
CD_GrpAve Kmatrix = G, sigma2a, sigma2e, stat = 'CD_GrpAve'))
print(CD_GrpAve)
## MUSC1 MUSC2 MUSC3 MUSC4 MUSC5
## 0.6679398 0.7586556 0.7626448 0.7503289 0.7140141
Estimate of prediction accuracy
The prediction accuracy for each of 5 simulated unit scenarios is estimated using cross-validation.
# Define a function to estimate PA using BGLR package
<- function(df, MU, Iter, BurnIn){
PA <- length(unique(MU))
folds <- matrix(NA, nrow = folds, ncol = 1)
Cor_CV for(fold in 1:folds){
<- df
y <- y
yNa <- which(MU == fold)
whichNa <- NA
yNa[whichNa] <- BGLR(y = yNa, ETA = ETA_GBLUP, nIter = Iter, burnIn = BurnIn, verbose = F)
fit.GBLUP 1] <- cor(fit.GBLUP$yHat[whichNa], y[whichNa])
Cor_CV[fold,
}return(mean(Cor_CV))
}# Prediction accuracies
<- eigen(G)
EVD <- list(list(V = EVD$vectors, d = EVD$values, model = 'RKHS'))
ETA_GBLUP <- apply(CattleSIM[, 7: 11], 2, function(x) PA(df = CattleSIM$Phen, MU = x, 6000, 2000)) PASum
Relationship between connectedness measures and prediction accuracies
The following correlation plots show the relationship between connectedness and prediction accuracy. The plot on the left indicates a positive correlation between PEVD_GrpAve and prediction accuracy, where an increased correlation is observed as more individuals from the same cluster are shared between the training and testing sets. However, the plot in the right indicates CD_GrpAve increased up to Scenario 3 and then decreased at Scenario 4. This decreasing trend is due to CD penalizes the connectedness measure when the genetic variability among population is reduced as reported by previous studies [3, 4]
<- data.frame("PEVD" = PEVD_GrpAve, 'PA' = PASum,
dfPEVD "Scenario" = paste('S', 1:5, sep = ''))
<- data.frame("CD" = CD_GrpAve, 'PA' = PASum,
dfCD "Scenario" = paste('S', 1:5, sep = ''))
<- ggplot(dfPEVD, aes(PEVD, PA)) + geom_point(aes(shape = factor(Scenario)), color = "blue", size = 3) +
p1 xlab("PEVD") + labs(shape="Scenario")
<- ggplot(dfCD, aes(CD, PA)) + geom_point(aes(shape = factor(Scenario)), color="red", size = 3) +
p2 labs(shape = "Scenario")
grid.arrange(p1, p2, nrow = 1, widths = c(10, 10), heights = 4)