Multiparental populations (MPPs) like the nested association mapping (NAM; McMullen et al. (2009)) or multi-parent advanced generation inter-cross (MAGIC; Cavanagh et al. (2008)) populations have emerged as central genetic resources for research and breeding purposes (Scott et al. 2020). The diallels (Blanc et al. 2006), the factorial designs or the collections of crosses developed in breeding programs (Würschum 2012) can also be considered as MPPs. We can classify MPPs given the use of intrecrossing. In a first type of MPPs, there is no intercrossing. Each genotype inherits its alleles from two parents. Those MPPs can be seen as a collection of crosses (e.g. NAM or factorial design). A second type of MPPs like MAGIC involve genotypes that are a mosaic of more than two parents or founders. In this package, we only consider the first type of MPPs without intercrossing or collections of crosses with shared parents.
Even though several statistical methodologies and software are available to perform QTL detection in MPPs, there is potentially few or no open-source solution for the detection of QTL in MPPs characterized in multiple environments (MPP-ME). In this extension of mppR, we propose a method to detect QTLs in MPP-ME data.
We start from the QTL detection model 3 described in (Garin, Malosetti, and Eeuwijk 2020):
y_icj=μ+ej+ccj+xip∗βpj+ge_icj+ϵ_icj[1]
where y_icj is the genotype mean (BLUE) of genotype i from cross c in environment j. ej is a fixed environment term, ccj a fixed cross within environment interaction, xip is the number of allele copies coming from parent p, and βpj represent the QTL effect of parental allele p in environment j. Different definitions of the QTL allelic effect are possible (Garin et al. 2017). In this version of the model we assumed that each parent carries a different allele and that each of those alleles can have an environmental specific effect. The model is therefore saturated with the estimation of Nenv∗Npar effects with one parent (the most frequently used) set as the reference in all environments (e.g. the central or recurrent parent in NAM).
The term geicj is the residual genetic variation and ϵicj the plot error term. In model 1 because genotype values are averaged over replication ϵicj and geicj are confounded. The variance of the (geicj+ϵicj) term can be modeled with different variance covariance (VCOV) structures. A first possibility is to assume a constant genotypic variance across environments σ2g (compound symmetry - CS) and a unique variance for the plot error term σ2ϵ:
V[yi11yi′21yi12yi′22yi13yi′23]=[σ2g+σ2ϵ0σ2g0σ2g00σ2g+σ2ϵ0σ2g0σ2gσ2g0σ2g+σ2ϵ0σ2g00σ2g0σ2g+σ2ϵ0σ2gσ2g0σ2g0σ2g+σ2ϵ00σ2g0σ2g0σ2g+σ2ϵ]
The CS model requires the estimation of a single term (σ2g) to describe genetic the covariance between all pairs of environments. This model assumes that the background polygenic effect (effect of all genome positions except the tested QTL and cofactors) is the same in all environments (Van Eeuwijk et al. 2001).
A more realistic option called unstructured (UN) model allows the genetic covariance to be different in every pairs of environments. From a genetic point of view, it means that a set of genes have a different effect in each environments (Van Eeuwijk et al. 2001). In the unstructured model, each pair of environment get a specific genetic covariance term cov(y..j,y..j′)=σ2gjj′
V[yi11yi′21yi12yi′22yi13yi′23]=[σ2g1+σ2ϵ0σg120σg1300σ2g1+σ2ϵ0σg120σg13σg120σ2g2+σ2ϵ0σg2300σg120σ2g2+σ2ϵ0σg23σg130σg230σ2g3+σ2ϵ00σg130σg230σ2g3+σ2ϵ]
We considered the UN model as the default option.
The calculation of an ‘exact’ mixed model at each marker (QTL) position of the genome requires a lot of computer power, which can quickly make detection unfeasible, especially with large populations. Therefore, to reduce the computational power needed for a genome scan, we implemented an approximate mixed model computation similar to the generalized least square strategy implemented in Kruijer et al. (2015) or Garin et al. (2017). The procedure consists of estimating a general VCOV (ˆV) using model 1 without the tested QTL position for the simple interval mapping (SIM) scan. For the composite interval mapping (CIM) model, ˆV is estimated including the cofactors as fixed effect without the QTL position. For the CIM scan, there will be as many ˆV as the number of cofactors combinations given cofactor exclusion around the tested position.
There are two option for the estimation of ˆV. In the first option we estimate a unique ˆV taking all cofactors into consideration. In the second option, different ˆV should be formed by removing the cofactor information that is too close of a tested QTL position.
The statistical significance of the tested QTL position and the allelic effects is obtained by substituting ˆV to get the following Wald statistic WQ=βT[V(β)]−1β[2], where β=(XTˆV−1X)−1XTˆV−1y[3], V(β)=(XTˆV−1X)−1, X represents the fixed effect matrix including the cross effect, the cofactors, and the QTL position, and y is the vector of genotypic values. WQ∼χ2d with degree of freedom (d) equal to the number of tested QTL allelic effects ((Npar−1)∗Nenv) for the main QTL term or one for each specific within environment allele).
In model 1, QTL terms are considered as fixed. Therefore, the user
needs to provide best linear unbiased estimates (BLUEs) as trait value.
Beyond estimating a general QTL significance, model 1 allows the
estimation of QTL allelic effect for each parents in each environment.
One parent is set as reference. By default in populations like the
nested association mapping (NAM) populations the central parent is
considered as the reference. For other types of MPPs the reference
parent can be specified using a dedicated function argument
(ref_par
). The parental QTL allelic effect must be
interpreted as the extra effect of one allele copy with respect to the
reference parent in the considered environment.
For QTLs showing a significant QTL by environment (QTLxE) effects, it is possible to investigate which specific dimension of the environment (environmental covariate - EC) explains the QTLxE effect. Therefore, we propose a strategy to estimate the sensitivity of the parental QTL allele to a specific EC (rain, temperature, photoperiod, etc.). For that purpose, we extended the QTL term of model 1 like that xip∗βpj=xip∗(βp+ECe∗Sp+lpe) where βp is the main parental allelic effect across environments, ECe is the EC value in each environment, Sp is the parental allele sensitivity to ECe change and lpe the residual effect. The Sp determine the rate of change of the parental QTL allelic additive effect given an extra unit of EC. The significance of Sp allows to assess the degree of sensitivity of the QTL to the EC. To avoid model saturation, we implemented a procedure that first estimates the QTL parental alleles effect across (main effect term) and between environment (QTLxE term). Then, the function compare the significance of those two terms. The QTL by EC (QTLxEC) is only estimated for the parental allelic effect for which the QTLxE term is more significant than the main effect term.
The function mppGE_proc()
perform the whole
procedure detailed here:
The format of the data is the same as the default format used by
mppR (mppData
objects). The trait
measured in the different environments are simply added as separate
columns of the pheno
argument when you create the
mppData
object with create.mppData()
.
The data used in this vignette (mppData_GE
) come from
the EU-NAM Flint population (Bauer et al.
2013). The genotype data come from five crosses between the
donor parents: D152, F03802, F2, F283, DK105 with the recurrent parent
UH007. We selected a subset of 100 markers spread over chromosomes five
and six. The genetic map was calculated by Giraud
et al. (2014). The phenotypic data represent the
within environment adjusted means for dry matter yield (DMY) calculated
at La Coruna (CIAM), Roggenstein (TUM), Einbeck (KWS), and Ploudaniel
(INRA_P) Lehermeier et al. (2014).
library(mppR)
data(mppData_GE)
design_connectivity(par_per_cross = mppData_GE$par.per.cross)
#> $`1`
#> [1] "UH007" "D152" "F03802" "F2" "F283" "DK105"
To perform the SIM scan, you simply need to pass the
mppData
object to the function. and specify the
environments you want to use in the trait
argument. If
needed, a reference parent can be specified using the
ref_par
argument. For example, here analyse the trait DMY
characterized at CIAM and TUM. After calculation, the SIM profile can be
ploted with plot()
<- mppGE_SIM(mppData = mppData_GE, trait = c('DMY_CIAM', 'DMY_TUM'), ref_par = 'UH007')
SIM plot(x = SIM)
The cofactors as well as the QTLs are selected using the same
function (QTL_select()
). It performs an iterative search
per chromosome starting from the most significant position above the
threshold
value. Then, marker positions falling on the left
and right of the selected position by a distance specified in
window
are excluded. The search continues with the
remaining positions by repeating the selection and exclusion steps. The
search stops when no position remains above the
threshold
.
The default value of window
(50) is deliberately large
to limit the number of included cofactors and not overfit the model. In
the mppGE_proc()
function, the cofactor threshold and
window parameters can be modified through the thre.cof
and
win.cof
arguments.
<- QTL_select(Qprof = SIM, threshold = 4, window = 50) cofactors
To perform the CIM scan, you simply need to pass the selected
cofators to the mppGE_CIM()
function. The
window
parameter specifies the distance on the left and
right where cofactors are excluded in the neighborhood of a tested
position. The VCOV_data
argument allow to specify if the
user want to estimate a unique general VCOV with all cofactors included
(“unique”) or several VCOV removing cofactor in the neighbouring of a
QTL position (“minus_cof”). The unique option is faster.
<- mppGE_CIM(mppData = mppData_GE, trait = c('DMY_CIAM', 'DMY_TUM'),
CIM VCOV = "UN", VCOV_data = "unique", cofactors = cofactors,
window = 20)
The selection of QTLs is done using the same function as the one used
for cofactors. For the selection of QTLs it is possible to restrict the
size of the exclusion window
around the selected positions.
In the mppGE_proc()
function, the QTL threshold and window
parameters can be modified through the thre.QTL
and
win.QTL
arguments.
<- QTL_select(Qprof = SIM, threshold = 4, window = 20) QTL
The function QTL_effect_GE()
allows the estimation of
the QTL allele effects and their significance. It calculates an exact
version of mixed model 1 with single or multiple QTLs. The QTL allele
effects are estimated using [3] and their significance with the Wald
test [2].
The results of the QTL allelic effect estimation are returned in a list where each component represents the effects of one QTL. The individual within environment QTL effects are represented in row while the effects (ˆβ), their standard error, as well as their significance are represented in columns. For example, looking at the first QTL, we can see that in the first environment the the allele of parent DK105 decreases the DMY by -9.861 deciton/ha compared to the central parent UH007. In the second environment, the effect of DK105 is only -1.559. The effect is highly significant in the first environment but not in the second.
<- QTL_effect_GE(mppData = mppData_GE, trait = c('DMY_CIAM', 'DMY_TUM'),
Qeff QTL = QTL)
$QTL_1
Qeff#> Effect Std.dev Wald df p.val sign
#> QTL_1_UH007_E1 NA NA NA NA NA <NA>
#> QTL_1_D152_E1 0.979 3.107 0.12 1 0.7238207754
#> QTL_1_F03802_E1 -4.632 2.265 4.98 1 0.0256716793 *
#> QTL_1_F2_E1 -3.170 3.523 0.97 1 0.3250043320
#> QTL_1_F283_E1 -7.691 2.189 14.75 1 0.0001228353 ***
#> QTL_1_DK105_E1 -9.861 2.615 17.11 1 0.0000352697 ***
#> QTL_1_UH007_E2 NA NA NA NA NA <NA>
#> QTL_1_D152_E2 2.351 3.010 0.77 1 0.3792850449
#> QTL_1_F03802_E2 1.921 2.173 0.93 1 0.3344656579
#> QTL_1_F2_E2 -3.200 3.386 1.07 1 0.3010936767
#> QTL_1_F283_E2 -0.561 2.110 0.08 1 0.7715930671
#> QTL_1_DK105_E2 -1.559 2.513 0.46 1 0.4961859621
A specific parent can be selected as reference using the
ref_par
argument.
<- QTL_effect_GE(mppData = mppData_GE, trait = c('DMY_CIAM', 'DMY_TUM'),
Qeff QTL = QTL, ref_par = 'F2')
$QTL_1
Qeff#> Effect Std.dev Wald df p.val sign
#> QTL_1_UH007_E1 3.170 3.523 8.88 1 0.002882220 **
#> QTL_1_D152_E1 4.149 4.697 2.24 1 0.134327711
#> QTL_1_F03802_E1 -1.462 4.188 0.50 1 0.481184513
#> QTL_1_F2_E1 NA NA NA NA NA <NA>
#> QTL_1_F283_E1 -4.522 4.148 5.10 1 0.023958287 *
#> QTL_1_DK105_E1 -6.691 4.387 7.88 1 0.005002411 **
#> QTL_1_UH007_E2 3.200 3.386 9.78 1 0.001761073 **
#> QTL_1_D152_E2 5.552 4.530 4.31 1 0.037904061 *
#> QTL_1_F03802_E2 5.122 4.023 6.62 1 0.010087175 *
#> QTL_1_F2_E2 NA NA NA NA NA <NA>
#> QTL_1_F283_E2 2.640 3.989 1.87 1 0.171700270
#> QTL_1_DK105_E2 1.641 4.217 0.51 1 0.473716237
The function QTL_R2_GE()
allows the estimation of an
R2 contribution of all the QTLs
(global) and of each QTL positions (partial). For simplicity, the R2 is estimated using a linear model
version of [1] and not a mixed model, which means that the geicj term is absent from the model
and that only a unique variance error term σ2ϵ is estimated. The
R2 values represent therefore
general tendencies and should be taken with caution.
The global adjusted R2 is defined as R2=100∗(1−(RSSfull/dfullRSSred/dred)), where RSSfull and RSSred are the residual sum of squares of the model with and without the QTL positions. dfull and dred represent the degree of freedom of the corresponding models. The partial (adjusted) R2 of QTL i is the difference between the global R2 obtained with all QTLs and the R2 obtained with a model minus QTL i.
<- QTL_R2_GE(mppData = mppData_GE, trait = c('DMY_CIAM', 'DMY_TUM'), QTL = QTL)
QR2
$glb.adj.R2
QR2#> [1] 6.328735
$part.adj.R2.diff
QR2#> Q1 Q2
#> 2.563632 3.582575
The profile plot of the QTL significance can be obtained using the
plot.QTLprof()
function that plots the −log10(p−val) of the position along
the genome.
plot(x = CIM)
It is also possible to visualize the significance of the QTL parental
allele along the by passing the QTL profile information
(CIM
) to plot_Qeff_prof()
function. The
QTL
argument can be used to specify potential QTL
positions.
The color intensity z=−log10(p−val) is proportional to the statistical significance of the parental allelic effects obtained from the Wald test [2]. z has an upper limit of six to not let extreme signficant value influence too much the color scale. Compared to the reference parent (top of the graph) the red (blue) colour represents a negative (positive) effect. The different panels from the top to the bottom represent the different environments.
plot_allele_eff_GE(mppData = mppData_GE, nEnv = 2,
EnvNames = c('CIAM', 'TUM'), Qprof = CIM,
QTL = QTL, text.size = 14)
It is possible to calculate the whole procedure described using the
unique mppGE_proc()
function. This function uses the same
arguments as the one defined in the sub-functions
(e.g. mppData
, trait
, thre.cof
,
win.cof
, window
, ref_par
etc.).
In this function it is possible to specify a working directory in
output.loc
. A folder with the pop.name
and
trait.name
will be created to store intermediate data
output like the SIM and CIM profiles, some results (list of QTLs,
plots), as well as a summary of the results (QTL_REPORT.txt). Here we
can still introduce the argument n.cores
which allow to run
the function in parallel on multiple cores.
<- mppGE_proc(pop.name = 'EUNAM', trait.name = 'DMY',
MPP_GE_QTL mppData = mppData_GE, trait = c('DMY_CIAM', 'DMY_TUM'),
n.cores = 1, verbose = FALSE, output.loc = tempdir())
Once significant QTL effects have been determined, it is possible to
investigate if the parental QTL allelic effect interact with specific
dimension of the environment (environmental covariate - EC) like rain,
temperature, etc. For that purpose you can use the
QTL_effect_QxEC()
. This function first calculates the
signficance of the parental QTL alleles accross environment (main effect
term) and the interaction with the environment (QTLxE term) using the
sub-function QTL_effect_main_QxE()
. Then, it only calculate
the QTL effect sensitivity to the EC for the parental allelic effect
that show a larger significance of the QTLxE term compared to the main
term. To calculate such sensitivity, the user need to provide a list of
QTL
as well as EC
value. The EC value take the
form of a matrix with environment in row and ECs in columns. It is
possible to estimate the effect of several ECs but user should be
careful to have enough degrees of freedom (df) to estimate those
sensitivity curve. For example, with four environments, if estimate the
effect of a unique EC, you only have two remaining df (4 env - 1 df main
effect - 1 df sensitivity slope).
<- matrix(c(180, 310, 240, 280), 4, 1)
EC rownames(EC) <- c('CIAM', 'TUM', 'INRA', 'KWS')
colnames(EC) <- 'cum_rain'
<- QTL_effect_QxEC(mppData = mppData_GE,
Qeff trait = c('DMY_CIAM', 'DMY_TUM', 'DMY_INRA_P', 'DMY_KWS'),
env_id = c('CIAM', 'TUM', 'INRA', 'KWS'),
QTL = QTL, EC = EC)
$Qeff_EC$QTL1 Qeff
The results take the form of a list. The first element gives the
result of the comparison between main effect and QTLxE term significance
expressed in −log10(p−value). The
second term return the estimated main allelic effect as well as the QTL
allelic sensitivity and its significance in the −log10(p−value) scale. The parental QTL
allelic sensitivity can be plotted using the function
plot_QTLxEC()
. The plot represent the sensitivity curve of
the significant parental allelic effect with respect to the reference
parent.
<- plot_QTLxEC(Qeff, Q_id = 2, RP = "UH007", EC_id = 'cum rain', trait_id = 'DMY')
pl pl
We tested the proposed procedure on several sorghum BCNAM populations as well as maize NAM population and breeding MPPs. The table below show the computation time required for different configurations. Computations were realized on an AMD Ryzen 7-cores 5800 U processor with 16 GB RAM.
Populations | Ngeno | Nmarker | Nenv | Ncore | SIM [m] | CIM [h] | QTL_effects [m] | Total [h] |
---|---|---|---|---|---|---|---|---|
BCNAM Grinkan | 1598 | 51545 | 4 | 4 | 20.0 | 11.00 | 15.0 | 11.60 |
BCNAM Kenin-Keni | 575 | 51545 | 4 | 4 | 1.5 | 0.17 | 1.5 | 0.25 |
BCNAM Lata | 896 | 51545 | 3 | 4 | 3.0 | 0.33 | 2.0 | 0.40 |
EUNAM Dent | 841 | 18621 | 4 | 1 | 18.0 | 9.50 | 19.0 | 10.20 |
EUNAM Flint | 811 | 5949 | 6 | 1 | 15.0 | 16.00 | 150.0 | 19.00 |
Breeding pop1 | 2071 | 1812 | 3 | 1 | 5.0 | 1.30 | 9.0 | 1.50 |
Breeding pop2 | 820 | 1760 | 5 | 1 | 9.0 | 10.20 | 33.0 | 10.80 |
From a general point of view, even though we tried our best to speed
up the computation, it can still take several hours to perform a
complete analysis. The computation time first depends on the population
size (number of genotypes), the number of environments, as well as the
number of markers. The use of multiple cores (n.cores
argument) is a first possibility to reduce the computational time.
The computation of the CIM part including cofactors is the most
demanding. The whole procedure computational time can be largely reduced
if the user only perform a SIM (SIM_only = TRUE
in
mppGE_proc()
). Generally, the large effect QTLs are the
same or fall in similar region in the SIM and CIM analysis. The number
of cofactor position selected will also considerably extend the
computational time. For that reason we advise to restrict their number
by selecting maximum one cofactor per chrosome.
Ultimately, even if the computation time can be long we should always put that in balance with the time it required to develop the population, phenotype it and get the genotypic data. Getting quality estimation of the genetic effect is a good way to values those investments.