fitSplineHDM {statgenHTP} | R Documentation |
Fit P-Spline Hierarchical Curve Data Models
Description
Fit the P-spline Hierarchical Curve Data Model used in the second stage of
the two-stage approach proposed by Pérez-Valencia et al. (2022). This model
assumes a three-level hierarchical structure in the data, with plants nested
in genotypes, genotypes nested in populations. The input for this function
is the spatially corrected data, as obtained from the first stage of the
approach (see fitModels
and getCorrected
).
The number of segments is chosen by the user, as well as the B-spline degree,
and the penalty order for the three-levels of the hierarchy. The user can
also decide if different variances for random effects at genotype (separately
for each population) and plant (separately for each genotype) levels are
desired. The function outputs are estimated curves (time series of trajectories
and deviations) and their first and second derivatives for the three-levels
of the hierarchy. The outputs can then be used to estimate relevant parameters
from the curves for further analysis (see estimateSplineParameters
).
Usage
fitSplineHDM(
inDat,
genotypes = NULL,
plotIds = NULL,
trait,
useTimeNumber = FALSE,
timeNumber = NULL,
pop = "pop",
genotype = "genotype",
plotId = "plotId",
weights = NULL,
difVar = list(geno = FALSE, plot = FALSE),
smoothPop = list(nseg = 10, bdeg = 3, pord = 2),
smoothGeno = list(nseg = 10, bdeg = 3, pord = 2),
smoothPlot = list(nseg = 10, bdeg = 3, pord = 2),
offset = NULL,
family = gaussian(),
maxit = 200,
trace = TRUE,
thr = 0.001,
minNoTP = NULL
)
Arguments
inDat |
A data.frame with corrected spatial data. |
genotypes |
A character vector indicating the genotypes for which
hierarchical models should be fitted. If |
plotIds |
A character vector indicating the plotIds for which
hierarchical models should be fitted. If |
trait |
A character string indicating the trait for which the spline should be fitted. |
useTimeNumber |
Should the timeNumber be used instead of the timePoint?.
If |
timeNumber |
If |
pop |
A character string indicating the the populations to which each genotype/variety belongs. This variable must be a factor in the data frame. |
genotype |
A character string indicating the populations to which each genotype/variety belongs. This variable must be a factor in the data frame. |
plotId |
A character string indicating the genotypes/varieties to which each plant/plot/individual belongs. This variable must be a factor in the data frame. |
weights |
A character string indicating the column in the data containing
the weights to be used in the fitting process (for error propagation from
first stage to second stage). By default, when |
difVar |
Should different variances for random effects at genotype (separately for each population) and plant level (separately for each genotype) be considered?. |
smoothPop |
A list specifying the P-Spline model at the population level (nseg: number of segments; bdeg: degree of the B-spline basis; pord: penalty order). |
smoothGeno |
A list specifying the P-Spline model at the genotype level. |
smoothPlot |
A list specifying the P-Spline model at the plant level. |
offset |
A character string indicating the column in the data with
an a priori known component to be included in the linear predictor during
fitting. By default, when |
family |
An object of class |
maxit |
An optional value that controls the maximum number of iterations of the algorithm. The default is 200. |
trace |
An optional value that controls the function trace.
The default is |
thr |
An optional value that controls the convergence threshold of the algorithm. The default is 1.e-03. |
minNoTP |
The minimum number of time points for which data should be available for a plant. Defaults to 60% of all time points present in the TP object. No splines are fitted for plants with less than the minimum number of timepoints. |
Value
An object of class psHDM
, a list with the following outputs:
time
, a numeric vector with the timepoints.
popLevs
, a data.frame with the names of the populations
genoLevs
, a factor with the names of the genotypes.
plotLevs
, a factor with the names of the plants
nPlotPop
, a numeric vector with the number of plants per
population.
nGenoPop
, a numeric vector with the number of genotypes per
population.
nPlotGeno
, a numeric vector with the number of plants per
genotype.
MM
, a list with the design matrices at plant, genotype and
population levels.
ed
, a numeric vector with the estimated effective dimension
(or effective degrees of freedom) for each random component of the
model (intercept, slope and non-linear trend) at each level of the
hierarchy (population, genotype and plant)
tot_ed
, a numeric value with the sum of the effective
dimensions for all components of the model.
vc
, a numeric vector with the (REML) variance component
estimates for each random component of the model (intercept,
slope and non-linear trend) at each level of the hierarchy
(population, genotype and plant)
phi
, a numeric value with the error variance estimate.
coeff
, a numeric vector with the estimated fixed and random
effect coefficients.
popLevel
, a data.frame with the estimated population trajectories
and first and second order derivatives.
genoLevel
, a data.frame with the estimated genotype-specific
deviations and trajectories, and their respective first and second
order derivatives.
plotLevel
, a data.frame with the estimated plant-specific
deviations and trajectories, and their respective first and second
order derivatives.
deviance
, the (REML) deviance at convergence.
convergence
, a logical value indicating whether the algorithm
managed to converge before the given number of iterations.
dim
, a numeric vector with the (model) dimension of each
model component (fixed and/or random) at each level of the
hierarchy (population, genotype, and plant).
These values correspond to the number of parameters to be estimated.
family
, an object of class family specifying the distribution
and link function.
Vp
, the variance-covariance matrix for the coefficients.
smooth
, a list with the information about number of segments
(nseg), degree of the B-spline basis (bdeg) and penalty order (pord)
used for the three levels of the hierarchy.
References
Pérez-Valencia, D.M., Rodríguez-Álvarez, M.X., Boer, M.P. et al. A two-stage approach for the spatio-temporal analysis of high-throughput phenotyping data. Sci Rep 12, 3177 (2022). doi:10.1038/s41598-022-06935-9
See Also
Other functions for fitting hierarchical curve data models:
plot.psHDM()
,
predict.psHDM()
Examples
## The data from the Phenovator platform have been corrected for spatial
## trends and outliers for single observations have been removed.
head(spatCorrectedArch)
ggplot2::ggplot(data = spatCorrectedArch,
ggplot2::aes(x= timeNumber, y = LeafArea_corr, group = plotId)) +
ggplot2::geom_line(na.rm = TRUE) +
ggplot2::facet_grid(~geno.decomp)
## We need to specify the genotype-by-treatment interaction.
## Treatment: water regime (WW, WD).
spatCorrectedArch[["treat"]] <- substr(spatCorrectedArch[["geno.decomp"]],
start = 1, stop = 2)
spatCorrectedArch[["genoTreat"]] <-
interaction(spatCorrectedArch[["genotype"]],
spatCorrectedArch[["treat"]], sep = "_")
## Fit P-Splines Hierarchical Curve Data Model for selection of genotypes.
fit.psHDM <- fitSplineHDM(inDat = spatCorrectedArch,
trait = "LeafArea_corr",
useTimeNumber = TRUE,
timeNumber = "timeNumber",
genotypes = c("GenoA14_WD", "GenoA51_WD",
"GenoB11_WW", "GenoB02_WD",
"GenoB02_WW"),
pop = "geno.decomp",
genotype = "genoTreat",
plotId = "plotId",
weights = "wt",
difVar = list(geno = FALSE, plot = FALSE),
smoothPop = list(nseg = 4, bdeg = 3, pord = 2),
smoothGeno = list(nseg = 4, bdeg = 3, pord = 2),
smoothPlot = list(nseg = 4, bdeg = 3, pord = 2),
trace = FALSE)
## Visualize the data.frames with predicted values at the three levels of
## the hierarchy.
# Population level
head(fit.psHDM$popLevel)
# Genotype level
head(fit.psHDM$genoLevel)
# Plot level
head(fit.psHDM$plotLevel)