R: Designing new corrFamily descriptors for parametric...

corrFamily-design {spaMM}

R Documentation

Designing new corrFamily descriptors for parametric correlation families

Description

This documentation describe additional design features to be taken into account when defining a new corrFamily descriptor for a correlation model. Using such a descriptor will be more efficient than the equally general method, of maximizing an objective function of the correlation parameters that calls (say) fitme() on a model including a corrMatrix itself function of the correlation parameters. But this may still be inefficient if a few issues are ignored.

For elements of the corrFamily descriptor for basic cases:

Cf: The function value should (1) be of constant class for all parameter values. For families of mathematically sparse matrices, the CsparseMatrix class is recommended (and more specifically the dsCMatrix class since the matrix is symmetric); (2) have row names that match the levels of the grouping factor (the nested random effect Example shows the code needed when this nested effect is defined from two variables).
tpar: In order to favor the automatic selection of suitable algorithms, tpar should be chosen so that Cf(tpar) is least sparse (i.e., has the minimal number of elements equal to zero) in the correlation family, in terms of its sparsity and of the sparsity of its inverse. A tpar yielding an identity matrix is often a *bad* template as least sparse correlation matrices and their inverses are denser for most families except diagonal ones. For degerate corrFamily objects that describe a constant correlation model rather than a parametric family, use tpar=numeric(0).
type: Do not forget type="precision" it if the return value of Cf is an inverse correlation matrix rather than a correlation matrix, in which case one should specify .
calc_moreargs: should have formal arguments including at least corrfamily and .... The source code of ARp, ARMA or diallel shows the expected structure of its return value.

For advanced features of the corrFamily descriptor:

Af: Af has (minimally) three formal arguments (newdata, term, ...). spaMM will call Af with distinct values of the newdata argument for the fit, and for predictions for new data. For the curious: the term argument that will be provided by spaMM to Af is the formula term for the random effect – an object of class call, as obtained e.g. by
( ~ 1+ corrFamily(1 | longitude + latitude))[[2]][[3]] –, which will provide the names of the variables that need to be taken from the newdata to construct the matrix returned by Af.

Details

spaMM will regularize invalid or nearly-singular correlation or covariance matrices internally if the correlation function has not done so already, but it it better to control this in the correlation function. The regularize convenience function is available for that purpose, but parametrizations that avoid the need for regularization are even better, since fitting models with nearly-singular correlation matrices is prone to various difficulties (The Toeplitz example below is good to illustrate potential problems but is otherwise poor as it produces non-positive definite matrices; the ARp constructor illustrates a parametrization that avoids that problem).
Users should make sure that any regularized matrix still belongs to the intended parametric family of matrices, and they should keep in mind that the spaMM output will show the input parameters of the unregularized matrix, not the parameters of the regularized one (e.g., in the Toeplitz example below, the fitted matrix is a regularized Toepliz matrix with slightly different coefficients than the input parameters).

And for efficiency,
Let us repeat that the correlation function should return matrices of constant class, and in sparse format when the matrices are indeed mathematically sparse. For mathematically dense matrices (as in the Toeplitz example below), the dsyMatrix class may be suitable.
Let us repeat that in order to favor the automatic selection of suitable algorithms, tpar should be chosen so that Cf(tpar) is least sparse in the correlation family. For matrices of CsparseMatrix, a check is implemented to catch wrong choices of tpar.
For challenging problems (large data, many parameters...) it may pay to optimize a bit the correlation function. The Example of nested effects with heterogenous variance below illustrates a possible trick. In the same cases, It may also pay to try the alternative algebraic methods, by first comparing speed of the different methods (control.HLfit=list(algebra= <"spprec"|"spcorr"|"decorr">)) for given correlation parameter values, rather than to assume that spaMM will find the best method (even if it often does so).
The corrFamily descriptor may optionally contain booleans possiblyDenseCorr and sparsePrec to help spaMM select the most appropriate matrix algebraic methods. sparsePrec should be set to TRUE if sparse-precision methods are expected to be efficient for fitting the random effect. possiblyDenseCorr should be set to FALSE if the correlation matrix is expected to be sparse, which means here that less than 15% of its elements are non-zero.

Examples

if (spaMM.getOption("example_maxtime")>2 &&
      requireNamespace("agridat", quietly = TRUE)) {

data("onofri.winterwheat", package="agridat")

##### Fitting a Toeplitz correlation model for temporal correlations #####

# A Toeplitz correlation matrix of dimension d*d has d-1 parameters 
# (by symmetry, and with 1s on the main diagonal). These d-1 parameters 
# can be fitted as follows:

Toepfn <- function(v) {
  toepmat <- Matrix::forceSymmetric(toeplitz(c(1,v))) # dsyMatrix
  # Many of the matrices in this family are not valid correlation matrices;
  #   the regularize() function is handy here:
  toepmat <- regularize(toepmat, maxcondnum=1e12)
  # And don't forget the rownames!
  rownames(toepmat) <- unique(onofri.winterwheat$year)
  toepmat
}

(Toepfit <- spaMM::fitme(
  yield ~ gen + corrFamily(1|year), data=onofri.winterwheat, method="REML",
  covStruct=list(corrFamily=list(Cf=Toepfn, tpar=rep(1e-4,6))), 
         # (Note the gentle warning if one instead uses tpar=rep(0,6) here)  
  lower=list(corrPars=list("1"=rep(-0.999,6))), 
  upper=list(corrPars=list("1"=rep(0.999,6))))) 

# The fitted matrix is (nearly) singular, and was regularized:

eigen(Corr(Toepfit)[[1]])$values

# which means that the returned likelihood may be inaccurate, 
# and also that the actual matrix elements differ from input parameters:

Corr(Toepfit)[[1]][1,-1]  

### The usual rules for specifying covStruct, 'lower', 'upper' and 'init' apply
# here when the corrFamily term is the second random-effect:

(Toep2 <- spaMM::fitme(
        yield ~ 1 + (1|gen) + corrFamily(1|year), data=onofri.winterwheat, method="REML",
        covStruct=list("1"=NULL, corrFamily=list(Cf=Toepfn, tpar=rep(1e-4,6))), 
        , init=list(corrPars=list("2"=rep(0.1,6))),
        lower=list(corrPars=list("2"=rep(-0.999,6))), 
        upper=list(corrPars=list("2"=rep(0.999,6)))))

##### Fitting one variance among years per each of 8 genotypes. #####

# First, note that this can be *more efficiently* fitted by another syntax:

### Fit as a constrained random-coefficient model: 
    
# Diagonal matrix of NA's, represented as vector for its lower triangle:
ranCoefs_for_diag <- function(nlevels) { 
  vec <- rep(0,nlevels*(nlevels+1L)/2L)
  vec[cumsum(c(1L,rev(seq(nlevels-1L)+1L)))] <- NA
  vec
} 
    
(by_rC <- spaMM::fitme(yield ~ 1 + (0+gen|year), data=onofri.winterwheat, method="REML",
                       fixed=list(ranCoefs=list("1"=ranCoefs_for_diag(8)))))
                         
### Fit as a corrFamily model:   

gy_levels <- paste0(gl(8,1,length =56,labels=levels(onofri.winterwheat$gen)),":",
                        gl(7,8,labels=unique(onofri.winterwheat$year)))
                        
# A log scale is often suitable for variances, hence is used below;

# a correct but crude implementation of the model is
diagf <- function(logvar) {
  corr_map <- kronecker(Matrix::.symDiagonal(n=7),diag(x=exp(logvar)))
  rownames(corr_map) <- gy_levels
  corr_map
}

# but we can minimize matrix operations as follows:

corr_map <- Matrix::.symDiagonal(n=8,x=seq(8))
rownames(corr_map) <- unique(onofri.winterwheat$gen)
      
diagf <- function(logvar) {
  corr_map@x <- exp(logvar)[corr_map@x]
  corr_map
}                 # (and this returns a dsCMatrix)
      
(by_cF <- spaMM::fitme(
        yield ~ 1 + corrFamily(1|gen %in% year), data=onofri.winterwheat, method="REML",
        covStruct=list(corrFamily = list(Cf=diagf, tpar=rep(1,8))), 
        fixed=list(lambda=1),            # Don't forget to fix this redundant parameter!
        # init=list(corrPars=list("1"=rep(log(O.1),8))), # 'init' optional 
        lower=list(corrPars=list("1"=rep(log(1e-6),8))), # 'lower' and 'upper' required
        upper=list(corrPars=list("1"=rep(log(1e6),8)))))

# => The 'gen' effect is nested in the 'year' effect and this must be specified in the 
# right-hand side of corrFamily(1|gen %in% year) so that the design matrix 'Z' for the 
# random effects to have the correct structure. And then, as for other correlation
# structures (say Matern) it should be necessary to specify only the correlation matrix 
# for a given year, as done above. Should this fail, it is also possible to specify the 
# correlation matrix over years, as done below. spaMM will automatically detect, from   
# its size matching the number of columns of Z, that it must be the matrix over years.  

corr_map <- Matrix::forceSymmetric(kronecker(Matrix::.symDiagonal(n=7),diag(x=seq(8))))
rownames(corr_map) <- gy_levels

diagf <- function(logvar) {
  corr_map@x <- exp(logvar)[corr_map@x]
  corr_map
}                 # (and this returns a dsCMatrix)

(by_cF <- spaMM::fitme(
  yield ~ 1 + corrFamily(1|gen %in% year), data=onofri.winterwheat, method="REML",
  covStruct=list(corrFamily = list(Cf=diagf, tpar=rep(1,8))), 
  fixed=list(lambda=1),            # Don't forget to fix this redundant parameter!
  # init=list(corrPars=list("1"=rep(log(O.1),8))), # 'init' optional 
  lower=list(corrPars=list("1"=rep(log(1e-6),8))), # 'lower' and 'upper' required
  upper=list(corrPars=list("1"=rep(log(1e6),8)))))

exp(get_ranPars(by_cF)$corrPars[[1]]) # fitted variances 
  
}

[Package spaMM version 4.5.0 Index]