R: Simulate or extend multivariate abundance data

extend.cord {ecopower}

R Documentation

Simulate or extend multivariate abundance data

Description

extend returns a simulated response matrix or a manyglm object with N observations and simulated response matrix that utilises the existing correlation structure of the data.

Usage

## S3 method for class 'cord'
extend(
  object,
  N = nrow(object$obj$data),
  coeffs = coef(object$obj),
  newdata = NULL,
  n_replicate = NULL,
  do.fit = FALSE,
  seed = NULL
)

extend(
  object,
  N = nrow(object$obj$data),
  coeffs = coef(object$obj),
  newdata = NULL,
  n_replicate = NULL,
  do.fit = FALSE,
  seed = NULL
)

Arguments

`object`	objects of class `cord`, typically the result of a call to `cord`.
`N`	Number of samples to be extended. Defaults to the number of observations in the original sample.
`coeffs`	Coefficient matrix for a `manyglm` object that characterises the size of effects to be simulated. See `effect_alt` for help in producing this matrix. Defaults to the coefficient matrix from the `cord` object, `coef(object$obj)`.
`newdata`	Data frame of same size as the original X covariates from the fitted `object`, that specifies a different design of interest. Defaults to `NULL`.
`n_replicate`	Number of unique replicates of the original data frame. Defaults to `NULL`, overwrites `N` if specified.
`do.fit`	Logical. If `TRUE`, fits a `manyglm` object from the simulated data. Defaults to `FALSE`.
`seed`	Random number seed, defaults to a random seed number.

Details

extend takes a cord object and returns a new simulated response matrix or an "extended" manyglm object with N observations and the new simulated response matrix. Response abundances are simulated through a Gaussian copula model that utilises a coefficient matrix coeffs, the specified cord model and the joint correlation structure exhibited between the response variables. To help with the specification of coeffs, see effect_alt which simplifies this process.

Response variables are simulated through a copula model by first extracting Gaussian copular scores as Dunn-Smyth residuals (Dunn & Smyth 1996), which are obtained from abundances y_{ij} with marginal distributions F_j which have been specified via the original manyglm model (fit.glm; see examples);

z_{ij} = \Phi^{-1}{F_{j}(y_{ij}^-) + u_{ij} f_{j}(y_{ij})}

These scores then follow a multivariate Gaussian distribution with zero mean and covariance structure \Sigma,

z_{ij} \sim N_p(0,\Sigma)

To avoid estimating a large number p(p-1)/2 pairwise correlations within \Sigma, factor analysis is utilised with two latent factor variables, which can be interpreted as an unobserved environmental covariate.

Thus, in order to simulate new multivariate abundances we simulate new copula scores and back transform them to abundances as y_{ij}= {F^*}_j^{-1}(\Phi(z_{ij})), where the coefficient matrix coeffs specifies the effect size within the new marginal distributions {F^*}_j.

The data frame is also extended in a manner that preserves the original design structure. This is done by first repeating the design matrix until the number of samples exceeds N, then randomly removing rows from the last repeated data frame until the number of samples equals N. Alternatively, a balanced design structure can be obtained by specifying the number of replicates.

newdata can be utilised if a different data frame is wanted for simulation.

If users are interested in obtaining a manyglm model, do.fit=TRUE can be used to obtain a manyglm object from the simulated responses.

Value

Simulated data or manyglm object.

Functions

extend(): Simulate or extend multivariate abundance data

References

Dunn, P.K., & Smyth, G.K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics 5, 236-244.

Examples

library(ecoCopula)
library(mvabund)
data(spider)
spiddat = mvabund(spider$abund)
X = data.frame(spider$x)

# Specify increasers and decreasers
increasers = c("Alopacce", "Arctlute", "Arctperi", "Pardnigr", "Pardpull")
decreasers = c("Alopcune", "Alopfabr", "Zoraspin")

# Simulate data
fit.glm = manyglm(spiddat~1, family="negative.binomial")
fit.cord = cord(fit.glm)
simData = extend(fit.cord)

# Simulate data with N=20
fit.glm = manyglm(spiddat~soil.dry, family="negative.binomial", data=X)
fit.cord = cord(fit.glm)
simData = extend(fit.cord, N=20)

# Obtain a manyglm fit from simulated data with N=10 and effect_size=1.5
X$Treatment = rep(c("A","B","C","D"),each=7)
fit_factors.glm = manyglm(spiddat~Treatment, family="negative.binomial", data=X)
effect_mat = effect_alt(fit_factors.glm, effect_size=1.5,
     increasers, decreasers, term="Treatment")
fit_factors.cord = cord(fit_factors.glm)
newFit.glm = extend(fit_factors.cord, N=10,
     coeffs=effect_mat, do.fit=TRUE)

# Change sampling design
X_new = X
X_new$Treatment[6:7] = c("B","B")
simData = extend(fit_factors.cord, N=NULL,
   coeffs=effect_mat, newdata=X_new, n_replicate=5)

[Package ecopower version 0.2.0 Index]