R: Estimating mixed Vector Autoregressive Model (mVAR)

mvar {mgm}

R Documentation

Estimating mixed Vector Autoregressive Model (mVAR)

Description

Estimates mixed Vector Autoregressive Model (mVAR) via elastic-net regularized Generalized Linear Models

Usage

mvar(data, type, level, lambdaSeq, lambdaSel, lambdaFolds,
     lambdaGam, alphaSeq, alphaSel, alphaFolds, alphaGam, lags,
     consec, beepvar, dayvar, weights, threshold, method, binarySign, 
     scale, verbatim, pbar, warnings, saveModels, saveData, 
     overparameterize, thresholdCat, signInfo, ...)

Arguments

`data`	n x p data matrix.
`type`	p vector indicating the type of variable for each column in `data`. "g" for Gaussian, "p" for Poisson, "c" for categorical.
`level`	p vector indicating the number of categories of each variable. For continuous variables set to 1.
`lambdaSeq`	A sequence of lambdas that should be searched (see also `lambdaSel`). Defaults to `NULL`, which uses the `glmnet` default to select a lambda candidate sequence (recommended). See `?glmnet` for details.
`lambdaSel`	Specifies the procedure for selecting the tuning parameter controlling the Lq-penalization. The two options are cross validation "CV" and the Extended Bayesian Information Criterion (EBIC) "EBIC". The EBIC performs well in selecting sparse graphs (see Barber and Drton, 2010 and Foygel and Drton, 2014). Note that when also searching the alpha parameter in the elastic net penalty, cross validation should be preferred, as the parameter vector will not necessarily be sparse anymore. The EBIC tends to be a bit more conservative than CV (see Haslbeck and Waldorp, 2016). CV can sometimes not be performed with categorical variables, because `glmnet` requires at least 2 events of each category of each categorical variable in each training-fold. Defaults to `lambdaSel = "CV"`.
`lambdaFolds`	Number of folds in cross validation if `lambdaSel = "CV"`.
`lambdaGam`	Hyperparameter gamma in the EBIC if `lambdaSel = "EBIC"`. Defaults to `lambdaGam = .25`.
`alphaSeq`	A sequence of alpha parameters for the elastic net penality in [0,1] that should be searched (see also `alphaSel`). Defaults to `alphaSeq = 1`, which means that the lasso is being used. `alphaSeq = 0` corresponds to an L2-penalty (Ridge regression). For details see Friedman, Hastie and Tibshirani (2010).
`alphaSel`	Specifies the procedure for selecting the alpha parameter in the elastic net penalty. The two options are cross validation "CV" and the Extended Bayesian Information Criterion (EBIC) "EBIC". The EBIC performs well in selecting sparse graphs (see Barber and Drton, 2010 and Foygel and Drton, 2014). Note that when also searching the alpha parameter in the elastic net penalty, cross validation should be preferred, as the parameter vector will not necessarily be sparse anymore. The EBIC tends to be a bit more conservative than CV (see Haslbeck and Waldorp, 2016). CV can sometimes not be performed with categorical variables, because `glmnet` requires at least 2 events of each category of each categorical variable in each training-fold. Defaults to `alphaSel = "CV"`.
`alphaFolds`	Number of folds in cross validation if `alphaSel = "CV"`
`alphaGam`	Hyperparameter gamma in the EBIC if `alphaSel = "EBIC"`. Defaults to `alphaGam = .25`.
`lags`	Vector of positive integers indicating the lags included in the mVAR model (e.g. 1:3 or c(1,3,5))
`consec`	An integer vector of length n, indicating the consecutiveness of measurement points of the rows in `data`. This means that rows for which the necessary (defined by the specified VAR model) measurements at previous time points are not available are excluded from the analysis. For instance, for a VAR model with lag 1 a consec vector of `consec = c(1,2,3,5)` would mean that the fourth row is excluded from the analysis, since no measurement 5-1=4 is available (next to the first row, for which also no previous measurement can be available). This is useful in many applications in which measurements are missing randomly or due to the design of the data collection (for example, respondents only respond during the hours they are awake). The "trimmed" dataset is returned in `call$data_lagged` if `saveData = TRUE`. Defaults to `consec = NULL`, which assumes that all measurements are consecutive, i.e. `consec = 1:n`. In this case only the first `max(lags)` lags are excluded to obtain the VAR design matrix.
`beepvar`	Together with the argument `dayvar`, this argument is an alternative to the `consec` argument (see above) to specify the consecutiveness of measurements. This is tailored to experience sampling method (ESM) studies, where the consectutiveness is defined by the number of notification on a given day (`beepvar`) and the given day (`dayvar`).
`dayvar`	See `beepvar`.
`weights`	A vector with n - max(lags) entries, indicating the weight for each observation. The mVAR design matrix has with n - max(lags) rows, because the first row must be predictable by the highest lag. The weights have to be on the scale [0, n - max(lags) ].
`threshold`	A threshold below which edge-weights are put to zero. This is done in order to guarantee a lower bound on the false-positive rate. `threshold = "LW"` refers to the threshold in Loh and Wainwright (2013), which was used in all previous versions of mgm. `threshold = "HW"` refers to the threshold in Haslbeck and Waldorp (2016). If `threshold = "none"` no thresholding is applied. Defaults to `threshold = "LW"`.
`method`	Estimation method, currently only `method = "glm"`.
`binarySign`	If `binarySign = TRUE`, a sign for the interaction within binary nodes and between binary and continuous nodes is provided in the output. Note that in this case the two categories of the binary variables have to be coded in 0,1. This is to ensure that the interpretation of the sign is unambigous: a positive sign of a parameter means that increasing the associated predictor results in a higher probability for category 1.
`scale`	If `scale = TRUE`, all Gaussian nodes (specified by `"g"` in the `type` argument) are centered and divided by their standard deviation. Scaling is recommended, because otherwise the penalization of a parameter depends on the variance of the associated predictor.
`verbatim`	If `verbatim = TRUE`, no warnings and no progress bar is shown. Defaults to `verbatim = FALSE`.
`pbar`	If `pbar = TRUE`, a progress bar is shown. Defaults to `pbar = TRUE`.
`warnings`	If `warnings = TRUE`, no warnigns are returned. Defaults to `warnings = FALSE`.
`saveModels`	If `saveModels = FALSE`, only information about the weighted adjacency matrix, and if d > 1 about the factor graph is provided in the output list. If `saveModels = TRUE`, all fitted parameters are additionally returned.
`saveData`	If `saveData = TRUE`, the data is saved in the output list. Defaults to `saveData = FALSE`.
`overparameterize`	If `overparameterize = TRUE`, `mgm()` uses over-parameterized design-matrices for each neighborhood regression; this means that a cross-lagged effect between two categorical variables with m and s categories is parameterized by ms parameters. If `overparameterize = FALSE` the standard parameterization (in `glmnet`) with m(s-1) parameters is used, where the first category of the predicting variable serves as reference category. If all variables are continuous both parameterizations are the same. The default is set to `overparameterize = FALSE`.
`thresholdCat`	If `thresholdCat = FALSE`, the thresholds of categorical variables are set to zero. Defaults to `thresholdCat = TRUE` for which the thresholds are esimated.
`signInfo`	If `signInfo = TRUE`, a message is shown in the console, indicating that the sign of estimates is stored separately. Defaults to `signInfo = TRUE`.
`...`	Additional arguments.

Details

See Haslbeck and Waldorp (2018) for details about how the mixed VAR model is estimated.

Value

The function returns a list with the following entries:

`call`	Contains all provided input arguments. If `saveData = TRUE`, it also contains the data.
`wadj`	A p x p x n_lags array, in which rows are predicted by columns, i.e. entry `wadj[1, 2, 4]` corresponds to the parameter(s) of variable 2 at time point t predicting variable 1 at time point t - z, where z is the fourth specified lag in `lags` and n_lags is the number of specified lags in `lags`. For interactions that involve more than two parameters (e.g. always for categorical variables with more than 2 categories), we take the arithmetic mean of the absolute value of all parameters. The full set of estimated parameters is saved in `rawlags` (see below).
`signs`	A p x p x n_lags array, specifying the signs corresponding to the entries of `wadj` (if defined), where n_lags is the number of specified lags in `lags`. 1/-1 indicate positive and negative relationships, respectively. 0 indicates that no sign is defined, which is the case for interactions that involve a categorical variable where an interaction can have more than one parameter. If `binarySign = TRUE`, a sign is calculated for interactions between binary variables and binary and continuous variables, where the interaction is still defined by one parameter and hence a sign can be specified. `NA` indicates that the corresponding parameter in `wadj` is zero.
`edgecolor`	A p x p x n_lags array of colors indicating the sign of each parameter. This array contains the same information is `signs` and is included for convenient plotting.
`rawlags`	List with entries equal to the number of specified lags in `lags`. Each entry is a nested list, with each p entries: the first level indicates the predicted variable, the second level the predictor variable. In case of categorical variables, interactions have more than one parameter.
`intercepts`	A list with p entries, which contain the intercept/thresholds for each node. In case a given node is categorical with m categories, there are m thresholds for this variable.
`nodemodels`	A list with p `glmnet()` models, from which all above output is computed. Also contains the coefficients `models` for the selected lambda and the applied tau threshold `tau`.

Author(s)

Jonas Haslbeck <jonashaslbeck@gmail.com>

References

Barber, R. F., & Drton, M. (2015). High-dimensional Ising model selection with Bayesian information criteria. Electronic Journal of Statistics, 9(1), 567-607.

Foygel, R., & Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In Advances in neural information processing systems (pp. 604-612).

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1), 1.

Haslbeck, J. M. B., & Waldorp, L. J. (2020). mgm: Estimating time-varying Mixed Graphical Models in high-dimensional Data. Journal of Statistical Software, 93(8), pp. 1-46. DOI: 10.18637/jss.v093.i08

Loh, P. L., & Wainwright, M. J. (2012, December). Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In NIPS (pp. 2096-2104).

Yang, E., Baker, Y., Ravikumar, P., Allen, G. I., & Liu, Z. (2014, April). Mixed Graphical Models via Exponential Families. In AISTATS (Vol. 2012, pp. 1042-1050).

Examples


## Not run: 


## We generate data from a mixed VAR model and then recover the model using mvar()

# 1) Define mVAR model
p <- 6 # Six variables
type <- c("c", "c", "c", "c", "g", "g") # 4 categorical, 2 gaussians
level <- c(2, 2, 4, 4, 1, 1) # 2 categoricals with m=2, 2 categoricals with m=4, two continuous
max_level <- max(level)

lags <- c(1, 3, 9) # include lagged effects of order 1, 3, 9
n_lags <- length(lags)

# Specify thresholds
thresholds <- list()
thresholds[[1]] <- rep(0, level[1])
thresholds[[2]] <- rep(0, level[2])
thresholds[[3]] <- rep(0, level[3])
thresholds[[4]] <- rep(0, level[4])
thresholds[[5]] <- rep(0, level[5])
thresholds[[6]] <- rep(0, level[6])

# Specify standard deviations for the Gaussians
sds <- rep(NULL, p)
sds[5:6] <- 1

# Create coefficient array
coefarray <- array(0, dim=c(p, p, max_level, max_level, n_lags))

# a.1) interaction between continuous 5<-6, lag=3
coefarray[5, 6, 1, 1, 2] <- .4
# a.2) interaction between 1<-3, lag=1
m1 <- matrix(0, nrow=level[2], ncol=level[4])
m1[1,1:2] <- 1
m1[2,3:4] <- 1
coefarray[1, 3, 1:level[2], 1:level[4], 1] <- m1
# a.3) interaction between 1<-5, lag=9
coefarray[1, 5, 1:level[1], 1:level[5], 3] <- c(0, 1)


# 2) Sample
set.seed(1)
dlist <- mvarsampler(coefarray = coefarray,
                     lags = lags,
                     thresholds = thresholds,
                     sds = sds,
                     type = type,
                     level = level,
                     N = 200,
                     pbar = TRUE)

# 3) Recover
set.seed(1)
mvar_obj <- mvar(data = dlist$data,
                 type = type,
                 level = level,
                 lambdaSel = "CV",
                 lags = c(1, 3, 9),
                 signInfo = FALSE,
                 overparameterize = F)

# Did we recover the true parameters?
mvar_obj$wadj[5, 6, 2] # cross-lagged effect of 6 on 2 over lag lags[2]
mvar_obj$wadj[1, 3, 1] # cross-lagged effect of 3 on 1 over lag lags[1]
mvar_obj$wadj[1, 5, 3] # cross-lagged effect of 1 on 5 over lag lags[3]

# How to get the exact parameter estimates?
# Example: the full parameters for the crossed-lagged interaction of 2 on 1 over lag lags[1]
mvar_obj$rawlags[[1]][[1]][[2]] 

# 4) Predict / Compute nodewise Error
pred_mvar <- predict.mgm(mvar_obj, dlist$data)

head(pred_mvar$predicted) # first 6 rows of predicted values
pred_mvar$errors # Nodewise errors

# For more examples see https://github.com/jmbh/mgmDocumentation


## End(Not run)

[Package mgm version 1.2-14 Index]