gaussian_ppmx {ppmSuite}    R Documentation
Function that fits the Gaussian PPMx model
Description
gaussian_ppmx is the main function used to fit the Gaussian PPMx model.
Usage
gaussian_ppmx(y, X=NULL, Xpred=NULL,
meanModel=1,
cohesion=1,
M=1,
PPM = FALSE,
similarity_function=1,
consim=1,
calibrate=0,
simParms=c(0.0, 1.0, 0.1, 1.0, 2.0, 0.1, 1),
modelPriors=c(0, 100^2, 1, 1),
mh=c(0.5, 0.5),
draws=1100,burn=100,thin=1,
verbose=FALSE)
Arguments
y: numeric vector for the response variable.
X: a data frame whose columns consist of covariates that will be incorporated in the partition model. Columns of class "character" or "factor" are treated as categorical covariates; all others are treated as continuous covariates.
Xpred: a data frame containing covariate values for which out-of-sample predictions are desired. The format and column order of Xpred must match those of X.
meanModel: type of mean model included in the likelihood. Options are 1 or 2.
cohesion: type of cohesion function to use in the PPMx prior. Options are 1 or 2.
M: precision parameter. Default is 1.
PPM: logical argument indicating whether the PPM or the PPMx partition model should be employed. If PPM = FALSE, then an X matrix must be supplied.
similarity_function: type of similarity function employed for the PPMx prior on partitions. Options are 1-4.
consim: if similarity_function is set to either 1 or 2, then consim specifies the type of marginal likelihood used as the similarity function. Options are 1 or 2.
calibrate: indicates whether the similarity should be calibrated. Options are 0-2.
simParms: vector of parameter values employed in the similarity function of the PPMx.
modelPriors: vector of prior parameter values for the priors assigned to parameters of the Gaussian data model.
mh: two-dimensional vector containing tuning parameter values for the Metropolis-Hastings updates of sigma2 and sigma20.
draws: number of MCMC iterates to be collected. Default is 1100.
burn: number of MCMC iterates discarded as burn-in. Default is 100.
thin: number by which the MCMC chain is thinned. Default is 1. thin must be chosen so that it divides (draws - burn) evenly; see the sketch after this argument list.
verbose: logical indicating whether information regarding the data and MCMC iterates should be printed to the screen.
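As a quick illustration (arithmetic only, using the default values shown in Usage), the number of retained posterior samples is determined jointly by draws, burn, and thin:
draws <- 1100; burn <- 100; thin <- 1
nout <- (draws - burn) / thin   # 1000 retained iterates; thin must divide (draws - burn)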
Details
This function fits a Gaussian PPM or PPMx model as detailed in Mueller, Quintana, and Rosner (2011). The data model is a Gaussian distribution with cluster-specific means and variances. If meanModel = 2, then a “global” regression component is added to the mean. Conjugate priors are used for cluster-specific means, while uniform priors are used for variance components. A variety of options associated with the similarity function of the PPMx are available. See Page and Quintana (2018) and Mueller, Quintana, and Rosner (2011) for more details.
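As a minimal sketch of the meanModel = 2 option (assuming the Ytrain and Xtrain objects constructed in the Examples section below; fit_reg is an illustrative name), the global regression coefficients are returned in the beta element of the output:
fit_reg <- gaussian_ppmx(y=Ytrain, X=Xtrain, meanModel=2,
                         similarity_function=1, consim=1,
                         draws=1100, burn=100, thin=1)
apply(fit_reg$beta, 2, mean)   # posterior means of the regression coefficients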
If the covariate matrix contains missing values, then the approach described in Page, Quintana, and Mueller (2022) is automatically employed. Missing values must be denoted using "NA". Currently, NAs cannot be accommodated if a “global” regression is desired.
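As a brief sketch (again assuming the Xtrain and Ytrain objects from the Examples below, with missingness introduced here purely for illustration):
Xmiss <- Xtrain
Xmiss$length[c(1, 5)] <- NA             # artificially introduce missing covariate values
fit_miss <- gaussian_ppmx(y=Ytrain, X=Xmiss, meanModel=1,
                          similarity_function=1, consim=1,
                          draws=1100, burn=100, thin=1)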
We recommend standardizing covariates so that they have mean zero and standard deviation one. This makes the default values provided for the similarity function reasonable in most cases. If covariates are standardized and meanModel = 2, then the regression coefficients are estimated on the original scale and are ordered so that the continuous covariates appear first and the categorical covariates follow.
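A minimal sketch of this standardization (assuming a data frame X with a continuous column named length, as in the Examples below):
Xstd <- X
Xstd$length <- as.numeric(scale(Xstd$length))   # mean zero, standard deviation one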
The MCMC algorithm used to sample from the joint posterior distribution is based on Algorithm 8 of Neal (2000).
Value
The function returns a list containing arrays filled with MCMC iterates corresponding to model parameters and model fit metrics. In order to provide more detail, in what follows let
"T" be the number of MCMC iterates collected,
"N" be the number of observations,
"P" be the number of predictions, and
"C" be the total number of covariates.
The output list contains the following:
mu - a matrix of dimension (T, N) containing MCMC iterates associated with each subject's mean parameter (mu*_{c_i}).
sig2 - a matrix of dimension (T, N) containing MCMC iterates associated with each subject's variance parameter (sigma2*_{c_i}).
beta - if meanModel = 2, then this is a matrix of dimension (T, C) containing MCMC iterates associated with the coefficients of the global regression.
Si - a matrix of dimension (T, N) containing MCMC iterates associated with each subject's cluster label.
mu0 - vector of length T containing MCMC iterates for the mu0 parameter.
sig20 - vector of length T containing MCMC iterates for sig20.
nclus - vector of length T containing the number of clusters at each MCMC iterate.
like - a matrix of dimension (T, N) containing likelihood values at each MCMC iterate.
WAIC - scalar containing the WAIC value.
lpml - scalar containing the LPML value.
fitted.values - a matrix of dimension (T, N) containing fitted values (or in-sample predictions) for each subject at each MCMC iterate.
ppred - a matrix of dimension (T, P) containing out-of-sample predictions for each "new" subject at each MCMC iterate of the posterior predictive distribution.
predclass - a matrix of dimension (T, P) containing MCMC iterates of the cluster to which each "new" subject is allocated.
rbpred - a matrix of dimension (T, P) containing out-of-sample predictions for each "new" subject at each MCMC iterate based on the Rao-Blackwellized prediction.
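For instance, posterior summaries and fit metrics can be extracted from the returned list as follows (a sketch assuming a fitted object out1 like the one created in the Examples below):
fit.mn  <- apply(out1$fitted.values, 2, mean)   # posterior mean fitted value per subject
pred.mn <- apply(out1$ppred, 2, mean)           # posterior mean out-of-sample prediction
c(WAIC = out1$WAIC, lpml = out1$lpml)           # fit metrics useful for model comparison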
Examples
data(bear)
# plot length, sex, and weight of bears
ck <- c(4,3,2)
pairs(bear[,ck])
# response is weight
Y <- bear$weight
# Continuous Covariate is length of chest
# Categorical covariate is sex
X <- bear[,c("length", "sex")]
X$sex <- as.factor(X$sex)
# Randomly partition data into 44 training and 10 testing
set.seed(1)
trainObs <- sample(1:length(Y),44, replace=FALSE)
Ytrain <- Y[trainObs]
Ytest <- Y[-trainObs]
Xtrain <- X[trainObs,,drop=FALSE]
Xtest <- X[-trainObs,,drop=FALSE]
simParms <- c(0.0, 1.0, 0.1, 1.0, 2.0, 0.1)
modelPriors <- c(0, 100^2, 0.5*sd(Y), 100)
M <- 1.0
niter <- 100000
nburn <- 50000
nthin <- 50
nout <- (niter - nburn)/nthin
mh <- c(1,10)
# Run MCMC algorithm for Gaussian PPMx model
out1 <- gaussian_ppmx(y=Ytrain, X=Xtrain, Xpred=Xtest,
M=M, PPM=FALSE,
meanModel = 1,
similarity_function=1,
consim=1,
calibrate=0,
simParms=simParms,
modelPriors = modelPriors,
draws=niter, burn=nburn, thin=nthin,
mh=mh)
# plot a select few posterior distributions
plot(density(out1$mu[,1])) # first observation's mean
plot(density(out1$sig2[,1])) # first observation's variance
plot(table(out1$nclus)/nout, type='h') # distribution of the number of clusters
plot(density(out1$mu0), type='l')
plot(density(out1$sig20))
# The first partition iterate is used for plotting
# purposes only. We recommend using the salso
# R-package to estimate the partition based on Si
pairs(bear[trainObs,ck],col=out1$Si[1,], pch=out1$Si[1,])
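# A sketch of the recommended approach using the salso package
# (requires the salso package to be installed; not run here)
# library(salso)
# SiEst <- as.numeric(salso(out1$Si))   # point estimate of the partition
# pairs(bear[trainObs,ck], col=SiEst, pch=SiEst)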
# Compare fit and predictions when covariates are not included
# in the partition model. That is, refit data with PPM rather than PPMx
out2 <- gaussian_ppmx(y=Ytrain, X=Xtrain, Xpred=Xtest,
M=M, PPM=TRUE,
meanModel = 1,
similarity_function=1,
consim=1,
calibrate=0,
simParms=simParms,
modelPriors = modelPriors,
draws=niter, burn=nburn, thin=nthin,
mh=mh)
oldpar <- par(no.readonly = TRUE)
par(mfrow=c(1,2))
plot(Xtrain[,1], Ytrain, ylab="weight", xlab="length", pch=20)
points(Xtrain[,1], apply(out2$fitted.values,2,mean), col='red', pch=2, cex=1)
points(Xtrain[,1], apply(out1$fitted.values,2,mean), col='blue', pch=3, cex=1)
legend(x="topleft",legend=c("Observed","PPM","PPMx"),
col=c("black","red","blue", "green"),pch=c(20,2,3,4))
plot(Xtest[,1], Ytest, ylab="weight", xlab="length",pch=20)
points(Xtest[,1], apply(out2$ppred,2,mean), col='red',pch=2, cex=1)
points(Xtest[,1], apply(out1$ppred,2,mean), col='blue',pch=3, cex=1)
legend(x="topleft",legend=c("Observed","PPM","PPMx"),
col=c("black","red","blue","green"),pch=c(20,2,3,4))
par(oldpar)
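# Illustrative comparison of the two fits using the returned output
mspe <- c(PPM  = mean((Ytest - apply(out2$ppred, 2, mean))^2),
          PPMx = mean((Ytest - apply(out1$ppred, 2, mean))^2))
mspe
c(PPM = out2$lpml, PPMx = out1$lpml)   # larger lpml indicates better fit
c(PPM = out2$WAIC, PPMx = out1$WAIC)   # smaller WAIC indicates better fit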