cluster.diagnostic {MVR} | R Documentation |
Function for Plotting Summary Cluster Diagnostic Plots
Description
Plot similarity statistic profiles and the optimal joint clustering configuration for the means and the variances by group.
Plot quantile profiles of means and standard deviations by group and for each clustering configuration, to check that the distributions of first and second moments of the MVR-transformed data approach their respective null distributions under the optimal configuration found, assuming independence and normality of all the variables.
Usage
cluster.diagnostic(obj,
                   span = 0.75,
                   degree = 2,
                   family = "gaussian",
                   title = "Cluster Diagnostic Plots",
                   device = NULL,
                   file = "Cluster Diagnostic Plots",
                   path = getwd(),
                   horizontal = FALSE,
                   width = 8.5,
                   height = 11, ...)
Arguments
obj |
Object of class " |
title |
Title of the plot. Defaults to "Cluster Diagnostic Plots". |
span |
Span parameter of the |
degree |
Degree parameter of the |
family |
Family distribution in "gaussian", "symmetric" of the |
device |
Graphic display device in {NULL, "PS", "PDF"}. Defaults to NULL (standard output screen). Currently implemented graphic display devices are "PS" (Postscript) or "PDF" (Portable Document Format). |
file |
File name for output graphic. Defaults to "Cluster Diagnostic Plots". |
path |
Absolute path (without final (back)slash separator). Defaults to working directory path. |
horizontal |
|
width |
|
height |
|
... |
Generic arguments passed to other plotting functions. |
Details
In a plot of a similarity statistic profile, one checks the goodness of fit of the transformed data relative to the hypothesized underlying reference
distribution with mean 0 and standard deviation 1 (e.g. $N(0, 1)$). The red dashed line depicts the LOESS scatterplot smoother estimator.
The subroutine internally generates reference null distributions for computing the similarity statistic under each cluster configuration.
The optimal cluster configuration (indicated by the vertical red arrow) is found where the similarity statistic reaches its minimum plus/minus
one standard deviation (applying the conventional one-standard deviation rule). A smaller cluster number configuration indicates under-regularization,
while over-regularization starts to occur at larger numbers. This over/under-regularization must be viewed as a form of over/under-fitting
(see Dazard, J-E. and J. S. Rao (2012) for more details).
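The LOESS smoothing and the one-standard-deviation rule can be illustrated on a made-up profile. The sketch below is not the package's internal code: the profile values `sim_mean` and `sim_sd` are simulated placeholders, and the selection rule shown (the smallest number of clusters whose statistic lies within one standard deviation of the minimum) is one common reading of the rule, assumed here for illustration. Only `loess()` from the stats package is a real API; it is called with the same `span`, `degree` and `family` defaults listed in Usage.

```r
# Hypothetical similarity-statistic profile over 1..30 clusters
# (simulated values, NOT the output of mvr()).
set.seed(1234)
nc <- 1:30
sim_mean <- 0.5 * exp(-nc / 4) + 0.002 * nc + rnorm(length(nc), sd = 0.005)
sim_sd <- rep(0.01, length(nc))

# LOESS scatterplot smoother with the defaults listed in Usage.
fit <- loess(sim_mean ~ nc, span = 0.75, degree = 2, family = "gaussian")
smoothed <- predict(fit)

# Assumed one-standard-deviation rule: smallest cluster number whose
# statistic is within one SD of the observed minimum.
i_min <- which.min(sim_mean)
threshold <- sim_mean[i_min] + sim_sd[i_min]
nc_opt <- min(nc[sim_mean <= threshold])

# Profile, dashed red LOESS curve, and a vertical marker at the selected value.
plot(nc, sim_mean, type = "b",
     xlab = "Number of clusters", ylab = "Similarity statistic")
lines(nc, smoothed, col = "red", lty = 2)
abline(v = nc_opt, col = "red")
```

By construction, `nc_opt` can never exceed the position of the raw minimum, so this reading of the rule favors the more parsimonious configuration.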
The quantile diagnostic plots use empirical quantiles of the transformed means and standard deviations to check how
closely they are approximated by theoretical quantiles derived from a standard normal equal-mean/homoscedastic
model (solid green lines) under a given cluster configuration. To assess this goodness of fit of the transformed data, theoretical null distributions
of the mean and variance are derived from a standard normal equal-mean/homoscedastic model with independence of the first two moments,
i.e. assuming i.i.d. normality of the raw data. However, we do not require i.i.d. normality of the data in general: these theoretical null distributions are
just used here as convenient ones to draw from. Note that under the assumptions that the raw data is i.i.d. standard normal ($N(0, 1)$)
with independence of first two moments, the theoretical null distributions of means and standard deviations for each variable
are respectively: $N(0, \frac{1}{n})$ and $\sqrt{\frac{\chi_{n - G}^{2}}{n - G}}$, where $G$ denotes the number of sample groups.
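To make the two null distributions concrete, here is a small simulation sketch. It assumes raw data that really are i.i.d. $N(0, 1)$, takes $n$ to be the total number of samples per variable, and uses a pooled within-group standard deviation with $n - G$ degrees of freedom. The variable names (`p`, `n`, `G`, `grp`) are illustrative and are not taken from the package.

```r
set.seed(1234)
p <- 5000            # number of variables (illustrative)
n <- 6               # total number of samples per variable
G <- 2               # number of sample groups
grp <- gl(G, n / G)  # group labels: 1 1 1 2 2 2

# Raw data under the null: every entry i.i.d. N(0, 1).
x <- matrix(rnorm(p * n), nrow = p, ncol = n)

# Per-variable means and pooled within-group standard deviations.
emp_mean <- rowMeans(x)
ss_within <- Reduce(`+`, lapply(levels(grp), function(g) {
  xg <- x[, grp == g, drop = FALSE]
  rowSums((xg - rowMeans(xg))^2)
}))
emp_sd <- sqrt(ss_within / (n - G))

# Theoretical quantiles: N(0, 1/n) and sqrt(chi^2_{n-G} / (n - G)).
probs <- seq(0.01, 0.99, 0.01)
theo_mean <- qnorm(probs, mean = 0, sd = sqrt(1 / n))
theo_sd <- sqrt(qchisq(probs, df = n - G) / (n - G))

# Points close to the identity line indicate agreement with the null.
op <- par(mfrow = c(1, 2))
plot(theo_mean, quantile(emp_mean, probs), main = "Means")
abline(0, 1, col = "green")
plot(theo_sd, quantile(emp_sd, probs), main = "Standard deviations")
abline(0, 1, col = "green")
par(op)
```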
The optimal cluster configuration found is indicated by the most horizontal red curve. The single cluster configuration, corresponding to no transformation,
is the most vertical curve, while the largest cluster number configuration reaches horizontality. Notice how empirical quantiles of transformed
pooled means and standard deviations converge (from red to black) to the theoretical null distributions (solid green lines) for the optimal
configuration. One should see a convergence towards the target null, after which overfitting starts to occur (see Dazard, J-E. and J. S. Rao (2012)
for more details).
Both cluster diagnostic plots help determine (i) whether the minimum of the Similarity Statistic is observed within the range of clusters
(i.e. a large enough number of clusters has been accommodated), and (ii) whether the corresponding cluster configuration is a good fit.
If necessary, run the procedure again with a larger value of the nc.max parameter in the mvr as well as in the mvrt.test
functions until the minimum of the similarity statistic profile is reached.
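Check (i) can be phrased as a simple boundary test. The helper below is hypothetical (it is not part of MVR) and works on a plain numeric vector of similarity statistics ordered from the smallest to the largest cluster number tried. How to extract such a vector from a fitted object is not shown here.

```r
# Hypothetical helper: TRUE when the profile minimum sits at the largest
# cluster number tried, suggesting that nc.max should be increased.
min_at_upper_bound <- function(profile) {
  stopifnot(is.numeric(profile), length(profile) >= 1)
  which.min(profile) == length(profile)
}

min_at_upper_bound(c(0.9, 0.5, 0.3, 0.2))  # TRUE: still decreasing at the upper bound
min_at_upper_bound(c(0.9, 0.3, 0.2, 0.4))  # FALSE: the minimum is interior
```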
Option file is used only if device is specified (i.e. non-NULL).
Value
None. Displays the plots on the chosen device.
Acknowledgments
This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. This project was partially funded by the National Institutes of Health (P30-CA043703).
Note
End-user function.
Author(s)
"Jean-Eudes Dazard, Ph.D." jean-eudes.dazard@case.edu
"Hua Xu, Ph.D." huaxu77@gmail.com
"Alberto Santana, MBA." ahs4@case.edu
Maintainer: "Jean-Eudes Dazard, Ph.D." jean-eudes.dazard@case.edu
References
Dazard J-E. and J. S. Rao (2010). "Regularized Variance Estimation and Variance Stabilization of High-Dimensional Data." In JSM Proceedings, Section for High-Dimensional Data Analysis and Variable Selection. Vancouver, BC, Canada: American Statistical Association IMS - JSM, 5295-5309.
Dazard J-E., Hua Xu and J. S. Rao (2011). "R package MVR for Joint Adaptive Mean-Variance Regularization and Variance Stabilization." In JSM Proceedings, Section for Statistical Programmers and Analysts. Miami Beach, FL, USA: American Statistical Association IMS - JSM, 3849-3863.
Dazard J-E. and J. S. Rao (2012). "Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data." Comput. Statist. Data Anal. 56(7):2317-2333.
See Also
loess
(R package stats) Fit a polynomial surface determined by one or more numerical predictors, using local fitting.
Examples
## Not run:
#===================================================
# Loading the library and its dependencies
#===================================================
library("MVR")
#===================================================
# MVR package news
#===================================================
MVR.news()
#================================================
# MVR package citation
#================================================
citation("MVR")
#===================================================
# Loading of the Synthetic and Real datasets
# (see description of datasets)
#===================================================
data("Synthetic", "Real", package="MVR")
?Synthetic
?Real
#===================================================
# Mean-Variance Regularization (Real dataset)
# Multi-Group Assumption
# Assuming unequal variance between groups
# Without cluster usage
#===================================================
nc.min <- 1
nc.max <- 30
probs <- seq(0, 1, 0.01)
n <- 6
GF <- factor(gl(n = 2, k = n/2, length = n),
             ordered = FALSE,
             labels = c("M", "S"))
mvr.obj <- mvr(data = Real,
               block = GF,
               log = FALSE,
               nc.min = nc.min,
               nc.max = nc.max,
               probs = probs,
               B = 100,
               parallel = FALSE,
               conf = NULL,
               verbose = TRUE,
               seed = 1234)
#===================================================
# Summary Cluster Diagnostic Plots (Real dataset)
# Multi-Group Assumption
# Assuming unequal variance between groups
#===================================================
cluster.diagnostic(obj = mvr.obj,
                   title = "Cluster Diagnostic Plots\n(Real - Multi-Group Assumption)",
                   span = 0.75,
                   degree = 2,
                   family = "gaussian",
                   device = NULL,
                   horizontal = FALSE,
                   width = 8.5,
                   height = 11)
## End(Not run)