UNPaC_num_clust {UNPaC}    R Documentation

Unimodal Non-Parametric Cluster (UNPaC) Test for Estimating Number of Clusters

Description

UNPaC for estimating the number of clusters. Compares the cluster index (CI) from the original data to that produced by clustering a simulated ortho-unimodal reference distribution generated using a Gaussian copula. The CI is defined as the sum over clusters of the within-cluster sums of squares about the cluster means, divided by the total sum of squares. The number of clusters is chosen to maximize the difference between the data cluster index and the reference cluster indices, but additional rules are also implemented (see below). This method is described in Helgeson, Vock, and Bair (2021).
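As an illustration of the CI definition above, the following base R sketch computes the index for a k-means partition. The helper name cluster_index and the simulated data are assumptions made for this example; they are not part of the UNPaC package, and the package may compute the index differently internally.

```r
# Hypothetical helper (not part of UNPaC): cluster index for a given partition.
# CI = within-cluster sum of squares about the cluster means / total sum of squares.
cluster_index <- function(x, cluster) {
  x <- as.matrix(x)
  # Total sum of squares about the overall feature means
  total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  # Sum of squares about each cluster's own feature means, added over clusters
  within_ss <- sum(vapply(split(seq_len(nrow(x)), cluster), function(idx) {
    sum(scale(x[idx, , drop = FALSE], center = TRUE, scale = FALSE)^2)
  }, numeric(1)))
  within_ss / total_ss
}

# Two simulated groups in two dimensions (illustrative data only)
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))
fit <- kmeans(x, centers = 2)
ci <- cluster_index(x, fit$cluster)
```

For kmeans the same quantity is available as fit$tot.withinss / fit$totss. Values near 0 indicate tight, well-separated clusters, and values near 1 indicate that the partition explains little of the total variation.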

Usage

UNPaC_num_clust(
  x,
  k = 10,
  cluster.fun,
  nsim = 1000,
  cov = "glasso",
  rho = 0.02,
  scale = FALSE,
  center = FALSE,
  var_selection = FALSE,
  p.adjust = "none",
  gamma = 0.1,
  d.power = 1
)

Arguments

x

a dataset with n observations (rows) and p features (columns)

k

maximum number of clusters considered. (default=10)

cluster.fun

function used to cluster data. The function should return a list containing a component "cluster". Examples include kmeans and pam.

nsim

a numeric value specifying the number of unimodal reference distributions used for testing (default=1000)

cov

method used for approximating the covariance structure. Options include: "glasso" (see huge), "banded" (see band.chol.cv), and "est" (default="glasso")

rho

a regularization parameter used in the implementation of the graphical lasso. See documentation for lambda in huge. Not used if cov="est" or cov="banded" (default=0.02)

scale

should data be scaled such that each feature has variance equal to one prior to clustering (default=FALSE)

center

should data be centered such that each feature has mean equal to zero prior to clustering (default=FALSE)

var_selection

should dimension be reduced using the feature filtering procedure? See Details. (default=FALSE)

p.adjust

p-value adjustment method for additional feature filtering. See p.adjust for options. No adjustment is applied if p.adjust="none" (default="none").

gamma

threshold for feature filtering procedure. See description below. Not used if var_selection=FALSE (default=0.10)

d.power

Power used in estimating the log of the within-cluster dispersion for comparison to the Gap statistic. See clusGap. (default=1)
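The cluster.fun argument accepts any function that returns a list with a "cluster" component. The sketch below wraps hierarchical clustering in that shape. The wrapper name hclust_fun is hypothetical, and the calling convention (data first, number of clusters second, as with kmeans) is an assumption not confirmed by this page.

```r
# Hypothetical wrapper (not part of UNPaC): average-linkage hierarchical
# clustering exposed through a kmeans-like interface.
# Assumption: cluster.fun is called as cluster.fun(x, k).
hclust_fun <- function(x, k) {
  tree <- hclust(dist(x), method = "average")
  # Return a list with a "cluster" component holding integer labels 1..k
  list(cluster = cutree(tree, k = k))
}

# Quick check on two well-separated simulated groups (illustrative data only)
set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
res <- hclust_fun(x, 2)
```

If the assumed calling convention holds, such a wrapper could be passed as cluster.fun=hclust_fun in place of kmeans.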

Details

There are three options for the covariance matrix used in generating the Gaussian copula: sample covariance estimation, cov="est", which should be used if n>p; the graphical lasso, cov="glasso", which should be used if n<p; and k-banded covariance, cov="banded", which can be used if n<p and it can be assumed that features farther away in the ordering have weaker covariance. The graphical lasso is implemented using the huge function. When cov="banded" is selected the k-banded covariance Cholesky factor of Rothman, Levina, and Zhu (2010) is used to estimate the covariance matrix. Cross-validation is used for selecting the banding parameter. See documentation in band.chol.cv.
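The guidance above on choosing cov can be summarized in a small helper. This is only a sketch: the function choose_cov and its ordered_features flag are hypothetical, and treating n equal to p like the n<p case is an assumption, since that case is not addressed above.

```r
# Hypothetical helper (not part of UNPaC): pick a value for the cov argument
# from the dimensions of x, following the guidance in Details.
choose_cov <- function(x, ordered_features = FALSE) {
  n <- nrow(x)
  p <- ncol(x)
  if (n > p) {
    return("est")        # sample covariance when n > p
  }
  # Assumption: n == p is handled like n < p
  if (ordered_features) {
    "banded"             # features have a meaningful ordering
  } else {
    "glasso"             # graphical lasso otherwise
  }
}

cov_low_dim  <- choose_cov(matrix(0, nrow = 100, ncol = 50))
cov_high_dim <- choose_cov(matrix(0, nrow = 20, ncol = 50))
cov_ordered  <- choose_cov(matrix(0, nrow = 20, ncol = 50), ordered_features = TRUE)
```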

In high dimensional (n<p) settings a dimension reduction step can be implemented which selects features based on an F-test for difference in means across clusters. Features having a p-value less than a threshold gamma are retained. For additional feature filtering a p-value adjustment procedure (such as p.adjust="fdr") can be used. If no features are retained the resulting p-value for the cluster significance test is given as 1.
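The filtering step described above can be sketched in base R as a per-feature one-way ANOVA F-test followed by an optional p-value adjustment. The function filter_features is hypothetical and is not the package's internal implementation. For simplicity the example uses known group labels, whereas the procedure described above uses labels produced by clustering.

```r
# Hypothetical sketch (not the UNPaC internals): keep features whose F-test
# p-value for a difference in means across clusters falls below gamma.
filter_features <- function(x, cluster, gamma = 0.1, p.adjust.method = "none") {
  cluster <- factor(cluster)
  # One-way ANOVA F-test per feature (column)
  pvals <- apply(x, 2, function(feature) {
    anova(lm(feature ~ cluster))[["Pr(>F)"]][1]
  })
  # Optional multiplicity adjustment, e.g. "fdr"
  pvals <- p.adjust(pvals, method = p.adjust.method)
  which(pvals < gamma)
}

# Illustrative data: only the first three features separate the two groups
set.seed(1)
x <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20)
x[1:20, 1:3] <- x[1:20, 1:3] + 5
cluster <- rep(1:2, each = 20)
kept <- filter_features(x, cluster, gamma = 0.1, p.adjust.method = "fdr")
```

The returned vector holds the column indices of the retained features. An empty result corresponds to the case in which no features are retained.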

Value

The function returns a list with the following components:

Author(s)

Erika S. Helgeson, David Vock, Eric Bair

References

Examples

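  # Simulate 100 observations on 50 features from a standard normal
  test1 <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)
  # Shift the first 30 observations to mean 2 to create a second cluster
  test1[1:30, 1:50] <- rnorm(30 * 50, mean = 2)
  # Center each feature, then estimate the number of clusters (up to k = 5)
  test.edit <- scale(test1, center = TRUE, scale = FALSE)
  UNPaC_k <- UNPaC_num_clust(test.edit, k = 5, cluster.fun = kmeans, nsim = 100, cov = "est")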


[Package UNPaC version 1.1.1 Index]