R: The OCLUST Algorithm

oclust {oclust}

R Documentation

The OCLUST Algorithm

Description

oclust is a trimming method for model-based clustering. It iterates over candidate values for the number of outliers and returns the parameters of the best model, chosen as the one with the minimum Kullback-Leibler (KL) divergence. If kuiper=TRUE, oclust calculates an approximate p-value using the Kuiper test and stops the algorithm once the p-value exceeds the specified threshold.

Usage

oclust(
  X,
  maxO,
  G,
  grossOuts = NULL,
  modelNames = "VVV",
  mc.cores = 1,
  nmax = 1000,
  kuiper = FALSE,
  pval = 0.05,
  B = 100,
  verb = FALSE,
  scale = TRUE
)

Arguments

`X`	A matrix or data frame with n rows of observations and p columns.
`maxO`	An upper bound for the number of outliers.
`G`	The number of clusters.
`grossOuts`	The indices of the initial outliers to remove. Default is NULL.
`modelNames`	The model to fit using the gpcm function in the mixture package. Default is "VVV" (unconstrained). If modelNames=NULL, all models are fitted for each subset at each iteration, and the best model for each subset is chosen by BIC.
`mc.cores`	Number of cores to use if running in parallel. Default is 1.
`nmax`	Maximum number of iterations for each EM algorithm. Decreasing nmax may speed up the algorithm at the cost of precision in the computed log-likelihoods.
`kuiper`	A logical specifying whether to use the Kuiper test (Kuiper, 1960) to stop the algorithm when the p-value exceeds the specified threshold. Default is FALSE.
`pval`	The p-value threshold for the Kuiper test. Default is 0.05.
`B`	Number of samples used to calculate the approximate p-value. Default is 100.
`verb`	A logical specifying whether to print the current iteration number. Default is FALSE.
`scale`	A logical specifying whether to centre and scale the data. Default is TRUE.
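
As an illustration of what centring and scaling does to a data matrix, the sketch below uses base R's scale() function on made-up data. Whether oclust performs exactly this transformation internally when scale=TRUE is an assumption here, not something this page states; the variable names are hypothetical.

```r
# Illustration only: centre and scale a small made-up matrix with base R.
# It is an assumption that scale=TRUE in oclust behaves like base::scale().
set.seed(1)
X <- cbind(height = rnorm(50, mean = 170, sd = 10),
           weight = rnorm(50, mean = 70, sd = 15))

# Subtract each column mean and divide by each column standard deviation
X_scaled <- scale(X, center = TRUE, scale = TRUE)

# After scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(X_scaled)
col_sds <- apply(X_scaled, 2, sd)
print(round(col_means, 10))
print(round(col_sds, 10))
```

Scaling puts all p columns on a comparable footing, which matters when the variables are measured in very different units.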

Details

Gross outlier indices can be found with the findGrossOuts function.

N. H. Kuiper, Tests concerning random points on a circle, in: Nederl. Akad. Wetensch. Proc. Ser. A, Vol. 63, 1960, pp. 38–47.

Value

oclust returns a list of class oclust with the following components:

`data`	A list containing the raw and scaled data
`numO`	The predicted number of outliers
`outliers`	The most likely outliers in the optimal solution in order of likelihood
`class`	The classification for the optimal solution
`model`	The model selected for the optimal solution
`G`	The number of clusters
`pi.g`	The group proportions for the optimal solution
`mu`	The cluster means for the optimal solution
`sigma`	The cluster variances for the optimal solution
`KL`	The KL divergence for each iteration, with the first value being for the initial dataset with the gross outliers removed
`allCand`	All outlier candidates in order of likelihood
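
The KL component can be read as follows. This is a minimal sketch on a hand-made list, not real oclust output: the KL values are invented, and it assumes that each entry after the first corresponds to trimming one more point and that kuiper=FALSE, so the minimum KL divergence determines the chosen solution.

```r
# Hypothetical values shaped like the documented `KL` element.
# The first entry is for the initial dataset with gross outliers removed.
mock_result <- list(KL = c(0.90, 0.50, 0.20, 0.35, 0.60))

# Position of the minimum KL divergence
best_iter <- which.min(mock_result$KL)

# Assumption: one additional point is trimmed per subsequent entry,
# so the number trimmed after the initial step is the position minus 1
extra_trimmed <- best_iter - 1

print(best_iter)
print(extra_trimmed)
```

With a real fit, the same indexing would be applied to result$KL, and result$numO reports the predicted number of outliers directly.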

Examples

## Not run: 
# Simulate a 4D dataset with two well-separated clusters
library(oclust)
library(mvtnorm)
set.seed(123)
data <- rbind(rmvnorm(250, rep(-3, 4), diag(4)),
              rmvnorm(250, rep(3, 4), diag(4)))

# Add outliers
noisy <- simOuts(data = data, alpha = 0.02, seed = 123)

# Find gross outliers
findGrossOuts(X = noisy, minPts = 10)

# Elbow between 5 and 10: narrow the limits of the graph
findGrossOuts(X = noisy, minPts = 10, xlim = c(5, 10))

# Elbow at 9
gross <- findGrossOuts(X = noisy, minPts = 10, elbow = 9)

# Run the algorithm
result <- oclust(X = noisy, maxO = 15, G = 2, grossOuts = gross,
                 modelNames = "EEE", mc.cores = 1, nmax = 50,
                 kuiper = FALSE, verb = TRUE, scale = TRUE)

## End(Not run)
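
A variant with the Kuiper stopping rule enabled might look like the sketch below. It uses only the arguments documented above (kuiper, pval, B); the argument values are illustrative, the helper name run_kuiper_example is hypothetical, and grossOuts is left at its default of NULL. The call is guarded so that it only runs in an interactive session with both packages installed, since fitting may take some time.

```r
# Illustrative argument values for a run that can stop early
# through the Kuiper test; they are not recommendations.
kuiper_args <- list(
  maxO = 15, G = 2, modelNames = "EEE", nmax = 50,
  kuiper = TRUE, pval = 0.05, B = 100, verb = TRUE, scale = TRUE
)

# Hypothetical helper: simulate the same data as above and call oclust
run_kuiper_example <- function() {
  set.seed(123)
  clean <- rbind(mvtnorm::rmvnorm(250, rep(-3, 4), diag(4)),
                 mvtnorm::rmvnorm(250, rep(3, 4), diag(4)))
  noisy <- oclust::simOuts(data = clean, alpha = 0.02, seed = 123)
  do.call(oclust::oclust, c(list(X = noisy), kuiper_args))
}

# Only run interactively and when both packages are available
if (interactive() &&
    requireNamespace("oclust", quietly = TRUE) &&
    requireNamespace("mvtnorm", quietly = TRUE)) {
  result_kuiper <- run_kuiper_example()
  print(result_kuiper$numO)
}
```

With kuiper=TRUE the algorithm may stop before reaching maxO, so fewer iterations are evaluated than in the kuiper=FALSE run.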

[Package oclust version 0.2.0 Index]