oclust {oclust}    R Documentation

The OCLUST Algorithm

Description

oclust is a trimming method for model-based clustering. It iterates over candidate values for the number of outliers, up to maxO, and returns the parameters of the best model, defined as the one with the minimum KL divergence. If kuiper=TRUE, oclust calculates an approximate p-value using the Kuiper test and stops the algorithm once the p-value exceeds the specified threshold (pval).

Usage

oclust(
  X,
  maxO,
  G,
  grossOuts = NULL,
  modelNames = "VVV",
  mc.cores = 1,
  nmax = 1000,
  kuiper = FALSE,
  pval = 0.05,
  B = 100,
  verb = FALSE,
  scale = TRUE
)

Arguments

X

A matrix or data frame with n rows of observations and p columns

maxO

An upper bound for the number of outliers

G

The number of clusters

grossOuts

The indices of the initial outliers to remove. Default is NULL.

modelNames

The model to fit using the gpcm function in the mixture package. Default is "VVV" (unconstrained). If modelNames=NULL, all models are fitted for each subset at each iteration. The BIC chooses the best model for each subset.

mc.cores

Number of cores to use if running in parallel. Default is 1.

nmax

Maximum number of iterations for each EM algorithm. Decreasing nmax may speed up the algorithm at the cost of precision in the estimated log-likelihoods. Default is 1000.

kuiper

A logical specifying whether to use the Kuiper test (Kuiper, 1960) to stop the algorithm when the p-value exceeds the specified threshold. Default is FALSE.

pval

The p-value threshold for the Kuiper test. Only used when kuiper=TRUE. Default is 0.05.

B

Number of samples used to calculate the approximate p-value for the Kuiper test. Default is 100.

verb

A logical specifying whether to print the current iteration number. Default is FALSE.

scale

A logical specifying whether to centre and scale the data. Default is TRUE.
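
To illustrate what centring and scaling (the scale argument) does to a data matrix, the sketch below uses base R's scale() on a small made-up matrix. It assumes column-wise standardisation to mean 0 and unit standard deviation; the toy data and variable names are hypothetical, and this page does not state how oclust standardises the data internally.

```r
# Hypothetical toy data: 5 observations, 2 variables on very different scales
X <- cbind(height_cm = c(150, 160, 170, 180, 190),
           weight_kg = c(50, 60, 65, 80, 95))

# Centre each column to mean 0 and scale it to unit standard deviation
X_scaled <- scale(X, center = TRUE, scale = TRUE)

col_means <- colMeans(X_scaled)      # numerically zero for every column
col_sds   <- apply(X_scaled, 2, sd)  # one for every column
```

Standardising in this way keeps a variable measured in large units from dominating the fit purely because of its scale.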

Details

Gross outlier indices can be found with the findGrossOuts function.

N. H. Kuiper, Tests concerning random points on a circle, in: Nederl. Akad. Wetensch. Proc. Ser. A, Vol. 63, 1960, pp. 38–47.
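
For readers unfamiliar with the Kuiper test, the following base R sketch computes the Kuiper statistic V = D+ + D- for a sample against a reference distribution and approximates a p-value from B simulated samples. It is a generic illustration, not the package's implementation: the helper names kuiper_stat and kuiper_pvalue are hypothetical, and the beta reference distribution is chosen for illustration only.

```r
# Kuiper statistic V = D+ + D- for a sample x against a reference CDF
kuiper_stat <- function(x, cdf, ...) {
  n <- length(x)
  u <- cdf(sort(x), ...)                    # reference CDF at the order statistics
  d_plus  <- max(seq_len(n) / n - u)        # largest excess of empirical over reference
  d_minus <- max(u - (seq_len(n) - 1) / n)  # largest excess of reference over empirical
  d_plus + d_minus
}

# Monte Carlo p-value from B samples drawn under the reference distribution
kuiper_pvalue <- function(x, cdf, rdist, B = 100, ...) {
  v_obs <- kuiper_stat(x, cdf, ...)
  v_sim <- replicate(B, kuiper_stat(rdist(length(x), ...), cdf, ...))
  (1 + sum(v_sim >= v_obs)) / (B + 1)
}

set.seed(1)
x <- rbeta(200, shape1 = 2, shape2 = 5)     # sample drawn from the reference itself
p <- kuiper_pvalue(x, pbeta, rbeta, B = 100, shape1 = 2, shape2 = 5)
stop_now <- p > 0.05                        # stopping rule: p-value exceeds the threshold
```

A larger B gives a finer-grained p-value (the smallest attainable value here is 1 / (B + 1)) at the cost of more computation.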

Value

oclust returns a list of class oclust with the following components:

data

A list containing the raw and scaled data

numO

The predicted number of outliers

outliers

The most likely outliers in the optimal solution in order of likelihood

class

The classification for the optimal solution

model

The model selected for the optimal solution

G

The number of clusters

pi.g

The group proportions for the optimal solution

mu

The cluster means for the optimal solution

sigma

The cluster variances for the optimal solution

KL

The KL divergence for each iteration, with the first value being for the initial dataset with the gross outliers removed

allCand

All outlier candidates in order of likelihood
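
The relationship between KL and the selected solution can be sketched with made-up numbers. The vectors below are hypothetical stand-ins for the KL and allCand components, not output from oclust. The sketch assumes that entry i of KL corresponds to i - 1 candidates trimmed after the gross outliers were removed; whether numO also counts the gross outliers is not stated on this page.

```r
# Hypothetical KL values: kl[1] is for the data with gross outliers removed,
# kl[i] is after trimming i - 1 further candidate points
kl <- c(0.41, 0.30, 0.12, 0.05, 0.09, 0.20)

# Hypothetical candidate indices in order of likelihood
all_cand <- c(17, 203, 88, 341, 9)

best_iter <- which.min(kl)              # iteration with the minimum KL divergence
n_trimmed <- best_iter - 1              # candidates trimmed at that iteration
trimmed   <- head(all_cand, n_trimmed)  # the candidates trimmed in that solution
```

Plotting kl against the number of trimmed points is a quick way to check whether the minimum is well defined or sits on a flat stretch.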

Examples

## Not run: 
# Simulate a 4D dataset with two well-separated clusters
library(mvtnorm)
library(oclust)
set.seed(123)
data <- rbind(rmvnorm(250, rep(-3, 4), diag(4)),
              rmvnorm(250, rep(3, 4), diag(4)))
# Add outliers
noisy <- simOuts(data = data, alpha = 0.02, seed = 123)

# Find gross outliers
findGrossOuts(X = noisy, minPts = 10)

# Elbow between 5 and 10. Specify limits of graph
findGrossOuts(X = noisy, minPts = 10, xlim = c(5, 10))

# Elbow at 9
gross <- findGrossOuts(X = noisy, minPts = 10, elbow = 9)

# Run the algorithm
result <- oclust(X = noisy, maxO = 15, G = 2, grossOuts = gross,
                 modelNames = "EEE", mc.cores = 1, nmax = 50, kuiper = FALSE,
                 verb = TRUE, scale = TRUE)

## End(Not run)

[Package oclust version 0.2.0 Index]