oclust {oclust} | R Documentation |
The OCLUST Algorithm
Description
oclust is a trimming method in model-based clustering. It iterates over possible values for the number of outliers and returns the parameters of the best model, defined as the one with the minimum Kullback-Leibler (KL) divergence. If kuiper=TRUE, oclust calculates an approximate p-value using the Kuiper test and stops the algorithm if the p-value exceeds the specified threshold.
Usage
oclust(
X,
maxO,
G,
grossOuts = NULL,
modelNames = "VVV",
mc.cores = 1,
nmax = 1000,
kuiper = FALSE,
pval = 0.05,
B = 100,
verb = FALSE,
scale = TRUE
)
Arguments
X |
A matrix or data frame with n rows of observations and p columns |
maxO |
An upper bound for the number of outliers |
G |
The number of clusters |
grossOuts |
The indices of the initial outliers to remove. Default is NULL. |
modelNames |
The model to fit using the gpcm function in the mixture package. Default is "VVV" (unconstrained). If modelNames=NULL, all models are fitted for each subset at each iteration. The BIC chooses the best model for each subset. |
mc.cores |
Number of cores to use if running in parallel. Default is 1 |
nmax |
Maximum number of iterations for each EM algorithm. Decreasing nmax may speed up the algorithm at the cost of less precise log-likelihoods. |
kuiper |
A logical specifying whether to use the Kuiper test (Kuiper, 1960) to stop the algorithm when the p-value exceeds the specified threshold. Default is FALSE. |
pval |
The p-value threshold for the Kuiper test. Default is 0.05. |
B |
Number of samples to calculate the approximate p-value. Default is 100. |
verb |
A logical specifying whether to print the current iteration number. Default is FALSE |
scale |
A logical specifying whether to centre and scale the data. Default is TRUE |
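As a rough illustration of what centring and scaling does to X, the base R sketch below standardises each column to mean 0 and standard deviation 1. It assumes the preprocessing behaves like base R's scale() with default arguments, which this page does not confirm; the simulated matrix is purely illustrative.

```r
# Illustrative data: two columns on very different scales.
set.seed(1)
X <- cbind(rnorm(100, mean = 5, sd = 2),
           rnorm(100, mean = -3, sd = 10))

# Assumption: scale=TRUE is comparable to base R's scale() defaults,
# i.e. subtract each column mean and divide by each column sd.
X_scaled <- scale(X)

col_means <- colMeans(X_scaled)      # approximately 0 for every column
col_sds   <- apply(X_scaled, 2, sd)  # approximately 1 for every column
```

Standardising in this way keeps a variable with a large variance from dominating the fitted clusters.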
Details
Gross outlier indices can be found with the findGrossOuts function.
N. H. Kuiper, Tests concerning random points on a circle, in: Nederl. Akad. Wetensch. Proc. Ser. A, Vol. 63, 1960, pp. 38–47.
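To make the roles of pval and B more concrete, here is a generic base R sketch of a Kuiper statistic with a Monte Carlo p-value. It is not the package's internal code: kuiper_stat and kuiper_pvalue are hypothetical helper names, and the standard normal null distribution is chosen only for illustration, since this page does not state which distribution oclust tests against.

```r
# Kuiper statistic V = D+ + D- for a sample against a hypothesised CDF.
# (Hypothetical helper, not part of the oclust package.)
kuiper_stat <- function(x, cdf = punif, ...) {
  n <- length(x)
  u <- sort(cdf(x, ...))                     # probability integral transform
  d_plus  <- max(seq_len(n) / n - u)         # largest gap above the CDF
  d_minus <- max(u - (seq_len(n) - 1) / n)   # largest gap below the CDF
  d_plus + d_minus
}

# Approximate p-value from B samples simulated under the null.
# (Hypothetical helper, not part of the oclust package.)
kuiper_pvalue <- function(x, cdf = punif, B = 100, ...) {
  n <- length(x)
  v_obs <- kuiper_stat(x, cdf, ...)
  # Under the null, cdf(x) is Uniform(0, 1), so simulate uniforms directly.
  v_sim <- replicate(B, kuiper_stat(runif(n)))
  (1 + sum(v_sim >= v_obs)) / (B + 1)
}

set.seed(1)
# Sample drawn from the null: the p-value is typically not small.
p_null <- kuiper_pvalue(rnorm(200), cdf = pnorm, B = 100)
# Sample shifted away from the null: the p-value is small.
p_alt <- kuiper_pvalue(rnorm(200, mean = 1), cdf = pnorm, B = 100)
```

In this sketch a larger B gives a finer-grained p-value at the cost of more simulation, and a p-value above the chosen threshold means the sample is consistent with the null distribution.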
Value
oclust returns a list of class oclust with the following components:
data |
A list containing the raw and scaled data |
numO |
The predicted number of outliers |
outliers |
The most likely outliers in the optimal solution in order of likelihood |
class |
The classification for the optimal solution |
model |
The model selected for the optimal solution |
G |
The number of clusters |
pi.g |
The group proportions for the optimal solution |
mu |
The cluster means for the optimal solution |
sigma |
The cluster variances for the optimal solution |
KL |
The KL divergence for each iteration, with the first value being for the initial dataset with the gross outliers removed |
allCand |
All outlier candidates in order of likelihood |
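The following sketch shows one way the returned components might be used. So that it runs on its own, it builds a hypothetical stand-in list with invented values rather than calling oclust, and it assumes that outliers holds row indices into X, which this page does not state explicitly.

```r
# Hypothetical stand-in for an oclust result; all values are invented
# for illustration and only a few documented components are included.
set.seed(1)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)
result <- list(numO = 2,
               outliers = c(7, 3),
               KL = c(0.30, 0.12, 0.05))

# Assumption: `outliers` contains row indices of X.
clean <- X[-result$outliers, , drop = FALSE]

# Position of the smallest KL divergence in the per-iteration vector.
best_iter <- which.min(result$KL)
```

With a real fit, the same accessors (result$numO, result$outliers, result$KL) would be applied to the object returned by oclust.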
Examples
## Not run:
#simulate 4D dataset
library(mvtnorm)
set.seed(123)
data<-rbind(rmvnorm(250,rep(-3,4),diag(4)),
rmvnorm(250,rep(3,4),diag(4)))
#add outliers
noisy<-simOuts(data=data,alpha=0.02,seed=123)
#Find gross outliers
findGrossOuts(X=noisy,minPts=10)
#Elbow between 5 and 10. Specify limits of graph
findGrossOuts(X=noisy,minPts=10,xlim=c(5,10))
#Elbow at 9
gross<-findGrossOuts(X=noisy,minPts=10,elbow=9)
#run algorithm
result<-oclust(X=noisy,maxO=15,G=2,grossOuts = gross,
modelNames = "EEE",mc.cores=1,nmax=50,kuiper=FALSE,
verb=TRUE,scale=TRUE)
## End(Not run)