R: Partition outlier probabilities

partProb {probout}

R Documentation

Partition outlier probabilities

Description

Assigns outlier probabilities to the partitions by fitting an exponential distribution to a nonparametric outlier statistic for simulated data or partition centroids.

Usage

partProb( simData, method = c("intrinsic","distance","logdensity","distdens",
          "density"), shrink = 1, nproj = 1000, seed = NULL)

Arguments

simData

Output of a call to simData, which includes the partition centroids and, optionally, simulated data.

method

One of the following options:

`"intrinsic"`	:	outlier statistic applied to simulation data (centroids if no simulation)
`"distance"`	:	outlier statistic applied to distances between NN partitions
`"logdensity"`	:	outlier statistic applied to differences in log density between NN partitions
`"distdens"`	:	outlier statistic applied to a matrix consisting of the `"distance"` and `"logdensity"` values
`"density"`	:	outlier statistic applied to smallest/largest ratios of density between NN partitions

The default is to use the "intrinsic" method.

shrink

Shrinkage parameter for outlier detection data. The offsets from simData are scaled by this factor before adding them to the partition centroids as data for outlier detection. The default value is shrink = 1, so that no shrinkage is applied to simulation offsets.
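The scaling described above can be sketched in a few lines of base R. This is only an illustration: the objects `centroids`, `offsets` and `part`, and the helper `shrunk_data`, are hypothetical names introduced here and do not reflect the actual structure of the simData output.

```r
# Hypothetical inputs: 3 centroids in 2 dimensions and 4 simulated offsets each.
centroids <- matrix(c( 0, 0,
                       5, 5,
                      10, 0), ncol = 2, byrow = TRUE)
set.seed(1)
part    <- rep(1:3, each = 4)                        # partition of each simulated point
offsets <- matrix(rnorm(length(part) * 2), ncol = 2) # offset of each point from its centroid

# Hypothetical helper: scale the offsets by 'shrink', then add them to the centroids.
shrunk_data <- function(centroids, offsets, part, shrink = 1) {
  centroids[part, , drop = FALSE] + shrink * offsets
}

full <- shrunk_data(centroids, offsets, part, shrink = 1)    # offsets used unchanged
half <- shrunk_data(centroids, offsets, part, shrink = 0.5)  # points pulled halfway in
none <- shrunk_data(centroids, offsets, part, shrink = 0)    # points collapse onto centroids
```

Under this reading, shrink = 1 leaves the simulated points where they are, and smaller values pull each point toward its partition centroid.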

nproj

The number of random projections used to obtain the outlier statistic. Applies when the data are multivariate or method = "distdens". The default is nproj = 1000.
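To give a feel for how random projections can produce an outlier statistic, here is a generic projection-based outlyingness measure in base R. It is a sketch under stated assumptions, not the statistic that partProb computes: the helper `proj_outlyingness` and its use of the median and MAD are choices made for this illustration only.

```r
# Generic random-projection outlyingness (illustrative; not probout's code).
proj_outlyingness <- function(x, nproj = 1000) {
  x <- as.matrix(x)
  # One random direction per column, normalized to unit length
  u <- matrix(rnorm(ncol(x) * nproj), nrow = ncol(x))
  u <- sweep(u, 2, sqrt(colSums(u^2)), "/")
  z <- x %*% u                         # n x nproj matrix of projected values
  med <- apply(z, 2, median)           # robust center of each projection
  sc  <- apply(z, 2, mad)              # robust scale of each projection
  dev <- sweep(abs(sweep(z, 2, med, "-")), 2, sc, "/")
  apply(dev, 1, max)                   # largest standardized deviation per observation
}

set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2), c(10, 10))  # last row is a planted outlier
stat <- proj_outlyingness(x, nproj = 200)
which.max(stat)                        # index of the most outlying observation
```

More projections explore more directions and so make the statistic less dependent on the particular random draw, at a proportional cost in computation.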

seed

An optional integer argument to set.seed for reproducible outlier statistics. By default the current seed will be used. Reproducibility can also be achieved by calling set.seed before calling partProb.

Details

"logdensity" is generally preferred over "density", because log-density values that are negative and large in magnitude cannot be distinguished numerically once they are converted to density values: they all underflow to zero.
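The numerical issue can be seen directly in base R. The log-density values below are made up for illustration; nothing here calls probout.

```r
# Two hypothetical log densities that differ by 100 on the log scale
logdens <- c(-800, -900)

# On the density scale both values underflow to exactly 0 in double precision
dens <- exp(logdens)
dens[1] == dens[2]     # TRUE: the two partitions look identical

# The difference is still well defined on the log scale
diff(logdens)          # -100

# A ratio of the underflowed densities is undefined (0 / 0)
dens[2] / dens[1]      # NaN
```

Differences in log density keep the information that ratios of densities lose in this situation.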

Value

A vector of outlier probabilities, one for each partition, obtained by fitting an exponential distribution to the outlier statistic.
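One plausible way to turn an outlier statistic into probabilities with an exponential fit is shown below. It is a minimal sketch, assuming the probability is the fitted exponential distribution function evaluated at each statistic value; the documentation does not spell out the exact computation, and `stat` is simulated here rather than produced by probout.

```r
set.seed(1)
# Hypothetical outlier statistic: 50 typical values plus one extreme value
stat <- c(rexp(50, rate = 2), 6)

# Maximum-likelihood estimate of the exponential rate
rate_hat <- 1 / mean(stat)

# Fitted distribution function evaluated at each value of the statistic
prob <- pexp(stat, rate = rate_hat)

# Values of the statistic far in the tail receive probabilities close to 1
which(prob > 0.95)
```

Under this assumption, a threshold such as 0.95 flags the partitions whose statistic lies in the upper tail of the fitted distribution.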

References

C. Fraley, Estimating Outlier Probabilities for Large Datasets, 2017.

Examples


 set.seed(0)

 lead <- leader(faithful)
 nlead <- length(lead[[1]]$partitions)

# repeat multiple times to account for randomness
 ntimes <- 100
 probs <- matrix( NA, nlead, ntimes)
 for (i in 1:ntimes) {
    probs[,i] <- partProb( simData(lead[[1]]), method = "distance")
 }

# median probability for each partition
 partprobs <- apply( probs, 1, median)

 quantile(probs)

# plot leaders with outlier probability > .95
 plot( faithful[,1], faithful[,2], pch = 16, cex = .5,
       main = "red : leaders with outlier probability > .95")
 out <- partprobs > .95
 l <- lead[[1]]$leaders
 points( faithful[l[out],1], faithful[l[out],2], pch = 8, cex = 1, col = "red")

[Package probout version 1.1.2 Index]