| EAdet {modi} | R Documentation |
Epidemic Algorithm for detection of multivariate outliers in incomplete survey data
Description
In EAdet an epidemic is started at a center of the data. The epidemic
spreads out and infects neighbouring points (probabilistically or deterministically).
The last points infected are outliers. After running EAdet an imputation
with EAimp may be run.
Usage
EAdet(
data,
weights,
reach = "max",
transmission.function = "root",
power = ncol(data),
distance.type = "euclidean",
maxl = 5,
plotting = TRUE,
monitor = FALSE,
prob.quantile = 0.9,
random.start = FALSE,
fix.start,
threshold = FALSE,
deterministic = TRUE,
rm.missobs = FALSE,
verbose = FALSE
)
Arguments
data |
a data frame or matrix with data. |
weights |
a vector of positive sampling weights. |
reach |
if |
transmission.function |
form of the transmission function of distance d:
|
power |
sets |
distance.type |
distance type in function |
maxl |
maximum number of steps without infection. |
plotting |
if |
monitor |
if |
prob.quantile |
if mads fail, take this quantile absolute deviation. |
random.start |
if |
fix.start |
force epidemic to start at a specific observation. |
threshold |
infect all remaining points with infection probability above
the threshold |
deterministic |
if |
rm.missobs |
set |
verbose |
more output with verbose = TRUE. |
Details
The form and parameters of the transmission function should be chosen such that the
infection times have at least a range of 10. The default cutting point to decide on
outliers is the median infection time plus three times the mad of infection times.
A better cutpoint may be chosen by visual inspection of the cdf of infection times.
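The default cutpoint rule can be sketched in base R as follows. The infection times are invented for illustration and are not output of EAdet. The sketch is unweighted and uses the raw mad (constant = 1); whether EAdet weights or rescales the mad internally is not stated here, so treat both choices as assumptions.

```r
# Hypothetical infection times (illustration only, not produced by EAdet)
infection.time <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 14, 17)

# The Details ask for infection times spanning a range of at least 10
time.range <- diff(range(infection.time))

# Cutpoint: median infection time plus three times the mad
# (constant = 1 gives the unscaled mad; this is an assumption of the sketch)
cutpoint <- median(infection.time) + 3 * mad(infection.time, constant = 1)

# Observations infected later than the cutpoint are flagged as outliers
outliers <- which(infection.time > cutpoint)

# Visual inspection of the cdf of infection times may suggest a better cutpoint
plot(ecdf(infection.time), main = "cdf of infection times")
abline(v = cutpoint, lty = 2)
```

A long flat stretch of the cdf before the last few infections suggests placing the cutpoint inside that gap.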
EAdet calls the function EA.dist, which returns the counterprobabilities
of infection (a vector of size n * (n - 1) / 2!) and three parameters (sample
spatial median index, maximal distance to nearest neighbor and transmission distance =
reach) to EAdet. This vector may be too large to be passed
as an argument, in which case the memory size must be increased. Former versions of the
code used a global variable to store the distances in order to save memory.
Value
EAdet returns a list whose first component output is a sub-list
with the following components:
sample.size: Number of observations
discarded.observations: Indices of discarded observations
missing.observations: Indices of completely missing observations
number.of.variables: Number of variables
n.complete.records: Number of records without missing values
n.usable.records: Number of records with less than half of values missing (unusable observations are discarded)
medians: Component wise medians
mads: Component wise mads
prob.quantile: Use this quantile if mads fail, i.e. if one of the mads is 0
quantile.deviations: Quantile of absolute deviations
start: Starting observation
transmission.function: Input parameter
power: Input parameter
maxl: Maximum number of steps without infection
min.nn.dist: Maximal nearest neighbor distance
transmission.distance: d0
threshold: Input parameter
distance.type: Input parameter
deterministic: Input parameter
number.infected: Number of infected observations
cutpoint: Cutpoint of infection times for outlier definition
number.outliers: Number of outliers
outliers: Indices of outliers
duration: Duration of epidemic
computation.time: Elapsed computation time
initialisation.computation.time: Elapsed computation time for standardisation and calculation of distance matrix
The further components returned by EAdet are:
infected: Indicator of infection
infection.time: Time of infection
outind: Indicator of outliers
Author(s)
Beat Hulliger
References
Béguin, C. and Hulliger, B. (2004) Multivariate outlier detection in incomplete survey data: the epidemic algorithm and transformed rank correlations, JRSS-A, 167, Part 2, pp. 275-294.
See Also
EAimp for imputation with the Epidemic Algorithm.
Examples
data(bushfirem, bushfire.weights)
det.res <- EAdet(bushfirem, bushfire.weights)
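The components listed under Value can be inspected after the call. The sketch below assumes that the modi package is installed and that outind can be coerced to a logical vector; it is skipped otherwise. Plotting is switched off only to keep the example non-interactive.

```r
# Runs only if modi is available; component names are taken from the Value section.
if (requireNamespace("modi", quietly = TRUE)) {
  library(modi)
  data(bushfirem, bushfire.weights)
  det.res <- EAdet(bushfirem, bushfire.weights, plotting = FALSE)

  # Summary entries of the sub-list 'output'
  print(det.res$output$cutpoint)
  print(det.res$output$number.outliers)
  print(det.res$output$outliers)

  # Per-observation results
  print(summary(det.res$infection.time))
  # Assumption: outind is logical or coercible to logical
  print(which(as.logical(det.res$outind)))
}
```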