gen_loc_outl {SpatialBSS} | R Documentation |
Contamination with Local Outliers
Description
Generates synthetic local outliers and contaminates a given p-variate random field by swapping observations based on the first principal component score.
Usage
gen_loc_outl(x, coords, alpha = 0.05,
neighborhood_type = c("radius", "fixed_n"),
radius = NULL,
neighborhood_size = NULL,
swap_order = c("regular", "reverse", "random"))
Arguments
x |
a numeric matrix of dimension |
coords |
a numeric matrix or data frame with dimension |
alpha |
a numeric value between 0 and 1 determining the proportion of the contaminated observations. |
neighborhood_type |
a string determining the type of neighborhood, either 'radius' or 'fixed_n' (see Details). |
radius |
a positive numeric value defining the size of the radius when the neighborhood_type is 'radius'. |
neighborhood_size |
a positive integer defining the number of points in each neighborhood when the neighborhood_type is 'fixed_n'. |
swap_order |
a string to determine which swap order is used. Either 'regular', 'reverse' or 'random' (see Details). |
Details
gen_loc_outl generates local outliers by swapping the most extreme and the least extreme observations based on the first principal component score, under the condition that at most one outlier lies in each neighborhood. For each location $s_i$, the neighborhood $N_i$ is defined based on the parameter neighborhood_type. When neighborhood_type is 'radius', the neighborhood $N_i$ contains all locations $s_j$ for which the Euclidean norm $\| s_i - s_j \|$ is within $r$, where $r$ is determined by the parameter radius. When neighborhood_type is 'fixed_n', the neighborhood $N_i$ contains the nearest locations of $s_i$, where the number of locations is determined by the parameter neighborhood_size. For more details see Ernst & Haesbroeck (2017).
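The two neighborhood types described above can be illustrated with a small base-R sketch. This is not the package's internal code: the helper names `radius_neighborhoods` and `fixed_n_neighborhoods` are hypothetical, and the strict inequality as well as the exclusion of the point itself from its own neighborhood are assumptions made only for illustration.

```r
# Hypothetical helpers for illustration; not part of SpatialBSS.

# Radius neighborhoods: indices of all other points closer than `radius`.
# Assumption: strict inequality, and a point is not its own neighbor.
radius_neighborhoods <- function(coords, radius) {
  d <- as.matrix(dist(coords))  # pairwise Euclidean distances
  lapply(seq_len(nrow(d)), function(i) {
    setdiff(which(d[i, ] < radius), i)
  })
}

# Fixed-size neighborhoods: indices of the `n_neighbors` closest other points.
fixed_n_neighborhoods <- function(coords, n_neighbors) {
  d <- as.matrix(dist(coords))
  lapply(seq_len(nrow(d)), function(i) {
    ord <- order(d[i, ])            # nearest first
    head(ord[ord != i], n_neighbors)
  })
}

# Four toy locations: three close together on a line and one far away.
coords <- cbind(x = c(0, 0.3, 1, 5), y = c(0, 0, 0, 5))
nb_r <- radius_neighborhoods(coords, radius = 0.5)
nb_k <- fixed_n_neighborhoods(coords, n_neighbors = 2)
```

With a radius neighborhood an isolated point can end up with no neighbors at all, whereas a fixed-size neighborhood always has the same number of members regardless of how far away they are.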
After calculating the neighborhoods, the local outliers are generated following Ernst & Haesbroeck (2017) and Harris et al. (2014) using the steps:
- Sort the observations from highest to lowest by their principal component analysis (PCA) scores of the first component (PC-1).
- Set $k$ to be $\alpha n / 2$ rounded to the nearest integer, where $n$ is the number of observations, and select the set of local outlier points $S$ by finding the $k$ observations with the highest PC-1 values and the $k$ observations with the lowest PC-1 values, under the condition that each neighborhood contains at most one point of $S$.
- Form the sets $X_{\text{large}}$, which contains the $k$ observations with the largest PC-1 values of the outlier points $S$, and $X_{\text{small}}$, which contains the $k$ observations with the smallest PC-1 values of the outlier points $S$. Generate the local outliers by swapping $X_{\text{small}, i}$ with its counterpart in $X_{\text{large}}$, $i = 1, \ldots, k$. The parameter swap_order defines how the sets $X_{\text{large}}$ and $X_{\text{small}}$ are ordered.
If the parameter swap_order is 'regular', $X_{\text{large}}$ and $X_{\text{small}}$ are sorted by PC-1 score from smallest to largest.
If the parameter swap_order is 'reverse', $X_{\text{large}}$ is sorted from largest to smallest and $X_{\text{small}}$ from smallest to largest.
If the parameter swap_order is 'random', $X_{\text{large}}$ and $X_{\text{small}}$ are in random order.
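The steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the implementation used by gen_loc_outl: the helper names `pick` and `order_sets` are hypothetical, the neighborhood condition is approximated by requiring that no selected point lies within the radius of another selected point, and the ordered sets are paired element by element.

```r
set.seed(1)
n <- 200; p <- 3; alpha <- 0.05; radius <- 0.5
x <- matrix(rnorm(n * p), n, p)            # toy p-variate data
coords <- matrix(runif(n * 2) * 20, n, 2)  # toy sample locations
d <- as.matrix(dist(coords))               # pairwise Euclidean distances

# Step 1: PC-1 scores of the observations.
pc1 <- prcomp(x)$x[, 1]

# Step 2: number of points taken from each tail.
k <- round(alpha * n / 2)

# Greedy selection (assumption): walk through the candidates in order and
# keep a point only if it is at least `radius` away from all points kept so far.
pick <- function(candidates, k, taken) {
  chosen <- integer(0)
  for (i in candidates) {
    if (length(chosen) == k) break
    if (all(d[i, c(taken, chosen)] >= radius)) chosen <- c(chosen, i)
  }
  chosen
}
large_idx <- pick(order(pc1, decreasing = TRUE), k, integer(0))
small_idx <- pick(order(pc1), k, large_idx)

# Step 3: order the two sets according to swap_order.
order_sets <- function(large_idx, small_idx, swap_order) {
  switch(swap_order,
    regular = list(large = large_idx[order(pc1[large_idx])],
                   small = small_idx[order(pc1[small_idx])]),
    reverse = list(large = large_idx[order(pc1[large_idx], decreasing = TRUE)],
                   small = small_idx[order(pc1[small_idx])]),
    random  = list(large = large_idx[sample.int(length(large_idx))],
                   small = small_idx[sample.int(length(small_idx))]))
}
sets <- order_sets(large_idx, small_idx, "regular")

# Swap the paired observations (element-wise pairing is an assumption).
x_cont <- x
x_cont[sets$large, ] <- x[sets$small, ]
x_cont[sets$small, ] <- x[sets$large, ]
is_outlier <- seq_len(n) %in% c(large_idx, small_idx)
```

Because observations are only exchanged between locations, the contaminated data contain exactly the same values as the original data; only their spatial arrangement changes, which is what makes the resulting outliers local rather than global.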
Value
gen_loc_outl returns a data.frame containing the contaminated fields as the first columns. The column following them contains a logical indicator of whether the observation is an outlier.
Note
This function is a modified version of code originally provided by M. Ernst and G. Haesbroeck.
References
Ernst, M., & Haesbroeck, G. (2017). Comparison of local outlier detection techniques in spatial multivariate data. Data Mining and Knowledge Discovery, 31, 371-399. doi:10.1007/s10618-016-0471-0
Harris, P., Brunsdon, C., Charlton, M., Juggins, S., & Clarke, A. (2014). Multivariate spatial outlier detection using robust geographically weighted methods. Mathematical Geosciences, 46, 1-31. doi:10.1007/s11004-013-9491-0
See Also
Examples
# simulate coordinates
coords <- runif(1000 * 2) * 20
dim(coords) <- c(1000, 2)
coords_df <- as.data.frame(coords)
names(coords_df) <- c("x", "y")

# simulate random field
if (!requireNamespace('gstat', quietly = TRUE)) {
  message('Please install the package gstat to run the example code.')
} else {
  library(gstat)
  model_1 <- gstat(formula = z ~ 1, locations = ~ x + y, dummy = TRUE, beta = 0,
                   model = vgm(psill = 0.025, range = 1, model = 'Exp'), nmax = 20)
  model_2 <- gstat(formula = z ~ 1, locations = ~ x + y, dummy = TRUE, beta = 0,
                   model = vgm(psill = 0.025, range = 1, kappa = 2, model = 'Mat'),
                   nmax = 20)
  model_3 <- gstat(formula = z ~ 1, locations = ~ x + y, dummy = TRUE, beta = 0,
                   model = vgm(psill = 0.025, range = 1, model = 'Gau'), nmax = 20)
  field_1 <- predict(model_1, newdata = coords_df, nsim = 1)$sim1
  field_2 <- predict(model_2, newdata = coords_df, nsim = 1)$sim1
  field_3 <- predict(model_3, newdata = coords_df, nsim = 1)$sim1
  field <- cbind(field_1, field_2, field_3)

  # Contaminate the data with 5 % local outliers using radius neighborhoods
  # and regular swap_order.
  field_cont <- gen_loc_outl(field, coords, alpha = 0.05,
                             neighborhood_type = "radius",
                             radius = 0.5, swap_order = "regular")

  # Contaminate the data with 10 % local outliers using fixed_n neighborhoods
  # and reverse swap_order.
  field_cont2 <- gen_loc_outl(field, coords, alpha = 0.1,
                              neighborhood_type = "fixed_n",
                              neighborhood_size = 10, swap_order = "reverse")
}