gen_loc_outl {SpatialBSS}    R Documentation

Contamination with Local Outliers

Description

Generates synthetic local outliers and contaminates a given p-variate random field by swapping observations based on the first principal component score.

Usage

gen_loc_outl(x, coords, alpha = 0.05, 
             neighborhood_type = c("radius", "fixed_n"), 
             radius = NULL, 
             neighborhood_size = NULL, 
             swap_order = c("regular", "reverse", "random"))

Arguments

x

a numeric matrix of dimension c(n, p) where the p columns correspond to the entries of the random field and the n rows are the observations.

coords

a numeric matrix or data frame with dimension c(n, 2) containing the coordinates of the observations.

alpha

a numeric value between 0 and 1 determining the proportion of the contaminated observations.

neighborhood_type

a string determining the type of neighborhood. If 'radius', each neighborhood contains all points within the radius determined by the parameter radius. If 'fixed_n', each neighborhood contains a constant number of closest points, where the constant is determined by the parameter neighborhood_size. Default is 'radius'.

radius

a positive numeric value defining the size of the radius when neighborhood_type is 'radius'. If NULL, the radius defaults to 0.01*n.

neighborhood_size

a positive integer defining the number of points in each neighborhood when neighborhood_type is 'fixed_n'. If NULL, the number of points defaults to ceiling(0.01*n).

swap_order

a string to determine which swap order is used. Either 'regular' (default), 'reverse' or 'random'. See details.

Details

gen_loc_outl generates local outliers by swapping the most extreme and the least extreme observations based on the first principal component score, under the condition that at most one outlier lies in each neighborhood. For each location s_i, the neighborhood N_i is defined based on the parameter neighborhood_type. When neighborhood_type is 'radius', the neighborhood N_i contains all locations s_j for which the Euclidean norm ||s_i - s_j|| < r, where r is determined by the parameter radius. When neighborhood_type is 'fixed_n', the neighborhood N_i contains the m - 1 nearest locations of s_i, where m is determined by the parameter neighborhood_size. For more details see Ernst & Haesbroeck (2017).
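The two neighborhood definitions can be illustrated with a minimal sketch in base R. The helper neighborhoods() and its arguments r and m are hypothetical names used only for illustration; this is not the code used inside gen_loc_outl. It assumes that each neighborhood includes the location s_i itself, so that a 'fixed_n' neighborhood has m members in total.

```r
# Hypothetical helper, not part of SpatialBSS: returns a list whose i-th
# element holds the row indices of the neighborhood N_i.
neighborhoods <- function(coords, type = c("radius", "fixed_n"),
                          r = NULL, m = NULL) {
  type <- match.arg(type)
  d <- as.matrix(dist(coords))  # n x n matrix of Euclidean distances
  lapply(seq_len(nrow(d)), function(i) {
    if (type == "radius") {
      # all locations s_j with ||s_i - s_j|| < r (s_i itself included)
      unname(which(d[i, ] < r))
    } else {
      # assumption: s_i itself plus its m - 1 nearest locations
      order(d[i, ])[seq_len(m)]
    }
  })
}

# Four locations on a line; the last one is far away from the others.
coords <- cbind(x = c(0, 1, 2, 10), y = c(0, 0, 0, 0))
nb_radius <- neighborhoods(coords, "radius", r = 1.5)
nb_fixed <- neighborhoods(coords, "fixed_n", m = 2)
```

With the radius rule the isolated fourth location forms a neighborhood on its own, whereas the fixed-size rule always returns m indices regardless of how far away the nearest locations are.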

After calculating the neighborhoods, the local outliers are generated following Ernst & Haesbroeck (2017) and Harris et al. (2014) using the steps:

  1. Sort the observations from highest to lowest by their principal component analysis (PCA) scores of the first component (PC-1).

  2. Set k to be \alpha N/2 rounded to the nearest integer and select the set of local outlier points S^{out} by finding the k observations with the highest PC-1 values and the k observations with the lowest PC-1 values, under the condition that for all s_i, s_j \in S^{out} it holds that N_i \neq N_j.

  3. Form the sets X^{large}, which contains the k observations with the largest PC-1 values among the outlier points S^{out}, and X^{small}, which contains the k observations with the smallest PC-1 values among the outlier points S^{out}. Generate the local outliers by swapping X^{small,i} with X^{large,k+1-i}, i = 1, ..., k. The parameter swap_order defines how the sets X^{large} and X^{small} are ordered.

If the parameter swap_order is 'regular', X^{small} and X^{large} are sorted by PC-1 score from smallest to largest. If the parameter swap_order is 'reverse', X^{small} is sorted from largest to smallest and X^{large} from smallest to largest. If the parameter swap_order is 'random', X^{small} and X^{large} are in random order.
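The steps above can be sketched in base R for the 'regular' swap order. The function swap_sketch(), the greedy selection inside pick() and the column name outlier are assumptions made for illustration only; the package implementation may select and order the points differently. The argument nb is assumed to be a list of neighborhood index vectors, each containing its own location.

```r
# Hypothetical sketch, not the SpatialBSS implementation.
swap_sketch <- function(x, nb, alpha = 0.05) {
  n <- nrow(x)
  k <- round(alpha * n / 2)
  pc1 <- prcomp(x)$x[, 1]  # step 1: PC-1 scores

  # Step 2 (assumed greedy rule): walk through the candidates in order and
  # skip any whose neighborhood already contains a selected point.
  pick <- function(candidates, taken) {
    chosen <- integer(0)
    for (i in candidates) {
      if (length(chosen) == k) break
      if (!any(nb[[i]] %in% c(taken, chosen))) chosen <- c(chosen, i)
    }
    chosen
  }
  idx_large <- pick(order(pc1, decreasing = TRUE), integer(0))  # largest first
  idx_small <- pick(order(pc1), idx_large)                      # smallest first

  # Guard for the case where fewer than k points could be selected.
  k_eff <- min(length(idx_large), length(idx_small))
  idx_large <- idx_large[seq_len(k_eff)]
  idx_small <- idx_small[seq_len(k_eff)]

  # Step 3, 'regular' order: the i-th smallest is swapped with the i-th
  # largest, which is X^{large,k+1-i} when X^{large} is sorted ascending.
  out <- x
  out[idx_small, ] <- x[idx_large, ]
  out[idx_large, ] <- x[idx_small, ]
  data.frame(out, outlier = seq_len(n) %in% c(idx_small, idx_large))
}

set.seed(1)
n <- 200
coords <- matrix(runif(n * 2) * 20, ncol = 2)
x <- matrix(rnorm(n * 3), ncol = 3)
d <- as.matrix(dist(coords))
nb <- lapply(seq_len(n), function(i) unname(which(d[i, ] < 0.5)))
res <- swap_sketch(x, nb, alpha = 0.1)
```

Because observations are only exchanged between locations, the marginal distribution of each variable is unchanged; only the spatial arrangement is disturbed, which is what makes the contaminated points local rather than global outliers.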

Value

gen_loc_outl returns a data.frame containing the contaminated fields as the first p columns. Column p + 1 contains a logical indicator of whether the observation is an outlier.
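A short sketch of how the returned data.frame might be split into the contaminated field and the outlier indicator. It uses white noise instead of a simulated spatial field purely to stay self-contained, and it accesses the indicator by position because the column names are not stated here; the object names x_cont and is_outlier are illustrative.

```r
set.seed(1)
n <- 500
coords <- matrix(runif(n * 2) * 20, ncol = 2)
x <- matrix(rnorm(n * 3), ncol = 3)  # white noise stand-in for a random field

if (requireNamespace("SpatialBSS", quietly = TRUE)) {
  res <- SpatialBSS::gen_loc_outl(x, coords, alpha = 0.05,
                                  neighborhood_type = "fixed_n",
                                  neighborhood_size = 10)
  p <- ncol(x)
  x_cont <- as.matrix(res[, seq_len(p)])  # contaminated field, first p columns
  is_outlier <- res[[p + 1]]              # indicator stored in column p + 1
  print(table(is_outlier))
}
```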

Note

This function is a modified version of code originally provided by M. Ernst and G. Haesbroeck.

References

Ernst, M., & Haesbroeck, G. (2017). Comparison of local outlier detection techniques in spatial multivariate data. Data Mining and Knowledge Discovery, 31, 371-399. doi:10.1007/s10618-016-0471-0

Harris, P., Brunsdon, C., Charlton, M., Juggins, S., & Clarke, A. (2014). Multivariate spatial outlier detection using robust geographically weighted methods. Mathematical Geosciences, 46, 1-31. doi:10.1007/s11004-013-9491-0

See Also

gen_glob_outl

Examples

# simulate coordinates
coords <- runif(1000 * 2) * 20
dim(coords) <- c(1000, 2)
coords_df <- as.data.frame(coords)
names(coords_df) <- c("x", "y")
# simulate random field
if (!requireNamespace('gstat', quietly = TRUE)) {
  message('Please install the package gstat to run the example code.')
} else {
  library(gstat)
  model_1 <- gstat(formula = z ~ 1, locations = ~ x + y, dummy = TRUE, beta = 0, 
                   model = vgm(psill = 0.025, range = 1, model = 'Exp'), nmax = 20)
  model_2 <- gstat(formula = z ~ 1, locations = ~ x + y, dummy = TRUE, beta = 0, 
                   model = vgm(psill = 0.025, range = 1, kappa = 2, model = 'Mat'), 
                   nmax = 20)
  model_3 <- gstat(formula = z ~ 1, locations = ~ x + y, dummy = TRUE, beta = 0, 
                   model = vgm(psill = 0.025, range = 1, model = 'Gau'), nmax = 20)
                   
  field_1 <- predict(model_1, newdata = coords_df, nsim = 1)$sim1
  field_2 <- predict(model_2, newdata = coords_df, nsim = 1)$sim1
  field_3 <- predict(model_3, newdata = coords_df, nsim = 1)$sim1
  field <- cbind(field_1, field_2, field_3)
  
  # Contaminate the data with 5% local outliers using radius neighborhoods
  # and the regular swap_order.
  field_cont <- gen_loc_outl(field, coords, alpha = 0.05,
                             neighborhood_type = "radius", 
                             radius = 0.5, swap_order = "regular")

  # Contaminate the data with 10% local outliers using fixed_n neighborhoods
  # and the reverse swap_order.
  field_cont2 <- gen_loc_outl(field, coords, alpha = 0.1, 
                              neighborhood_type = "fixed_n", 
                              neighborhood_size = 10, swap_order = "reverse")
}

[Package SpatialBSS version 0.14-0 Index]