fadalara_no_paral {adamethods}    R Documentation
Functional non-parallel archetypoid algorithm for large applications (FADALARA)
Description
The FADALARA algorithm is based on the CLARA clustering algorithm. This is the non-parallel version of the algorithm. It can detect anomalies (outliers). In the univariate case, two different methods are available to detect them: the adjusted boxplot (default and most reliable option) and tolerance intervals. In the multivariate case, only adjusted boxplots are used. If needed, tolerance intervals make it possible to define a degree of outlierness.
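For instance, a univariate call that flags outliers with tolerance intervals could look like the following sketch (X_uni and PM are assumed placeholders, not objects provided by the package):

# Sketch only: X_uni is a hypothetical n x t numeric matrix with row names,
# and PM a penalty matrix obtained with eval.penalty().
res_uni <- fadalara_no_paral(data = X_uni, seed = 1, N = 10, m = 15,
                             numArchoid = 3, numRep = 5, huge = 200,
                             prob = 0.75, type_alg = "fada_rob", compare = FALSE,
                             verbose = FALSE, PM = PM,
                             vect_tol = c(0.95, 0.9, 0.85), alpha = 0.05,
                             outl_degree = c("outl_strong", "outl_semi_strong",
                                             "outl_moderate"),
                             method = "toler", multiv = FALSE, frame = FALSE)
res_uni$outliers # flagged outliers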
Usage
fadalara_no_paral(data, seed, N, m, numArchoid, numRep, huge, prob, type_alg = "fada",
compare = FALSE, verbose = TRUE, PM, vect_tol = c(0.95, 0.9, 0.85),
alpha = 0.05, outl_degree = c("outl_strong", "outl_semi_strong",
"outl_moderate"), method = "adjbox", multiv, frame)
Arguments
data
Data matrix. Each row corresponds to an observation and each column corresponds to a variable (temporal point). All variables are numeric. The data must have row names so that the algorithm can identify the archetypoids in every sample.
seed
Integer value to set the seed. This ensures reproducibility.
N
Number of samples.
m
Sample size of each sample.
numArchoid
Number of archetypes/archetypoids.
numRep
For each numArchoid, the archetype algorithm is run numRep times.
huge
Penalization added to solve the convex least squares problems.
prob
Probability with values in [0,1].
type_alg
String. Use 'fada' for the non-robust FADALARA algorithm and 'fada_rob' for the robust one.
compare
Boolean argument to compute the robust residual sum of squares if type_alg = "fada" and the non-robust one if type_alg = "fada_rob".
verbose
Display progress? Default TRUE.
PM
Penalty matrix obtained with eval.penalty.
vect_tol
Vector of tolerance values. Default c(0.95, 0.9, 0.85). Needed if method = "toler".
alpha
Significance level. Default 0.05. Needed if method = "toler".
outl_degree
Type of outlier to identify the degree of outlierness. Default c("outl_strong", "outl_semi_strong", "outl_moderate"). Needed if method = "toler".
method
Method to compute the outliers. Options allowed are 'adjbox' for using adjusted boxplots for skewed distributions, and 'toler' for using tolerance intervals. The tolerance intervals are only computed in the univariate case, i.e., when multiv = FALSE.
multiv
Multivariate (TRUE) or univariate (FALSE) algorithm.
frame
Boolean value to indicate whether the frame is computed (Mair et al., 2017). The frame is made up of a subset of extreme points, so the archetypoids are only computed on the frame, which speeds up the computation. Low frame densities are obtained when only small portions of the data are extreme; high frame densities reduce this speed-up. A toy illustration of the frame idea follows this argument list.
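The following sketch is not the package's internal computation; it only illustrates, in two dimensions, what the frame is: the extreme points are the vertices of the convex hull, and the frame density is the fraction of observations lying on it.

# Toy illustration of the frame idea (Mair et al., 2017) in 2D,
# not the procedure used internally by fadalara_no_paral:
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)   # 100 bivariate observations
frame_idx <- chull(toy)               # indices of the extreme (convex hull) points
length(frame_idx) / nrow(toy)         # frame density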
Value
A list with the following elements:
cases: Vector of archetypoids.
rss: Optimal residual sum of squares.
outliers: Vector of outliers.
alphas: Matrix with the alpha coefficients.
local_rel_imp: Matrix with the local (casewise) relative importance (in percentage) of each variable for the outlier identification. Only for the multivariate case. It is relative to the outlier observation itself: the other observations are not considered when computing this importance. This is meaningful because the functional variables are on the same scale after standardization; otherwise it could not be interpreted in this way.
margi_rel_imp: Matrix with the marginal relative importance (in percentage) of each variable for the outlier identification. Only for the multivariate case. Here the other observations are taken into account, since the value of the outlier observation is compared with the remaining points.
Author(s)
Guillermo Vinue, Irene Epifanio
References
Epifanio, I., Functional archetype and archetypoid analysis, 2016. Computational Statistics and Data Analysis 104, 24-34, https://doi.org/10.1016/j.csda.2016.06.007
Hubert, M. and Vandervieren, E., An adjusted boxplot for skewed distributions, 2008. Computational Statistics and Data Analysis 52(12), 5186-5201, https://doi.org/10.1016/j.csda.2007.11.008
Kaufman, L. and Rousseeuw, P.J., Clustering Large Data Sets, 1986. Pattern Recognition in Practice, 425-437.
Mair, S., Boubekki, A. and Brefeld, U., Frame-based Data Factorizations, 2017. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 1-9.
Moliner, J. and Epifanio, I., Robust multivariate and functional archetypal analysis with application to financial time series analysis, 2019. Physica A: Statistical Mechanics and its Applications 519, 195-208, https://doi.org/10.1016/j.physa.2018.12.036
See Also
Examples
## Not run:
library(fda)
?growth
str(growth)
hgtm <- growth$hgtm
hgtf <- growth$hgtf[,1:39]
# Create array:
nvars <- 2
data.array <- array(0, dim = c(dim(hgtm), nvars))
data.array[,,1] <- as.matrix(hgtm)
data.array[,,2] <- as.matrix(hgtf)
rownames(data.array) <- 1:nrow(hgtm)
colnames(data.array) <- colnames(hgtm)
str(data.array)
# Create basis:
nbasis <- 10
basis_fd <- create.bspline.basis(c(1,nrow(hgtm)), nbasis)
PM <- eval.penalty(basis_fd)
# Make fd object:
temp_points <- 1:nrow(hgtm)
temp_fd <- Data2fd(argvals = temp_points, y = data.array, basisobj = basis_fd)
X <- array(0, dim = c(dim(t(temp_fd$coefs[,,1])), nvars))
X[,,1] <- t(temp_fd$coefs[,,1])
X[,,2] <- t(temp_fd$coefs[,,2])
# Standardize the variables:
Xs <- X
Xs[,,1] <- scale(X[,,1])
Xs[,,2] <- scale(X[,,2])
# We have to give names to the dimensions to know the
# observations that were identified as archetypoids.
dimnames(Xs) <- list(paste("Obs", 1:dim(hgtm)[2], sep = ""),
1:nbasis,
c("boys", "girls"))
n <- dim(Xs)[1]
# Number of archetypoids:
k <- 3
numRep <- 20
huge <- 200
# Size of the random sample of observations:
m <- 15
# Number of samples:
N <- floor(1 + (n - m)/(m - k))
N
prob <- 0.75
data_alg <- Xs
seed <- 2018
res_fl <- fadalara_no_paral(data = data_alg, seed = seed, N = N, m = m,
numArchoid = k, numRep = numRep, huge = huge,
prob = prob, type_alg = "fada_rob", compare = FALSE,
verbose = TRUE, PM = PM, method = "adjbox", multiv = TRUE,
frame = FALSE) # frame = TRUE
str(res_fl)
res_fl$cases
res_fl$rss
as.vector(res_fl$outliers)
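# If outliers are flagged in this multivariate run, the alpha coefficients and
# the relative importance matrices described in the Value section can also be inspected:
res_fl$alphas
res_fl$local_rel_imp
res_fl$margi_rel_imp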
## End(Not run)