outlier_detection {robust2sls} | R Documentation |
Outlier detection algorithms
Description
outlier_detection provides different types of outlier detection
algorithms depending on the arguments provided. The decision whether to
classify an observation as an outlier or not is based on its standardised
residual in comparison to some user-specified reference distribution.
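The classification rule described above can be sketched in base R. This is a minimal illustration, not the package's internal code: it assumes a standard normal reference distribution and a two-sided cutoff (the quantile at 1 - sign_level / 2), and the standardised residuals are made-up values.

```r
# Hypothetical standardised residuals for four observations
stdres <- c(0.3, -2.5, 1.2, 3.1)
sign_level <- 0.05

# Assumption: two-sided cutoff under a standard normal reference distribution
cutoff <- qnorm(1 - sign_level / 2)

# An observation is flagged when its absolute standardised residual exceeds the cutoff
is_outlier <- abs(stdres) > cutoff
print(is_outlier)
```

With these values the second and fourth observations are flagged, since their absolute standardised residuals exceed the cutoff of roughly 1.96.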
The algorithms differ mainly in two ways. First, they can differ in the choice
of initial estimator, i.e. the estimator on which the first
classification of outliers is based. Second, the algorithm can either be
iterated a fixed number of times or until the difference in coefficient
estimates between the most recent model and the previous one is smaller than
some user-specified convergence criterion. The difference is measured by
the L2 norm.
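The iteration and stopping rule can be illustrated with a toy loop in base R. This is only a sketch under simplifying assumptions: it uses OLS (lm) instead of 2SLS, full-sample OLS as the initial estimator, a two-sided normal cutoff, and no bias correction of sigma. All variable names (dat, crit, l2_diff, and so on) are hypothetical and not part of the package.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[1:5] <- y[1:5] + 10                 # contaminate five observations
dat <- data.frame(x = x, y = y)

cutoff <- qnorm(1 - 0.05 / 2)         # assumption: two-sided normal cutoff
crit <- 1e-8                          # convergence criterion
max_iter <- 50                        # upper bound on the number of iterations

sel <- rep(TRUE, n)                   # start with all observations selected
beta_old <- unname(coef(lm(y ~ x, data = dat)))  # initial estimator
converged <- FALSE
iter <- 0

while (!converged && iter < max_iter) {
  # Residuals of ALL observations under the current estimates
  res <- dat$y - (beta_old[1] + beta_old[2] * dat$x)
  # sigma from the selected observations, not adjusted for degrees of freedom
  sigma <- sqrt(mean(res[sel]^2))
  # Keep observations whose absolute standardised residual is within the cutoff
  sel <- abs(res / sigma) <= cutoff
  # Re-estimate on the selected subsample
  beta_new <- unname(coef(lm(y ~ x, data = dat[sel, ])))
  # L2 norm of the change in coefficient estimates
  l2_diff <- sqrt(sum((beta_new - beta_old)^2))
  converged <- l2_diff < crit
  beta_old <- beta_new
  iter <- iter + 1
}

print(c(iterations = iter, converged = converged))
```

Setting a fixed number of iterations corresponds to dropping the convergence check and running the loop body a given number of times.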
Usage
outlier_detection(
  data,
  formula,
  ref_dist = c("normal"),
  sign_level,
  initial_est = c("robustified", "saturated", "user", "iis"),
  user_model = NULL,
  iterations = 1,
  convergence_criterion = NULL,
  max_iter = NULL,
  shuffle = FALSE,
  shuffle_seed = NULL,
  split = 0.5,
  verbose = FALSE,
  iis_args = NULL
)
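A call might look like the following sketch. The data set (mtcars), the choice of variables, and the ivreg-style formula layout (regressors before the bar, instruments after it) are assumptions made purely for illustration. The call is guarded so that the snippet also runs when robust2sls is not installed.

```r
# Illustrative data and formula; the variable roles are hypothetical
dat <- datasets::mtcars
frml <- mpg ~ cyl + disp | cyl + wt   # assumed layout: y ~ regressors | instruments

if (requireNamespace("robust2sls", quietly = TRUE)) {
  fit <- robust2sls::outlier_detection(
    data = dat,
    formula = frml,
    ref_dist = "normal",
    sign_level = 0.05,
    initial_est = "robustified",
    iterations = 3
  )
  # Number of iterations actually performed
  print(fit$cons$iterations$actual)
  # Classification codes from the last stored iteration
  print(table(fit$type[[length(fit$type)]]))
}
```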
Arguments
data |
A dataframe. |
formula |
A formula for the |
ref_dist |
A character vector that specifies the reference distribution
against which observations are classified as outliers. |
sign_level |
A numeric value between 0 and 1 that determines the cutoff in the reference distribution against which observations are judged as outliers or not. |
initial_est |
A character vector that specifies the initial estimator
for the outlier detection algorithm. |
user_model |
A model object of class ivreg. Only
required if argument |
iterations |
Either an integer >= 0 that specifies how often the outlier
detection algorithm is iterated, or the character vector
|
convergence_criterion |
A numeric value or NULL. The algorithm stops as
soon as the difference in coefficient estimates between the most recent model
and the previous one is smaller than |
max_iter |
A numeric value >= 1 or NULL. If
|
shuffle |
A logical value or |
shuffle_seed |
An integer value that will set the seed for shuffling the
sample or |
split |
A numeric value strictly between 0 and 1 that determines in which proportions the sample will be split. |
verbose |
A logical value whether progress during estimation should be reported. |
iis_args |
A list with named entries corresponding to the arguments for
|
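How shuffle, shuffle_seed and split could interact can be sketched in base R. This illustrates the idea only: the rounding rule (floor) and the index handling are assumptions and may differ from what the package does internally.

```r
n <- 10             # hypothetical sample size
split <- 0.5        # proportion of observations in the first subsample
shuffle <- TRUE
shuffle_seed <- 42

idx <- seq_len(n)
if (shuffle) {
  set.seed(shuffle_seed)   # makes the shuffled order reproducible
  idx <- sample(idx)
}

n1 <- floor(n * split)     # assumption: size of the first subsample is rounded down
sub1 <- idx[seq_len(n1)]   # indices of the first subsample
sub2 <- idx[-seq_len(n1)]  # indices of the second subsample

print(list(sub1 = sub1, sub2 = sub2))
```

Fixing the seed means that repeated runs split the sample in the same way, which keeps the resulting classification reproducible.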
Value
outlier_detection returns an object of class "robust2sls", which is a list with the following components:
$cons
A list which stores high-level information about the function call and some results.
$call is the captured function call,
$formula the formula argument,
$data the original data set,
$reference the chosen reference distribution to classify outliers,
$sign_level the significance level,
$psi the probability that an observation is not classified as an outlier under the null hypothesis of no outliers,
$cutoff the cutoff used to classify outliers if their standardised residuals are larger than that value,
$bias_corr a bias correction factor to account for potential false positives (observations classified as outliers even though they are not).
There are three further elements that are lists themselves.
$initial
stores settings about the initial estimator:
$estimator is the type of the initial estimator (e.g. robustified or saturated),
$split how the sample is split (NULL if argument not used),
$shuffle whether the sample is shuffled before splitting (NULL if argument not used),
$shuffle_seed the value of the random seed (NULL if argument not used).
$convergence
stores information about the convergence of the outlier-detection algorithm:
$criterion is the user-specified convergence criterion (NULL if argument not used),
$difference is the L2 norm between the last coefficient estimates and the previous ones (NULL if argument not used or only initial estimator calculated),
$converged is a logical value indicating whether the algorithm has converged, i.e. whether the difference is smaller than the convergence criterion (NULL if argument not used),
$max_iter is the maximum iteration set by the user (NULL if argument not used or not set).
$iterations
contains information about the user-specified iterations argument ($setting) and the actual number of iterations that were done ($actual). The actual number can be lower if the algorithm converged before the user-specified number of iterations was reached.
$model
A list storing the model objects of class ivreg for each iteration. Each model is stored under $m0, $m1, ...
$res
A list storing the residuals of all observations for each iteration. Residuals of observations where any of the y, x, or z variables used in the 2SLS model are missing are set to NA. Each vector is stored under $m0, $m1, ...
$stdres
A list storing the standardised residuals of all observations for each iteration. Standardised residuals of observations where any of the y, x, or z variables used in the 2SLS model are missing are set to NA. Standardisation is done by dividing by sigma, which is not adjusted for degrees of freedom. Each vector is stored under $m0, $m1, ...
$sel
A list of logical vectors storing whether an observation is included in the estimation or not. Observations are excluded (FALSE) if they either have missing values in any of the x, y, or z variables needed in the model or when they are classified as outliers based on the model. Each vector is stored under $m0, $m1, ...
$type
A list of integer vectors indicating whether an observation has any missing values in x, y, or z (-1), whether it is classified as an outlier (0) or not (1). Each vector is stored under $m0, $m1, ...
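The coding of $type and its relation to $sel can be illustrated with made-up values. The vector below is hypothetical and stands in for one element such as $type$m1. The labels are chosen for readability and are not produced by the package.

```r
# Hypothetical classification codes for six observations
type_m1 <- c(1L, 1L, 0L, -1L, 1L, 0L)

# Translate the integer codes into readable labels
status <- factor(
  type_m1,
  levels = c(-1L, 0L, 1L),
  labels = c("missing", "outlier", "non-outlier")
)
print(table(status))

# Following the description of $sel: only complete, non-outlying
# observations (code 1) are included in the estimation
sel_m1 <- type_m1 == 1L
print(sel_m1)
```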
Warning
Check Jiao (2019)
(as well as a forthcoming working paper) for the conditions that the
initial estimator should satisfy when
using initial_est == "user"
(e.g. they have to be Op(1)).
IIS is a generalisation of Saturated 2SLS with multiple block search, but no asymptotic theory exists for IIS.