R: Influential Error Detection

sel.edit {SeleMix}

R Documentation

Influential Error Detection

Description

Computes the local and global score functions and identifies the units affected by influential errors.

Usage

       sel.edit(y, ypred, wgt=rep(1, nrow(as.matrix(y))),
                tot=colSums(ypred * wgt), t.sel=0.01)

Arguments

`y`	matrix or data frame containing the response variables
`ypred`	matrix of predicted values for y variables
`wgt`	optional vector of sampling weights (default=1)
`tot`	optional vector containing reference estimates of totals for the y variables. If omitted, it is computed as the (possibly weighted) sum of predicted values
`t.sel`	optional vector of threshold values, one for each variable, for selective editing (default=0.01)

Details

This function ranks observations (rank) by the importance of their potential errors. The ranking is based on the values of the global score function (global.score). The function also selects the units to be edited (sel) so that, for every variable, the expected residual error is below a predefined accuracy level (t.sel). The global score (global.score) is the maximum of the local scores computed for each variable (y1.score, y2.score, ...). Each local score is the weighted (weights) absolute difference between the observed value (y1, y2, ...) and the predicted value (y1.p, y2.p, ...), standardised with respect to the reference total estimate (tot).
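The score definitions above can be illustrated with a minimal sketch in base R. It does not call the package; the toy values and the object names local.score and global.score are illustrative assumptions, and the arithmetic follows only the verbal definition given in this section, so the internal implementation of sel.edit may differ in detail.

```r
# Toy data: 4 units, 2 variables (values are illustrative only)
y     <- matrix(c(10, 12, 50, 11,
                   5,  6,  5, 30), ncol = 2,
                dimnames = list(NULL, c("y1", "y2")))
ypred <- matrix(c(10, 11, 12, 11,
                   5,  6,  6,  7), ncol = 2)
wgt   <- rep(1, nrow(y))        # sampling weights (all equal to 1 here)
tot   <- colSums(ypred * wgt)   # reference totals, as in the default argument

# Local score: weighted absolute difference between observed and
# predicted values, divided by the reference total of each variable
local.score <- sweep(abs(y - ypred) * wgt, 2, tot, "/")

# Global score: maximum of the local scores within each unit
global.score <- apply(local.score, 1, max)

# Units sorted from the most to the least influential potential error
order(global.score, decreasing = TRUE)
```

In this toy case the third and fourth units have by far the largest global scores, because each has one observed value far from its prediction relative to the corresponding total.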

The units affected by an influential error, and therefore to be edited (sel=1), are selected with a two-step algorithm:
1) order the observations by decreasing global.score;
2) select the first k units such that, from the (k+1)th to the last observation, the residual errors (y1.reserr, y2.reserr, ...) of all variables are below t.sel.
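The two steps can be sketched in base R as follows. This is a hedged illustration, not the package code: the local scores are made-up numbers, the helper tail.sum is hypothetical, and the residual error left after editing the first k units is assumed here to be the sum of the local scores of the units ranked after k. This section does not define the residual error, and sel.edit may compute it differently.

```r
# Hypothetical local scores for 5 units and 2 variables
local.score <- matrix(c(0.001, 0.200, 0.004, 0.050, 0.002,
                        0.003, 0.001, 0.150, 0.002, 0.004),
                      ncol = 2, dimnames = list(NULL, c("y1", "y2")))
t.sel <- c(0.01, 0.01)                    # one threshold per variable
global.score <- apply(local.score, 1, max)

# Step 1: order the units by decreasing global score
ord <- order(global.score, decreasing = TRUE)
s   <- local.score[ord, , drop = FALSE]
n   <- nrow(s)

# ASSUMPTION: residual error after editing units 1..k is the sum of the
# local scores of units k+1..n (the package may define it differently)
tail.sum <- function(v) c(rev(cumsum(rev(v)))[-1], 0)
reserr   <- apply(s, 2, tail.sum)         # row k = residual after k edits

# Step 2: smallest k for which every residual error is below its threshold
below <- apply(sweep(reserr, 2, t.sel, "<"), 1, all)
k     <- if (all(colSums(s) < t.sel)) 0 else which(below)[1]

# Selection flag in the original unit order
sel <- integer(n)
sel[ord[seq_len(k)]] <- 1L
```

With these numbers the first three ranked units (original units 2, 3 and 4) are selected: only after editing them do the remaining local scores sum to less than 0.01 for both variables.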

The function also provides indicator variables (y1.sel, y2.sel, ...) reporting which variables contain an influential error in a unit selected for revision.

Value

sel.edit returns a data matrix containing the following columns:

`y1`, `y2`, `...`	observed variables
`y1.p`, `y2.p`, `...`	predictions of y variables
`weights`	sampling weights
`y1.score`, `y2.score`, `...`	local scores
`global.score`	global score
`y1.reserr`, `y2.reserr`, `...`	residual errors
`y1.sel`, `y2.sel`, `...`	influential error flags
`rank`	rank according to global score
`sel`	1 if the observation contains an influential error, 0 otherwise

Author(s)

M. Teresa Buglielli <bugliell@istat.it>, Ugo Guarnera <guarnera@istat.it>

References

Di Zio, M., Guarnera, U. (2013) "A Contamination Model for Selective Editing", Journal of Official Statistics. Volume 29, Issue 4, Pages 539-555 (http://dx.doi.org/10.2478/jos-2013-0039).

Buglielli, M.T., Di Zio, M., Guarnera, U. (2010) "Use of Contamination Models for Selective Editing", European Conference on Quality in Survey Statistics Q2010, Helsinki, 4-6 May 2010.

Examples

# Example 1
# Parameter estimation with one contaminated variable and one covariate
    data(ex1.data)
    ml.par <- ml.est(y=ex1.data[,"Y1"], x=ex1.data[,"X1"])
# Detection of influential errors
    sel <- sel.edit(y=ex1.data[,"Y1"], ypred=ml.par$ypred)
    head(sel)
# Number of units selected for editing
    sum(sel[,"sel"])
# Sort the results by rank (decreasing global score)
    sel.ord <- sel[order(sel[,"rank"]), ]
# Add the model output and the selection results to the data
    ex1.data <- cbind(ex1.data, tau=ml.par$tau, outlier=ml.par$outlier,
                      sel[,c("rank", "sel")])
# Plot the data, highlighting outliers and influential errors
    sel.pairs(ex1.data[,c("X1","Y1")], outl=ml.par$outlier, sel=sel[,"sel"])

# Example 2
# Joint parameter estimation on all variables in ex2.data, without covariates
    data(ex2.data)
    par.joint <- ml.est(y=ex2.data)
    sel <- sel.edit(y=ex2.data, ypred=par.joint$ypred)
    sel.pairs(ex2.data, outl=par.joint$outlier, sel=sel[,"sel"])

[Package SeleMix version 1.0.2 Index]