checkDataSet {cellWise} | R Documentation |
Clean the dataset
Description
This function checks the dataset X and sets aside certain
columns and rows that do not satisfy the conditions.
It is used by the DDC and MacroPCA functions, but it can also be used
by itself to clean a dataset for a different type of analysis.
Usage
checkDataSet(X, fracNA = 0.5, numDiscrete = 3, precScale = 1e-12, silent = FALSE,
cleanNAfirst = "automatic")
Arguments
X |
The input dataset to be checked.
fracNA |
Only retain columns and rows with fewer NAs than this fraction.
Defaults to 0.5.
numDiscrete |
A column that takes on numDiscrete or fewer values
will be considered discrete and not retained in the cleaned data.
Defaults to 3.
precScale |
Only consider columns whose scale is larger than precScale.
Here scale is measured by the median absolute deviation.
Defaults to 1e-12.
silent |
If TRUE, the function's progress messages are not printed.
Defaults to FALSE.
cleanNAfirst |
Determines whether columns or rows are checked for NAs first.
Defaults to "automatic".
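The column-level criteria described by fracNA, numDiscrete and precScale can be illustrated in base R. The sketch below is not the cellWise implementation: `screen_columns` is a hypothetical helper written only for this illustration, and it ignores the row checks and the handling of non-numeric or case-number columns.

```r
# Illustrative only: base-R versions of the three column-level criteria.
# `screen_columns` is a hypothetical helper, not part of cellWise.
screen_columns <- function(X, fracNA = 0.5, numDiscrete = 3, precScale = 1e-12) {
  X <- as.matrix(X)
  # Fraction of missing values in each column
  na_frac <- colMeans(is.na(X))
  # Number of distinct non-missing values in each column
  n_distinct <- apply(X, 2, function(col) length(unique(col[!is.na(col)])))
  # Scale of each column, measured by the median absolute deviation
  col_mad <- apply(X, 2, mad, na.rm = TRUE)
  # Keep columns with few NAs, more than numDiscrete values, and non-negligible scale
  keep <- na_frac < fracNA & n_distinct > numDiscrete & col_mad > precScale
  which(keep)
}

set.seed(1)
X <- cbind(
  ok       = rnorm(20),                   # passes all three checks
  mostlyNA = c(rnorm(5), rep(NA, 15)),    # 75% missing
  discrete = rep(c(0, 1), 10),            # only two distinct values
  constant = rep(3, 20)                   # zero scale
)
kept <- screen_columns(X)
print(names(kept))
```

In this toy matrix only the first column satisfies all three conditions with the default thresholds.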
Value
A list with components:
colInAnalysis
Column indices of the columns used in the analysis.
rowInAnalysis
Row indices of the rows used in the analysis.
namesNotNumeric
Names of the variables which are not numeric.
namesCaseNumber
The name of the variable(s) which contained the case numbers and was therefore removed.
namesNAcol
Names of the columns left out due to too many NAs.
namesNArow
Names of the rows left out due to too many NAs.
namesDiscrete
Names of the discrete variables.
namesZeroScale
Names of the variables with zero scale.
remX
Remaining (cleaned) data after checkDataSet.
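As a brief sketch of how these components might be inspected, the code below builds a small dataset with missing cells and a case-number column, and then reads a few elements of the returned list. It assumes the cellWise and MASS packages are installed and is skipped otherwise. No particular output is claimed here.

```r
# Assumes cellWise and MASS are installed; the block is skipped otherwise.
if (requireNamespace("cellWise", quietly = TRUE) &&
    requireNamespace("MASS", quietly = TRUE)) {
  set.seed(12345)
  n <- 100; d <- 10
  A <- matrix(0.9, d, d); diag(A) <- 1
  x <- MASS::mvrnorm(n, rep(0, d), A)
  x[sample(seq_len(n * d), 100)] <- NA   # 100 random missing cells
  x <- cbind(1:n, x)                     # first column mimics case numbers

  checkedx <- cellWise::checkDataSet(x, silent = TRUE)

  print(dim(checkedx$remX))         # dimensions of the cleaned data
  print(checkedx$colInAnalysis)     # indices of the retained columns
  print(checkedx$namesCaseNumber)   # column(s) set aside as case numbers
}
```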
Author(s)
Rousseeuw P.J., Van den Bossche W.
References
Rousseeuw, P.J., Van den Bossche, W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145.
See Also
Examples
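library(MASS)      # for mvrnorm()
library(cellWise)  # for checkDataSet()
set.seed(12345)
n <- 100; d <- 10
# Covariance matrix with unit variances and correlations of 0.9
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
# Set 100 randomly chosen cells to NA
x[sample(1:(n * d), 100, FALSE)] <- NA
# Prepend a column of case numbers, which checkDataSet should set aside
x <- cbind(1:n, x)
checkedx <- checkDataSet(x)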
# For more examples, we refer to the vignette:
## Not run:
vignette("DDC_examples")
## End(Not run)