CheckForOutliers {MetabolomicsBasics} | R Documentation |
CheckForOutliers.
Description
CheckForOutliers
evaluates a numeric vector and checks for outliers
within groups, based on the group mean ± n × sd.
Usage
CheckForOutliers(
x = NULL,
group = NULL,
n_sd = 3,
method = c("idx", "logical", "dist")
)
Arguments
x |
Numeric vector. |
group |
Factor vector of length(x). |
n_sd |
Cutoff for outliers, defined as mean(group) ± n_sd * sd(group), where the group mean and sd are calculated without the outlier candidate. |
method |
Different variants of the result value. See details. |
Details
The numeric vector is split by group, and each value is evaluated
with respect to its distance from the group mean (calculated from the other
values in the group). Distance here means the number of standard deviations
by which the value deviates from the group mean. Depending on the choice of method,
the output is either the calculated distances, a logical vector
of length(x), or an index vector giving the outliers directly (see examples).
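The leave-one-out distance described above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the package's implementation: the helper name loo_dist is hypothetical, the toy data are made up, and the use of the absolute distance is an assumption.

```r
# Hypothetical helper (NOT the package's code): for each value, compute how
# many standard deviations it lies from the mean of the OTHER values in its
# group. Taking the absolute value is an assumption of this sketch.
loo_dist <- function(x, group) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    # all other members of the same group, candidate i excluded
    others <- x[group == group[i] & seq_along(x) != i]
    out[i] <- abs(x[i] - mean(others)) / sd(others)
  }
  out
}

# Made-up toy data: two groups of five values, one obvious outlier in group 1
x <- c(10, 1, 2, 3, 2, 5, 6, 5, 4, 5)
group <- gl(2, 5)

d <- loo_dist(x, group)     # distances, comparable in spirit to method = "dist"
is_outlier <- d > 3         # comparable in spirit to method = "logical", n_sd = 3
idx <- which(is_outlier)    # comparable in spirit to method = "idx"
```

Excluding the candidate matters because a strong outlier inflates the standard deviation of its own group; in this toy data, the value 10 would lie fewer than two standard deviations from a group mean that included it.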
Value
Depends on the selected method: the calculated distances, a logical vector of length(x), or an index vector of the outliers. See details.
Examples
set.seed(0)
x <- runif(10)
x[1] <- 2
group <- gl(2, 5)
CheckForOutliers(x, group, method = "dist")
CheckForOutliers(x, group, method = "logical")
CheckForOutliers(x, group, method = "idx")
graphics::par(mfrow = c(1, 2))
bg <- c(3, 2)[1 + CheckForOutliers(x, group, method = "logical")]
graphics::plot(x = as.numeric(group), y = x, pch = 21, cex = 3,
bg = bg, main = "n_sd=3", las = 1, xlim = c(0.5, 2.5))
bg <- c(3, 2)[1 + CheckForOutliers(x, group, n_sd = 4, method = "logical")]
graphics::plot(x = as.numeric(group), y = x, pch = 21, cex = 3,
bg = bg, main = "n_sd=4", las = 1, xlim = c(0.5, 2.5))
graphics::par(mfrow = c(1, 1))
# load raw data and sample description
raw <- MetabolomicsBasics::raw
sam <- MetabolomicsBasics::sam
# no missing data in this matrix
all(is.finite(raw))
# check for outliers (computing n-fold sd distance from group mean)
tmp <- apply(raw, 2, CheckForOutliers, group = sam$GT, method = "dist")
# plot a histogram of the observed distances
graphics::hist(tmp, breaks = seq(0, ceiling(max(tmp))), main = "n*SD from mean", xlab = "n")
# calculate the number of values exceeding five sigma and compare with a standard Gaussian
table(tmp > 5)
round(100 * sum(tmp > 5) / length(tmp), 2)
gauss <- CheckForOutliers(x = rnorm(prod(dim(raw))), method = "dist")
sapply(1:5, function(i) {
data.frame("obs" = sum(tmp > i), "gauss" = sum(gauss > i))
})
# compare a PCA w/wo outliers
RestrictedPCA(
dat = raw, sam = sam, use.sam = sam$GT %in% c("Mo17", "B73"), group.col = "GT",
fmod = "GT+Batch+Order", P = 1, sign.col = "GT", legend.x = NULL, text.col = "Batch", medsd = TRUE
)
raw_filt <- raw
raw_filt[tmp > 3] <- NA
RestrictedPCA(
dat = raw_filt, sam = sam, use.sam = sam$GT %in% c("Mo17", "B73"), group.col = "GT",
fmod = "GT+Batch+Order", P = 1, sign.col = "GT", legend.x = NULL, text.col = "Batch", medsd = TRUE
)