matrixNAneighbourImpute {wrProteo} | R Documentation |
Imputation of NA-values based on non-NA replicates
Description
It is assumed that NA-values appear in data when quantitation values are very low (as happens, eg, in quantitative shotgun proteomics).
Here, the concept of (technical) replicates is used to investigate what kind of values appear in the other replicates next to NA-values for the same line/protein.
Groups of replicate samples are defined via argument gr (which describes the columns of dat).
Then, they are inspected for each line to gather NA-neighbour values (ie those values where NAs and regular measures are observed at the same time).
Eg, let's consider a line that contains a set of 4 replicates for a given group. Now, if 2 of them are NA-values, the remaining 2 non-NA-values will be considered as NA-neighbours.
Ultimately, the aim is to replace all NA-values based on values from a normal distribution resembling their respective NA-neighbours.
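The notion of NA-neighbours described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's internal code: the helper name getNAneighbours and its argument minNonNA are hypothetical and do not exist in wrProteo.

```r
# Hypothetical helper (not part of wrProteo): collect NA-neighbour values,
# ie non-NA values from replicate groups that also contain at least one NA.
getNAneighbours <- function(dat, gr, minNonNA = 1) {
  dat <- as.matrix(dat)
  gr <- as.character(gr)
  out <- numeric(0)
  for (g in unique(gr)) {
    sub <- dat[, gr == g, drop = FALSE]      # columns of the current replicate group
    nNA <- rowSums(is.na(sub))               # number of NAs per line within the group
    nOK <- ncol(sub) - nNA                   # number of measured values per line
    keep <- nNA > 0 & nOK >= minNonNA        # lines with both NAs and measured values
    vals <- sub[keep, , drop = FALSE]
    out <- c(out, vals[!is.na(vals)])
  }
  out
}

# One line/protein with 4 replicates in a single group, 2 of them NA
x <- matrix(c(NA, 12.1, NA, 11.8), nrow = 1,
  dimnames = list("prot1", paste0("rep", 1:4)))
neigh <- getNAneighbours(x, gr = rep("A", 4))
neigh
```

Here the two measured values of the line are returned as NA-neighbours, matching the 4-replicate example in the text.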
Usage
matrixNAneighbourImpute(
dat,
gr,
imputMethod = "mode2",
retnNA = TRUE,
avSd = c(0.15, 0.5),
avSdH = NULL,
NAneigLst = NULL,
plotHist = c("hist", "mode"),
xLab = NULL,
xLim = NULL,
yLab = NULL,
yLim = NULL,
tit = NULL,
figImputDetail = TRUE,
seedNo = NULL,
silent = FALSE,
callFrom = NULL,
debug = FALSE
)
Arguments
dat |
(matrix or data.frame) main data (may contain |
gr |
(character or factor) grouping of columns of 'dat', replicate association |
imputMethod |
(character) choose the imputation method (may be 'mode2'(default), 'mode1', 'datQuant', 'modeAdopt' or 'informed') |
retnNA |
(logical) decide (if = |
avSd |
(numerical,length=2) population characteristics 'high' (mean and sd) for >1 |
avSdH |
deprecated, please use |
NAneigLst |
(list) option for repeated rounds of imputations: list of |
plotHist |
(character or logical) decide if supplemental figure with histogram should be drawn, the details 'Hist','quant' (display quantile of original data), 'mode' (display mode of original data) can be chosen explicitly |
xLab |
(character) label on x-axis on plot |
xLim |
(numeric, length=2) custom x-axis limits |
yLab |
(character) label on y-axis on plot |
yLim |
(numeric, length=2) custom y-axis limits |
tit |
(character) title on plot |
figImputDetail |
(logical) display details about data (number of NAs) and imputation in graph (min number of NA-neighbours per protein and group, quantile to model, mean and sd of imputed) |
seedNo |
(integer) seed-value for normal random values |
silent |
(logical) suppress messages |
callFrom |
(character) allow easier tracking of messages produced |
debug |
(logical) supplemental messages for debugging |
Details
By default a histogram gets plotted showing the initial, imputed and final distribution to check the global hypothesis that NA-values arose from very low measurements and to appreciate the impact of the imputed values on the overall final distribution.
There are a number of experimental settings where low measurements may be reported as NA.
Sometimes an arbitrarily defined baseline (as 'zero') may cause values found below it to be reported as NA or as 0 (in case of MaxQuant).
In quantitative proteomics (DDA-mode) the presence of numerous high-abundance peptides means that a number of less intense MS-peaks don't get identified properly and will then be reported as NA in the respective samples, while the same peptides may be correctly identified and quantified in other (replicate) samples.
So, if a given protein/peptide gets properly quantified in some replicate samples but reported as NA in other replicate samples, one may speculate that values similar to those of the successful quantifications may have occurred.
Thus, imputation of NA-values may be done on the basis of NA-neighbours.
When extracting NA-neighbours, a slightly more focussed approach gets checked, too, the 2-NA-neighbours: in case a set of replicates for a given protein contains at least 2 non-NA-values (instead of just one) it will be considered as a (min) 2-NA-neighbour as well as a regular NA-neighbour.
If >300 of these (min) 2-NA-neighbours get found, they will be used instead of the regular NA-neighbours.
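The choice between regular NA-neighbours and (min) 2-NA-neighbours can be illustrated with a small base R sketch. The toy data and the variable names (neigh1, neigh2, minN2, useNeigh) are assumptions made for illustration; only the threshold of 300 is taken from the text.

```r
set.seed(1)
dat <- matrix(rnorm(40, mean = 10), ncol = 4)   # toy data: one group of 4 replicates
dat[1, 1:3] <- NA     # 1 non-NA value left: regular NA-neighbour only
dat[2, 1:2] <- NA     # 2 non-NA values left: also counts as (min) 2-NA-neighbour
dat[3, ] <- NA        # all replicates NA: contributes no neighbours

nNA <- rowSums(is.na(dat))      # NAs per line
nOK <- ncol(dat) - nNA          # measured values per line

# regular NA-neighbours: at least 1 measured value next to at least 1 NA
neigh1 <- dat[nNA > 0 & nOK >= 1, , drop = FALSE]
neigh1 <- neigh1[!is.na(neigh1)]
# (min) 2-NA-neighbours: at least 2 measured values next to at least 1 NA
neigh2 <- dat[nNA > 0 & nOK >= 2, , drop = FALSE]
neigh2 <- neigh2[!is.na(neigh2)]

minN2 <- 300                    # threshold as stated in the text
useNeigh <- if (length(neigh2) > minN2) neigh2 else neigh1
c(nNeigh1 = length(neigh1), nNeigh2 = length(neigh2))
```

With this tiny data set far fewer than 300 2-NA-neighbours exist, so the regular NA-neighbours are retained.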
For creating a collection of normal random values one may directly use the mode of the NA-neighbours (or 2-NA-neighbours, if >300 such values are available).
To do so, the first value of argument avSd must be set to NA. Otherwise, the first value of avSd will be used as quantile of all data to define the mean for the imputed data (ie as quantile(dat, avSd[1], na.rm=TRUE)). The sd for generating normal random values will be taken from the sd of all NA-neighbours (or 2-NA-neighbours) multiplied by the second value in argument avSd (or avSd, if >300 2-NA-neighbours), since the sd of the NA-neighbours is usually quite high.
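How mean and sd for the normal random values follow from avSd may be easier to see in a short sketch. This is a simplified illustration, not the function's actual implementation: the toy data are invented, and the mode is crudely approximated by the peak of a kernel density estimate as a stand-in for the package's own mode estimation.

```r
set.seed(2013)
dat <- matrix(rnorm(200, mean = 10, sd = 2), ncol = 4)   # toy data, one group of 4 replicates
dat[dat < 8.5 & col(dat) <= 2] <- NA     # toy assumption: low values in columns 1-2 go missing
avSd <- c(0.15, 0.5)                     # default values from the Usage section

# NA-neighbours: measured values on lines that also contain NAs
nNA <- rowSums(is.na(dat))
neigh <- dat[nNA > 0 & nNA < ncol(dat), , drop = FALSE]
neigh <- neigh[!is.na(neigh)]

impMean <- if (is.na(avSd[1])) {
  # stand-in for a mode estimate (assumption): peak of the kernel density
  d <- density(neigh)
  d$x[which.max(d$y)]
} else quantile(dat, avSd[1], na.rm = TRUE)

impSd <- sd(neigh) * avSd[2]             # shrink the (usually high) sd of the NA-neighbours
randVal <- rnorm(sum(is.na(dat)), mean = impMean, sd = impSd)

datImp <- dat
datImp[is.na(dat)] <- randVal            # replace NAs by the random values
```

Setting avSd[1] to NA in this sketch switches the mean from the global quantile to the (approximated) mode of the NA-neighbours.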
In extremely rare cases it may happen that no NA-neighbours are found (ie if NAs occur, all replicates are NA).
Then, this function replaces NA-values based on the normal random values obtained as described above.
Value
This function returns a list with $data .. matrix of data where NA are replaced by imputed values, $nNA .. number of NA by group, $randParam .. parameters used for making random data
See Also
this function gets used by testRobustToNAimputation; estimation of mode stableMode; detection of NAs na.fail
Examples
set.seed(2013)
## toy data: 50 lines, 6 samples (2 groups of 3 replicates)
datT6 <- matrix(round(rnorm(300)+3,1), ncol=6,
  dimnames=list(paste("li",1:50,sep=""), letters[19:24]))
## add a line-specific offset so that lines differ in abundance
datT6 <- datT6 +matrix(rep(1:nrow(datT6), ncol(datT6)), ncol=ncol(datT6))
## introduce NAs
datT6[6:7, c(1,3,6)] <- NA
datT6[which(datT6 < 11 & datT6 > 10.5)] <- NA
datT6[which(datT6 < 6 & datT6 > 5)] <- NA
datT6[which(datT6 < 4.6 & datT6 > 4)] <- NA
## impute, using 2 groups of 3 replicates each
datT6b <- matrixNAneighbourImpute(datT6, gr=gl(2,3))
head(datT6b$data)