impute.slsa {imp4p} | R Documentation |
Imputing missing values using an adaptation of the LSimpute algorithm (Bo et al. (2004)) to experimental designs. This algorithm is named "Structured Least Squares Algorithm" (SLSA).
Description
This function is an adaptation of the LSimpute algorithm (Bo et al. (2004)) to experimental designs usually met in MS-based quantitative proteomics.
Usage
impute.slsa(tab, conditions, repbio=NULL, reptech=NULL, nknn=30, selec="all", weight="o",
ind.comp=1, progress.bar=TRUE)
Arguments
tab |
A data matrix containing numeric and missing values. Each column of this matrix is assumed to correspond to an experimental sample, and each row to an identified peptide. |
conditions |
A vector of factors indicating the biological condition to which each sample belongs. |
repbio |
A vector of factors indicating the biological replicate to which each sample belongs. Default is NULL (no experimental design is considered). |
reptech |
A vector of factors indicating the technical replicate to which each sample belongs. Default is NULL (no experimental design is considered). |
nknn |
The number of nearest neighbours used in the algorithm (see Details). |
selec |
A parameter to select a part of the dataset to find nearest neighbours between rows. This can be useful for big data sets (see Details). |
weight |
The way of weighting in the algorithm (see Details). |
ind.comp |
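If ind.comp=1, only nearest neighbours without missing values in the condition are considered (see Details). |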
progress.bar |
If TRUE, a progress bar is displayed. |
Details
This function imputes the missing values condition by condition. A row of the input matrix is imputed in a given condition when it has at least one observed value in that condition. For rows having only missing values in a condition, you can use the impute.pa function.
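The row-selection rule above can be illustrated with a minimal base R sketch. The toy matrix, the condition labels and the name has_obs are made up for illustration and are not part of imp4p; the sketch only shows which rows have at least one observed value in each condition.

```r
# Toy peptide-by-sample matrix: 4 peptides, 2 conditions with 3 samples each.
tab <- matrix(c(
  20.1, 20.4,   NA, 21.0, 21.2, 20.9,  # observed in both conditions
    NA,   NA,   NA, 19.5,   NA, 19.8,  # entirely missing in condition A
  18.2,   NA, 18.0,   NA,   NA,   NA,  # entirely missing in condition B
  22.0, 22.3, 22.1, 22.5, 22.4, 22.6   # complete row
), nrow = 4, byrow = TRUE)
conditions <- factor(rep(c("A", "B"), each = 3))

# For each condition, flag the rows with at least one observed value.
# Only flagged rows are candidates for imputation in that condition.
has_obs <- sapply(levels(conditions), function(cond) {
  rowSums(!is.na(tab[, conditions == cond, drop = FALSE])) > 0
})
has_obs
```

Rows flagged FALSE for a condition are the ones for which the text above suggests impute.pa instead.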
For each row, a similarity measure between the observed values of this row and those of the other rows is computed. The similarity measure is the absolute pairwise correlation coefficient if at least three side-by-side values are observed, and the inverse of the Euclidean distance between side-by-side observed values otherwise.
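The following is a minimal sketch of such a similarity measure, not the package's internal code. The helper name row_similarity is hypothetical, "side-by-side" is read here as positions where both rows are observed, and the Pearson correlation is an assumption since the text does not name the coefficient.

```r
# Hypothetical helper: similarity between two rows over the positions
# where both are observed ("side-by-side" values).
row_similarity <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  n <- sum(both)
  if (n == 0) return(NA_real_)            # nothing to compare
  if (n >= 3) {
    abs(cor(x[both], y[both]))            # absolute Pearson correlation (assumed)
  } else {
    1 / sqrt(sum((x[both] - y[both])^2))  # inverse Euclidean distance
  }
}

x <- c(1.0, 2.0, 3.0, NA, 5.0)
y <- c(2.1, 3.9, 6.2, 8.0, NA)
row_similarity(x, y)   # three shared positions: correlation branch

z <- c(NA, 2.5, NA, 4.0, 4.0)
row_similarity(x, z)   # two shared positions: distance branch
```

Edge cases are left unhandled in this sketch: a constant row makes cor() return NA, and two rows with identical shared values give an infinite inverse distance.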
For big data sets, this step can be time consuming, which is why the input parameter selec allows a random subset of rows to be used. If selec="all", all the rows of the data set are considered; if selec is a numeric value, for instance selec=100, only 100 random rows of the data set are selected for computing similarity measures with each row containing missing values.
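The effect of selec can be pictured with a short base R sketch. The variable names n_rows and candidates are made up, and whether the package draws the subset once or anew for each row is not stated here, so this only shows a single draw.

```r
set.seed(1)        # for a reproducible draw
n_rows <- 2000     # rows in a hypothetical data set
selec  <- 100      # same meaning as the selec argument

# Candidate rows against which similarity measures would be computed
candidates <- if (identical(selec, "all")) {
  seq_len(n_rows)
} else {
  sample(n_rows, min(selec, n_rows))
}
length(candidates)
```

With a numeric selec, the cost of the similarity step grows with selec rather than with the total number of rows, at the price of possibly missing the closest neighbours.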
Once similarity measures are computed for a specific row, the nknn rows with the highest similarity measures are used to fit linear models and to predict several estimates for each missing value (see Bo et al. (2004)). If ind.comp=1, only nearest neighbours without missing values in the condition are considered. Unlike the original algorithm, our algorithm can take into account the experimental design specified in input through the vectors conditions, repbio and reptech. Note that conditions must have fewer levels than repbio, and repbio must have fewer levels than reptech.
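As an illustration of the constraint on the number of levels, here is one possible labelling for a hypothetical design of 2 conditions, 3 biological replicates per condition and 2 technical replicates per biological replicate. The labelling scheme is an assumption; the exact encoding expected by impute.slsa should be checked against your own data.

```r
# Hypothetical design: 2 conditions x 3 biological replicates x 2 technical
# replicates = 12 samples (columns of tab), ordered condition by condition.
conditions <- factor(rep(c("A", "B"), each = 6))
repbio     <- factor(rep(1:6, each = 2))   # 6 biological replicates overall
reptech    <- factor(1:12)                 # one level per technical replicate

# Level counts must increase: conditions < repbio < reptech
n_levels <- c(nlevels(conditions), nlevels(repbio), nlevels(reptech))
n_levels
stopifnot(n_levels[1] < n_levels[2], n_levels[2] < n_levels[3])
```

A check of this kind before calling the function catches designs where, for example, repbio was coded with the same labels in every condition.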
In the original algorithm, several predictions of each missing value are made from the estimated linear models; they are then weighted according to their similarity measure and summed (see Bo et al. (2004)). In our algorithm, one can use the original weighting function of Bo et al. (2004) if weight="o", i.e. (sim^2/(1-sim^2+1e-06))^2 where sim is the similarity measure; or the weighting function sim^weight if weight is a numeric value.
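The prediction and weighting steps can be sketched in base R as follows. This is not the code of impute.slsa: the helper slsa_weight and all data are made up, one simple regression per neighbour is assumed, and the weights are normalised to sum to one, which is also an assumption.

```r
# Hypothetical helper reproducing the two weighting options described above.
slsa_weight <- function(sim, weight = "o") {
  if (identical(weight, "o")) {
    (sim^2 / (1 - sim^2 + 1e-06))^2   # weighting of Bo et al. (2004)
  } else {
    sim^weight                         # power weighting
  }
}

# Target row with one missing value, and two complete neighbour rows (made up).
target <- c(10.0, 11.0, 12.1, NA)
neigh  <- rbind(c(5.0, 5.5, 6.0, 6.5),
                c(8.0, 9.2, 9.9, 11.0))
obs <- !is.na(target)

# One simple regression per neighbour, each giving an estimate of the NA.
preds <- apply(neigh, 1, function(x) {
  fit <- lm.fit(cbind(1, x[obs]), target[obs])
  sum(fit$coefficients * c(1, x[!obs]))
})

# Similarity of each neighbour to the target (absolute correlation here).
sim <- apply(neigh, 1, function(x) abs(cor(x[obs], target[obs])))

# Weighted combination of the estimates (weights normalised in this sketch).
w <- slsa_weight(sim, "o")
imputed <- sum(w * preds) / sum(w)
imputed
```

Replacing "o" with a number, for instance slsa_weight(sim, 2), gives the power weighting; larger exponents concentrate the weight on the most similar neighbours.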
Value
The input matrix tab with imputed values instead of missing values.
Author(s)
Quentin Giai Gianetto <quentin2g@yahoo.fr>
References
Bo, T. H., Dysvik, B., & Jonassen, I. (2004). LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic acids research, 32(3), e34.
Examples
# Load the package providing sim.data and impute.slsa
library(imp4p)
# Simulating data
res.sim <- sim.data(nb.pept = 2000, nb.miss = 600)
# Imputation of missing values with the SLSA algorithm
dat.slsa <- impute.slsa(tab = res.sim$dat.obs, conditions = res.sim$condition, repbio = res.sim$repbio)