impute.mi {imp4p}

R Documentation

Imputation of data sets containing peptide intensities with a multiple imputation strategy.

Description

This function imputes data sets containing peptide intensities using a multiple imputation strategy that distinguishes MCAR (missing completely at random) from MNAR (missing not at random) values. For details, see Giai Gianetto Q. et al. (2020) (doi: 10.1101/2020.05.29.122770).

Usage

impute.mi(tab, conditions, repbio=NULL, reptech=NULL, nb.iter=3, nknn=15, selec=1000,
siz=900, weight=1, ind.comp=1, progress.bar=TRUE, x.step.mod=300, x.step.pi=300,
nb.rei=100, q=0.95, methodMCAR="mle",ncp.max=5,
maxiter = 10, ntree = 100, variablewise = FALSE,
decreasing = FALSE, verbose = FALSE, mtry = floor(sqrt(ncol(tab))),
replace = TRUE, classwt = NULL, cutoff = NULL, strata = NULL, sampsize = NULL,
nodesize = NULL, maxnodes = NULL, xtrue = NA,
parallelize = c('no', 'variables', 'forests'),
methodMNAR="igcda",q.min = 0.025, q.norm = 3, eps = 0,
distribution = "unif", param1 = 3, param2 = 1, R.q.min=1);

Arguments

`tab`	A data matrix containing only numeric and missing values. Each column of this matrix is assumed to correspond to an experimental sample, and each row to an identified peptide.
`conditions`	A vector of factors indicating the biological condition to which each column (experimental sample) belongs.
`repbio`	A vector of factors indicating the biological replicate to which each column belongs. Default is NULL (no experimental design is considered).
`reptech`	A vector of factors indicating the technical replicate to which each column belongs. Default is NULL (no experimental design is considered).
`nb.iter`	The number of iterations used for the multiple imputation method (see `mi.mix`).
`methodMCAR`	The method used for imputing MCAR data. If `methodMCAR="mle"` (default), then the `impute.mle` function is used (imputation using an EM algorithm). If `methodMCAR="pca"`, then the `impute.PCA` function is used (imputation using Principal Component Analysis). If `methodMCAR="rf"`, then the `impute.RF` function is used (imputation using Random Forest). Else, the `impute.slsa` function is used (imputation using Least Squares on nearest neighbours).
`methodMNAR`	The method used for imputing MNAR data. If `methodMNAR="igcda"` (default), then the `impute.igcda` function is used. Else, the `impute.pa` function is used.
`nknn`	The number of nearest neighbours used in the SLSA algorithm (see `impute.slsa`).
`selec`	A parameter to select a part of the dataset to find nearest neighbours between rows. This can be useful for big data sets (see `impute.slsa`).
`siz`	A parameter to select a part of the dataset to perform imputations with the MCAR-devoted algorithm. This can be useful for big data sets (see `mi.mix`).
`weight`	The way of weighting in the algorithm (see `impute.slsa`).
`ind.comp`	If `ind.comp=1`, only nearest neighbours without missing values are selected to fit linear models (see `impute.slsa`). Else, they can contain missing values.
`progress.bar`	If `TRUE`, a progress bar is displayed.
`x.step.mod`	The number of points in the intervals used for estimating the cumulative distribution functions of the mixing model in each column (see `estim.mix`).
`x.step.pi`	The number of points in the intervals used for estimating the proportion of MCAR values in each column (see `estim.mix`).
`nb.rei`	The number of initializations of the minimization algorithm used to estimate the proportion of MCAR values (see `estim.mix`).
`q`	A quantile value (see `impute.igcda`).
`ncp.max`	Parameter of the `impute.PCA` function (used when `methodMCAR="pca"`).
`maxiter`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`ntree`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`variablewise`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`decreasing`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`verbose`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`mtry`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`replace`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`classwt`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`cutoff`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`strata`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`sampsize`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`nodesize`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`maxnodes`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`xtrue`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`parallelize`	Parameter of the `impute.RF` function (used when `methodMCAR="rf"`).
`q.min`	Parameter of the `impute.pa` function.
`q.norm`	Parameter of the `impute.pa` function.
`eps`	Parameter of the `impute.pa` function.
`distribution`	Parameter of the `impute.pa` function.
`param1`	Parameter of the `impute.pa` function.
`param2`	Parameter of the `impute.pa` function.
`R.q.min`	Parameter of the `impute.pa` function.
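
As a minimal sketch of how the `tab` and `conditions` arguments fit together, the base R snippet below builds a toy matrix and a matching factor, then counts the observed values per peptide and condition. The data, the two-condition design and the name `obs_per_cond` are assumptions made for illustration; they are not part of imp4p.

```r
# Toy intensity matrix: 4 peptides (rows) x 6 samples (columns).
# All values are made up for illustration.
set.seed(1)
tab <- matrix(rnorm(24, mean = 25, sd = 2), nrow = 4, ncol = 6)
colnames(tab) <- paste0("sample", 1:6)
tab[1, 2] <- NA    # a single missing value in condition A
tab[3, 4:6] <- NA  # peptide 3 has no value at all in condition B

# Assumed design: two conditions, three samples each.
# `conditions` needs one entry per column of `tab`.
conditions <- factor(rep(c("A", "B"), each = 3))
stopifnot(length(conditions) == ncol(tab))

# Number of observed values per peptide (rows) and condition (columns).
obs_per_cond <- sapply(levels(conditions), function(k) {
  rowSums(!is.na(tab[, conditions == k, drop = FALSE]))
})
print(obs_per_cond)

# With imp4p installed and a realistically sized data set, the call
# would then look like this (not run here):
# result <- imp4p::impute.mi(tab = tab, conditions = conditions)
```

A row with a zero in `obs_per_cond` is a peptide with no value in that condition. The optional `repbio` and `reptech` factors would follow the same one-entry-per-column layout.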

Details

First, a mixture model of MCAR and MNAR values is estimated in each column of tab. This model is used to estimate the probability that each missing value is MCAR. Then, these probabilities are used to perform a multiple imputation strategy (see mi.mix). Rows with no value in a condition are imputed using the impute.pa function. More details and explanations can be found in Giai Gianetto et al. (2020).
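
The idea of turning MCAR probabilities into a multiple imputation can be illustrated with a small base R sketch. This is not the imp4p implementation: the probabilities in `p_mcar` are made up rather than estimated from a mixture model, the two imputers are simple stand-ins (observed mean for MCAR, observed minimum for MNAR), and combining the draws by averaging is an assumption.

```r
set.seed(42)

# One column of log-intensities with missing values (made-up numbers).
x <- c(24.1, NA, 26.3, 22.8, NA, 25.0, NA, 23.5)
miss <- which(is.na(x))
obs <- x[!is.na(x)]

# Hypothetical probabilities that each missing value is MCAR.
# In impute.mi these would come from the estimated mixture model.
p_mcar <- c(0.9, 0.2, 0.5)

# Number of imputation rounds (plays the role of nb.iter).
nb_iter <- 3

# In each round, label every missing value as MCAR or MNAR at random
# according to p_mcar, then impute it with the matching stand-in rule.
draws <- replicate(nb_iter, {
  is_mcar <- rbinom(length(miss), size = 1, prob = p_mcar) == 1
  ifelse(is_mcar, mean(obs), min(obs))
})

# Combine the rounds by averaging (assumption for this sketch).
x_imp <- x
x_imp[miss] <- rowMeans(draws)
print(x_imp)
```

A missing value with a high MCAR probability ends up close to the central stand-in value in most rounds, while one with a low probability is pulled towards the low stand-in value.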

Value

The input matrix tab with imputed values instead of missing values.
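
A few sanity checks on a returned matrix can be sketched in base R as follows. The helper `check_imputed` is hypothetical (it is not part of imp4p), the toy data are made up, and the row-mean imputation only stands in for `impute.mi`. The check that observed values are left unchanged is an assumption drawn from the description above.

```r
# Hypothetical helper: basic checks on an imputed matrix.
check_imputed <- function(tab, imputed) {
  observed <- !is.na(tab)
  c(
    same_dim = identical(dim(tab), dim(imputed)),
    no_missing = !anyNA(imputed),
    observed_kept = isTRUE(all.equal(tab[observed], imputed[observed]))
  )
}

# Toy demonstration with a stand-in imputation (row means).
tab <- matrix(c(20, NA, 22,
                25, 26, NA), nrow = 2, byrow = TRUE)
imputed <- tab
for (i in seq_len(nrow(tab))) {
  imputed[i, is.na(tab[i, ])] <- mean(tab[i, ], na.rm = TRUE)
}

checks <- check_imputed(tab, imputed)
print(checks)
```

With a real result, the same helper could be called as `check_imputed(res.sim$dat.obs, result)`, using the objects from the Examples section.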

Author(s)

Quentin Giai Gianetto <quentin2g@yahoo.fr>

References

Giai Gianetto, Q., Wieczorek, S., Couté, Y., Burger, T. (2020). A peptide-level multiple imputation strategy accounting for the different natures of missing values in proteomics data. bioRxiv 2020.05.29.122770; doi: 10.1101/2020.05.29.122770

Examples


#Simulating data
res.sim=sim.data(nb.pept=2000,nb.miss=600,nb.cond=2);

#Imputation of the dataset noting the conditions to which the samples belong.
result=impute.mi(tab=res.sim$dat.obs, conditions=res.sim$conditions);

#Imputation of the dataset noting the conditions to which the samples belong
#and also their biological replicate, and using the SLSA method for the MCAR values
result=impute.mi(tab=res.sim$dat.obs, conditions=res.sim$conditions,
repbio=res.sim$repbio, methodMCAR = "slsa");

#For large data sets, the SLSA imputation can be accelerated thanks to the selec parameter
#and the siz parameter (see impute.slsa and mi.mix)
#but it may result in a less accurate data imputation. Note that selec has to be greater than siz.
#Here, nb.iter is fixed to 3
result1=impute.mi(tab=res.sim$dat.obs, conditions=res.sim$conditions, progress.bar=TRUE,
selec=400, siz=300, nb.iter=3, methodMCAR = "slsa", methodMNAR = "igcda");

[Package imp4p version 1.2 Index]