rmad {RSC} | R Documentation |
RMAD correlation matrix
Description
Compute the RMAD robust correlation matrix proposed in Serra et al. (2018) based on the robust correlation coefficient proposed in Pasman and Shevlyakov (1987).
Usage
rmad(x, y = NULL, na.rm = FALSE, even.correction = FALSE, num.threads = "half-max")
Arguments
x |
A numeric vector, a matrix or a data.frame. If |
y |
A numerical vector if not |
na.rm |
A logical value, if |
even.correction |
A logical value, if |
num.threads |
An integer value or the string |
Details
The rmad function computes the correlation matrix based on the pairwise robust correlation coefficient of Pasman and Shevlyakov (1987). This coefficient relies on repeated median calculations for every pair of variables, which is computationally intensive when the number of variables (that is, ncol(x)) is large.
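To illustrate what is computed for a single pair of variables, here is a minimal base-R sketch of a MAD-based correlation coefficient, in the form usually attributed to Pasman and Shevlyakov (1987). The name mad_cor_sketch is hypothetical. The sketch uses exact medians via median() and mad(); it is not the package's C implementation, and its results may differ from those of rmad, which approximates the median.

```r
# Hypothetical helper (not part of RSC): MAD-based correlation of two vectors,
# assuming the form r = (MAD(u)^2 - MAD(v)^2) / (MAD(u)^2 + MAD(v)^2).
mad_cor_sketch <- function(x, y) {
  # Robust standardization: center by the median, scale by the raw MAD
  xs <- (x - median(x)) / mad(x, constant = 1)
  ys <- (y - median(y)) / mad(y, constant = 1)
  # Scaled sum and difference of the standardized variables
  u <- (xs + ys) / sqrt(2)
  v <- (xs - ys) / sqrt(2)
  mu2 <- mad(u, constant = 1)^2
  mv2 <- mad(v, constant = 1)^2
  (mu2 - mv2) / (mu2 + mv2)
}

set.seed(1)
x <- rt(101, df = 1)      # heavy-tailed sample
y <- x + rt(101, df = 1)  # second variable, dependent on x
r <- mad_cor_sketch(x, y)
```

Applying such a function to every pair of columns requires several medians per pair, which is why the cost grows quickly with ncol(x).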
The software is optimized for high-dimensional data sets: the median is approximated by the central observation, obtained with the find algorithm of Hoare (1961) (also known as quickselect), implemented in C. For small samples this may be a crude approximation; however, it keeps the computational cost feasible for high-dimensional data sets. With even.correction = TRUE a correction is applied to reduce the bias for data sets with an even number of samples. Although even.correction = TRUE adds only a small computational cost for each pair of variables, the default even.correction = FALSE is suggested for high-dimensional data sets.
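The difference between a central observation and the conventional median only arises for an even number of samples. The base-R sketch below shows it on a toy vector. It assumes that "central observation" means the lower of the two central order statistics; which one the C code actually selects is not stated here, and the partial sort merely stands in for quickselect.

```r
# Even-sized sample: the conventional median averages the two central values
x <- c(10, 1, 3, 2)
n <- length(x)

# Assumption: take the lower central order statistic as the "central observation"
k <- ceiling(n / 2)
# A partial sort places the k-th smallest value at position k without a full sort
central <- sort(x, partial = k)[k]

central    # 2
median(x)  # 2.5
```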
The function can handle a data matrix with missing values (NA records). If na.rm = TRUE, missing values are handled by casewise deletion: all rows of x that contain at least one NA are removed. If there are no complete cases, an error is returned.
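Casewise deletion can be reproduced in base R, which helps when checking how many rows survive before calling rmad. This is only a sketch of the behaviour described above, not the package's internal code; the object names and the error message are illustrative.

```r
# Toy data matrix with missing values in rows 2 and 3
x <- cbind(a = c(1, 2, NA, 4),
           b = c(5, NA, 7, 8),
           c = c(9, 10, 11, 12))

# Keep only the rows with no NA in any column
keep <- complete.cases(x)
x_complete <- x[keep, , drop = FALSE]

# Mirror the documented behaviour: no complete cases is an error
if (nrow(x_complete) == 0) stop("no complete cases")

nrow(x_complete)  # 2
```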
Since the software is optimized to work with high-dimensional data sets, the output RMAD matrix is packed into a storage-efficient format using the "dspMatrix" S4 class from the Matrix package, which is specifically designed for dense real symmetric matrices. A sparse correlation matrix can be obtained by thresholding with the rsc_cv and rsc functions.
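The packed format stores only one triangle of the symmetric matrix. The sketch below builds a packed symmetric matrix with the Matrix package and converts it back to a base matrix. To run without RSC, it uses an ordinary Pearson correlation matrix as a stand-in for the output of rmad; the object names are illustrative.

```r
library(Matrix)

set.seed(1)
p <- 5
# Stand-in for an RMAD matrix: a Pearson correlation matrix
m <- cor(matrix(rnorm(12 * p), nrow = 12, ncol = p))

# Mark the matrix as symmetric, then pack it so only one triangle is stored
s <- pack(forceSymmetric(m))

# Number of stored values: p * (p + 1) / 2 instead of p^2
length(s@x)

# Convert back to a base matrix when a function does not accept Matrix classes
dense <- as.matrix(s)
```

The same as.matrix() conversion should apply to the object returned by rmad, at the cost of the extra memory the packed format avoids.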
The rmad function supports parallel execution. This is provided via OpenMP (http://www.openmp.org), which must already be available on the system at installation time; otherwise, the function falls back to single-core execution. If OpenMP is installed later, the RSC package needs to be re-installed (re-compiled) to enable multi-threaded execution.
If num.threads > 0, the function is executed using min(num.threads, max.threads) threads, where max.threads is the maximum number of available threads. That is, the specified number of threads is used, up to the maximum available.
If num.threads < 0, the function is executed using max(max.threads - abs(num.threads), 1) threads. That is, a negative num.threads indicates the number of threads not to use (at least one thread is always used).
If num.threads == 0, a single thread is used (equivalent to num.threads = 1).
If num.threads == "half-max", the function is executed using half of the available threads (max(max.threads/2, 1)). This is the default.
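The rules for num.threads can be summarized in a short R sketch. The function resolve_threads_sketch is hypothetical and is not part of RSC; it takes max.threads as an argument rather than querying the system. It reads a negative num.threads as "threads not to use", as described above, and assumes that half of an odd max.threads is rounded down.

```r
# Hypothetical helper (not part of RSC): number of threads that would be used
resolve_threads_sketch <- function(num.threads, max.threads) {
  if (identical(num.threads, "half-max")) {
    # Assumption: integer division, so 3 available threads give 1
    return(max(max.threads %/% 2, 1))
  }
  if (num.threads > 0) {
    # Use the requested number, capped at the maximum available
    return(min(num.threads, max.threads))
  }
  if (num.threads < 0) {
    # Leave abs(num.threads) threads unused, but always keep at least one
    return(max(max.threads - abs(num.threads), 1))
  }
  # num.threads == 0: single thread
  1
}

resolve_threads_sketch(3, 8)           # 3
resolve_threads_sketch(20, 8)          # 8
resolve_threads_sketch(-2, 8)          # 6
resolve_threads_sketch(-10, 8)         # 1
resolve_threads_sketch("half-max", 8)  # 4
```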
Value
If x is a matrix or a data.frame, returns a correlation matrix of class "dspMatrix" (S4 class object) as defined in the Matrix package.
If x and y are numerical vectors, returns a numerical value, that is, the RMAD correlation coefficient between x and y.
References
Hoare, C. A. (1961). Algorithm 65: find. Communications of the ACM, 4(7), 321-322.
Musser, D. R. (1997). Introspective sorting and selection algorithms. Software: Practice and Experience, 27(8), 983-993.
Pasman, V. and Shevlyakov, G. (1987). Robust methods of estimation of correlation coefficient. Automation and Remote Control, 48, 332-340.
Serra, A., Coretto, P., Fratello, M., and Tagliaferri, R. (2018). Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data. Bioinformatics, 34(4), 625-634. doi: 10.1093/bioinformatics/btx642
See Also
rsc_cv, rsc
Examples
## simulate a random sample from a multivariate Cauchy distribution
set.seed(1)
n <- 100 # sample size
p <- 7 # dimension
dat <- matrix(rt(n*p, df = 1), nrow = n, ncol = p)
colnames(dat) <- paste0("Var", 1:p)
## compute the rmad correlation coefficient between dat[,1] and dat[,2]
a <- rmad(x = dat[,1], y = dat[,2])
## compute the RMAD correlation matrix
b <- rmad(x = dat)
b