mvrt.test {MVR}    R Documentation

Function for Computing Mean-Variance Regularized T-test Statistic and Its Significance

Description

End-user function for computing MVR t-test statistic and its significance (p-value) under sample group homoscedasticity or heteroscedasticity assumption.

Returns an object of class "mvrt.test". Offers optional parallel computation for improved efficiency.

Usage

    mvrt.test(data, 
              obj=NULL,
              block,
              tolog = FALSE, 
              nc.min = 1, 
              nc.max = 30, 
              pval = FALSE, 
              replace = FALSE, 
              n.resamp = 100, 
              parallel = FALSE,
              conf = NULL,
              verbose = TRUE, 
              seed = NULL)

Arguments

data

numeric matrix of untransformed (raw) data, where samples are by rows and variables (to be clustered) are by columns, or an object that can be coerced to such a matrix (such as a numeric vector or a data.frame with all numeric columns). Missing values (NA), NotANumber values (NaN) or Infinite values (Inf) are not allowed.

obj

Object of class mvr returned by mvr.

block

character or numeric vector, or factor, giving the group membership of each sample (grouping/blocking variable). Its length must equal the data sample size, and it must have as many distinct values or levels as there are data sample groups. Defaults to single group situation. See details.

tolog

logical scalar. Is the data to be log2-transformed first? Optional, defaults to FALSE. Note that negative or zero values will be set to 1 before the log2-transformation is applied.

nc.min

Positive integer scalar of the minimum number of clusters, defaults to 1

nc.max

Positive integer scalar of the maximum number of clusters, defaults to 30

pval

logical scalar. Shall p-values be computed? If FALSE (default), only the t-statistics will be computed, and n.resamp and replace will be ignored. If TRUE, exact (permutation test) or approximate (bootstrap test) p-values will be computed.

replace

logical scalar. Shall a permutation test (default) or a bootstrap test be computed? If FALSE (default), a permutation test will be computed with the null permutation distribution. If TRUE, a bootstrap test will be computed with the null bootstrap distribution.

n.resamp

Positive integer scalar of the number of resamplings to compute (default=100) by permutation or bootstrap (see details).

parallel

logical scalar. Is parallel computing to be performed? Optional, defaults to FALSE.

conf

list of 5 fields containing the parameter values needed for creating the parallel backend (cluster configuration). See details below for usage. Optional, defaults to NULL, but all fields are required if used:

  • type : character vector specifying the cluster type ("SOCKET", "MPI").

  • spec : A specification (character vector or integer scalar) appropriate to the type of cluster.

  • homogeneous : logical scalar to be set to FALSE for inhomogeneous clusters.

  • verbose : logical scalar to be set to FALSE for quiet mode.

  • outfile : character vector of an output log file name to direct the stdout and stderr connection output from the worker nodes. "" indicates no redirection.

verbose

logical scalar. Is the output to be verbose? Optional, defaults to TRUE.

seed

Positive integer scalar of the user seed to reproduce the results.

Details

Argument block will be converted to a factor, whose levels will match the data groups. It defaults to a single group situation, that is, under the assumption of equal variance between sample groups. All group sample sizes must be greater than 1, otherwise the program will stop.
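The handling of block described above can be sketched as follows. This is a minimal illustration, not the package's internal code: check_block is a hypothetical helper, and treating a NULL block as a single group is an assumption made here to mirror the documented default.

```r
# Hypothetical helper, NOT part of MVR: mimics the documented handling of 'block'.
check_block <- function(block, n) {
  if (is.null(block)) {
    # Assumption: a missing block means a single group containing all samples
    block <- rep(1, n)
  }
  block <- as.factor(block)            # levels correspond to the sample groups
  stopifnot(length(block) == n)        # one group label per sample
  sizes <- table(block)                # sample size of each group
  if (any(sizes <= 1)) {
    stop("All group sample sizes must be greater than 1")
  }
  sizes
}

# Two groups of sizes 3 and 2 pass the check
sizes <- check_block(c("G1", "G1", "G1", "G2", "G2"), n = 5)
print(sizes)
```

A block with a group of size 1, such as c("a", "b", "b"), makes the helper stop with an error, in line with the behavior stated above.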

Argument nc.max currently defaults to 30. Empirically, we found that this is enough for most datasets tested. This depends on (i) the dimensionality/sample size ratio p/n, (ii) the signal/noise ratio, and (iii) whether a pre-transformation has been applied (see Dazard, J-E. and J. S. Rao (2012) for more details). See the cluster diagnostic function cluster.diagnostic to determine whether larger values of nc.max may be required.

To save unnecessary computations, a previously computed MVR clustering can be provided through option obj (i.e. obj is fully specified as a mvr object). In this case, arguments data, block, tolog, nc.min, nc.max are ignored. If obj is fully specified (i.e. an object of class "mvr" returned by mvr), the MVR clustering provided by obj will be used for the computation of the regularized t-test statistics. If obj=NULL, a MVR clustering computation for the regularized t-test statistics and/or p-values will be performed.

The function mvrt.test relies on the R package parallel to create a parallel backend within an R session. This gives access to a cluster of compute cores and/or nodes on local and/or remote machines, so that execution scales with the number of CPU cores available. To run a procedure in parallel (with parallel RNG), argument parallel is to be set to TRUE and argument conf is to be specified (i.e. non NULL). Argument conf uses the options described in function makeCluster of the R packages parallel and snow. MVR supports two types of communication mechanisms between master and worker processes: 'Socket' or 'Message-Passing Interface' ('MPI'). In MVR, parallel 'Socket' clusters use socket communication mechanisms only (no forking) and are therefore available on all platforms, including Windows, while parallel 'MPI' clusters use high-speed interconnect mechanisms in networks of computers (with distributed memory) and are therefore available only on these architectures. A parallel 'MPI' cluster also requires R package Rmpi to be installed first. Value type is used to set up a cluster of type 'Socket' ("SOCKET") or 'MPI' ("MPI"), respectively. The appropriate value of spec depends on this type: in the examples below, it is a character vector of host names for a 'Socket' cluster and the total number of requested CPU cores for an 'MPI' cluster.

The actual creation of the cluster, its initialization, and closing are all done internally. For more details, see the reference manual of R package snow and examples below.

When random number generation is needed, the creation of separate streams of parallel RNG per node is done internally by distributing the stream states to the nodes. For more details, see the vignette of R package parallel. The use of a seed makes it possible to reproduce the results within the same type of session: the same seed will reproduce the same results within a non-parallel session or within a parallel session, but it will not necessarily give the exact same results (up to sampling variability) between a non-parallelized and a parallelized session, because the seed is managed differently in the two cases (see parallel RNG and value of returned seed below).
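The idea of one RNG stream per node can be sketched with base R and package parallel alone, without starting a cluster. This is only an illustration of the general mechanism (L'Ecuyer-CMRG streams advanced with parallel::nextRNGStream), not MVR's internal code; make_streams is a hypothetical helper.

```r
# Hypothetical helper, NOT part of MVR: build one L'Ecuyer-CMRG stream state
# per worker from a single user seed.
make_streams <- function(seed, n.workers) {
  old.kind <- RNGkind("L'Ecuyer-CMRG")   # switch generator, remember previous kinds
  on.exit(RNGkind(old.kind[1]))          # switch back to the previous generator on exit
  set.seed(seed)
  streams <- vector("list", n.workers)
  # The first worker receives the state produced by the user seed itself
  streams[[1]] <- get(".Random.seed", envir = globalenv())
  for (i in seq_len(n.workers - 1)) {
    # Each further worker receives the state of the next stream
    streams[[i + 1]] <- parallel::nextRNGStream(streams[[i]])
  }
  streams
}

s1 <- make_streams(seed = 1234, n.workers = 3)
s2 <- make_streams(seed = 1234, n.workers = 3)
print(identical(s1, s2))            # compares the two runs made with the same seed
print(identical(s1[[1]], s1[[2]]))  # compares the states of two different workers
```

Note that the helper changes the global RNG state as a side effect, so it is best called before any seed is set for other purposes.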

In case p-values are desired (pval=TRUE), the use of a cluster is highly recommended. It is ideal for computing embarrassingly parallel tasks such as permutation or bootstrap resamplings. Note that in case both regularized t-test statistics and p-values are desired, in order to maximize computational efficiency and avoid multiple configurations (since a cluster can only be configured and used one session at a time, which otherwise would result in a run stop), the cluster configuration will only be used for the parallel computation of p-values, but not for the MVR clustering computation of the regularized t-test statistics.
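To show why resampling is embarrassingly parallel, here is a minimal sequential sketch of permutation p-values. It is an assumption-laden illustration: ordinary Welch t-statistics stand in for the MVR regularized statistics (which this sketch does not compute), and welch_t and perm_pvalues are hypothetical helpers, not package functions. Each resampling is independent of the others, which is what makes the loop easy to distribute over workers.

```r
# Illustrative only: plain Welch t-statistics, NOT the MVR regularized statistics.
welch_t <- function(x, g) {
  lv <- levels(g)
  a <- x[g == lv[1], , drop = FALSE]   # samples of the first group
  b <- x[g == lv[2], , drop = FALSE]   # samples of the second group
  se <- sqrt(apply(a, 2, var) / nrow(a) + apply(b, 2, var) / nrow(b))
  (colMeans(a) - colMeans(b)) / se     # one statistic per variable (column)
}

# Hypothetical helper: two-sided permutation p-values, one per variable
perm_pvalues <- function(x, g, n.resamp = 100, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  t.obs <- welch_t(x, g)
  # Null statistics: one column per independent permutation of the group labels
  t.null <- matrix(
    unlist(lapply(seq_len(n.resamp), function(b) welch_t(x, sample(g)))),
    nrow = ncol(x)
  )
  # Share of null statistics at least as extreme as the observed one
  rowMeans(abs(t.null) >= abs(t.obs))
}

# Simulated data: 10 samples by 4 variables, the first variable shifted in group G2
set.seed(1)
x <- matrix(rnorm(10 * 4), nrow = 10)
x[6:10, 1] <- x[6:10, 1] + 10
g <- gl(2, 5, labels = c("G1", "G2"))
p <- perm_pvalues(x, g, n.resamp = 200, seed = 42)
print(p)
```

In a parallel setting, the lapply over resamplings is the part that would be split across workers, each with its own RNG stream.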

Value

statistic

vector, of size the number of variables, where entries are the t-statistics values of each variable.

p.value

vector, of size the number of variables, where entries are the p-values (if requested, otherwise NULL value) of each variable.

seed

User seed(s) used: a single integer value if parallelization is used, or an integer vector of values, one for each replication, if parallelization is not used.

Acknowledgments

This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. This project was partially funded by the National Institutes of Health (P30-CA043703).

Note

End-user function.

Author(s)

Maintainer: "Jean-Eudes Dazard, Ph.D." jean-eudes.dazard@case.edu

References

See Also

Examples

#================================================
# Loading the library and its dependencies
#================================================
library("MVR")

## Not run: 
    #===================================================
    # MVR package news
    #===================================================
    MVR.news()

    #================================================
    # MVR package citation
    #================================================
    citation("MVR")

    #===================================================
    # Loading of the Synthetic and Real datasets
    # Use help for descriptions
    #===================================================
    data("Synthetic", "Real", package="MVR")
    ?Synthetic
    ?Real

## End(Not run)

#================================================
# Regularized t-test statistics (Synthetic dataset) 
# Multi-Group Assumption
# Assuming unequal variance between groups
# With option to use prior MVR clustering results
# Without computation of p-values
# Without cluster usage
#================================================
nc.min <- 1
nc.max <- 10
probs <- seq(0, 1, 0.01)
n <- 10
GF <- factor(gl(n = 2, k = n/2, length = n), 
             ordered = FALSE, 
             labels = c("G1", "G2"))
mvr.obj <- mvr(data = Synthetic, 
               block = GF, 
               tolog = FALSE, 
               nc.min = nc.min, 
               nc.max = nc.max, 
               probs = probs,
               B = 100,
               parallel = FALSE, 
               conf = NULL,
               verbose = TRUE,
               seed = 1234)
mvrt.obj <- mvrt.test(data = NULL,
                      obj = mvr.obj,
                      block = NULL,
                      pval = FALSE,
                      replace = FALSE,
                      n.resamp = 100,
                      parallel = FALSE,
                      conf = NULL,
                      verbose = TRUE,
                      seed = 1234)       
## Not run: 
    #===================================================
    # Examples of parallel backend parametrization 
    #===================================================
    if (require("parallel")) {
       # cat() interprets "\n" as a newline, whereas print() would show it escaped
       cat("'parallel' is attached correctly \n")
    } else {
       stop("'parallel' must be attached first \n")
    }
    #===================================================
    # Example #1 - Quad core PC 
    # Running WINDOWS with SOCKET communication
    #===================================================
    cpus <- parallel::detectCores(logical = TRUE)
    conf <- list("spec" = rep("localhost", cpus),
                 "type" = "SOCKET",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Example #2 - Master node + 3 Worker nodes cluster
    # Running LINUX with SOCKET communication
    # All nodes equipped with identical setups of 
    # multicores (8 core CPUs per machine for a total of 32)
    #===================================================
    masterhost <- Sys.getenv("HOSTNAME")
    slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
    nodes <- length(slavehosts) + 1
    cpus <- 8
    conf <- list("spec" = c(rep(masterhost, cpus),
                            rep(slavehosts, cpus)),
                 "type" = "SOCKET",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Example #3 - Multinode of multicore per node cluster
    # Running LINUX with SLURM scheduler and MPI communication
    # Below, variable 'cpus' is the total number 
    # of requested core CPUs, which is specified from  
    # within a SLURM script.
    #===================================================
    if (require("Rmpi")) {
        # cat() interprets "\n" as a newline, whereas print() would show it escaped
        cat("'Rmpi' is attached correctly \n")
    } else {
        stop("'Rmpi' must be attached first \n")
    }
    cpus <- as.numeric(Sys.getenv("SLURM_NTASKS"))
    conf <- list("spec" = cpus,
                 "type" = "MPI",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Mean-Variance Regularization (Real dataset)
    # Multi-Group Assumption
    # Assuming unequal variance between groups
    #===================================================
    nc.min <- 1
    nc.max <- 30
    probs <- seq(0, 1, 0.01)
    n <- 6
    GF <- factor(gl(n = 2, k = n/2, length = n), 
                 ordered = FALSE, 
                 labels = c("M", "S"))
    mvr.obj <- mvr(data = Real, 
                   block = GF, 
                   tolog = FALSE, 
                   nc.min = nc.min, 
                   nc.max = nc.max, 
                   probs = probs,
                   B = 100, 
                   parallel = TRUE, 
                   conf = conf,
                   verbose = TRUE,
                   seed = 1234)
    #===================================================
    # Regularized t-test statistics (Real dataset) 
    # Multi-Group Assumption
    # Assuming unequal variance between groups
    # With option to use prior MVR clustering results
    # With computation of p-values
    #===================================================
    mvrt.obj <- mvrt.test(data = NULL,
                          obj = mvr.obj,
                          block = NULL,
                          pval = TRUE,
                          replace = FALSE,
                          n.resamp = 100,
                          parallel = TRUE,
                          conf = conf,
                          verbose = TRUE,
                          seed = 1234)
    
## End(Not run)

[Package MVR version 1.33.0 Index]