CScluster {CSFA}    R Documentation

CScluster

Description

Apply connectivity scores (CS) to a clustering result with K clusters. More information can be found in the Details section below.

Usage

CScluster(data, clusterlabels, type = "CSmfa", WithinABS = TRUE,
  BetweenABS = TRUE, FactorABS = FALSE, verbose = FALSE, Within = NULL,
  Between = NULL, WithinSave = FALSE, BetweenSave = TRUE, ...)

Arguments

data

A gene expression matrix with the compounds in the columns.

clusterlabels

A vector of integers that represents the cluster grouping of the columns (compounds) in data. The labels should be integers running from 1 to the total number of clusters (e.g. the output of cutree).
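As a minimal sketch, labels of this form could be obtained with base R hierarchical clustering. The matrix `extra` below is hypothetical simulated data standing in for a second data source; the choice of K = 4 and average linkage is also only illustrative.

```r
# Hypothetical second data source: 20 compounds (rows) x 5 features.
set.seed(1)
extra <- matrix(rnorm(20 * 5), nrow = 20,
                dimnames = list(paste0("cmpd", 1:20), NULL))

# Hierarchical clustering of the compounds, cut into K = 4 clusters.
K <- 4
tree <- hclust(dist(extra), method = "average")
clusterlabels <- cutree(tree, k = K)

# cutree returns integer labels 1..K, one per compound,
# in the same order as the rows of `extra`.
table(clusterlabels)
```

The order of the labels has to match the order of the columns (compounds) in data.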

type

Type of CS analysis (default="CSmfa"):

  • "CSmfa" (MFA or PCA)

  • "CSsmfa" (Sparse MFA or Sparse PCA)

  • "CSfabia" (Fabia)

  • "CSzhang" (Zhang and Gant)

In the first two options, either MFA or PCA is used depending on the cluster size. If the query set only contains a single compound, the latter is used. Also note that if a cluster only contains a single compound, no Within-CS can be computed.

WithinABS

Boolean value to take the mean of the absolute values in the final step of the Within-Cluster CS (default=TRUE).

BetweenABS

Boolean value to take the mean of the absolute values in the final step of the Between-Cluster CS (default=TRUE).

FactorABS

Boolean value to take the absolute value of the query loadings when determining the best factor (= factor with highest query loadings) in a CSanalysis application (default=FALSE). This option might be helpful if the 'best factor' contains large positive and negative query loadings which would average to zero.
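The cancellation effect can be illustrated with a small base R sketch. The loadings are made-up numbers, and picking the factor with the highest mean query loading is an assumption about the selection rule made for illustration only.

```r
# Hypothetical query loadings on two factors (rows = query compounds).
loadings <- cbind(Factor1 = c(0.9, -0.9, 0.8, -0.8),
                  Factor2 = c(0.3,  0.2, 0.3,  0.2))

# Plain means: the large loadings on Factor1 cancel each other out.
mean_raw <- colMeans(loadings)
# Means of absolute values: Factor1 now clearly dominates.
mean_abs <- colMeans(abs(loadings))

best_raw <- names(which.max(mean_raw))  # "Factor2"
best_abs <- names(which.max(mean_abs))  # "Factor1"
```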

verbose

Boolean value to output warnings and information about which factor is chosen in a CS analysis (if applicable).

Within

A vector of cluster numbers for which the Within-Cluster CS should be computed. By default (=NULL) all within-cluster scores are computed, but this might not be feasible for larger data in which a single CSanalysis run might already take a considerable amount of computation time.

Between

A vector of cluster numbers for which the Between-Cluster CS (with the cluster as a query set) should be computed. By default (=NULL) all between-cluster scores are computed, but this might not be feasible for larger data in which a single CSanalysis run might already take a considerable amount of computation time.
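A sketch of how the Within and Between arguments might be used to limit computation. The cluster numbers and the random labels are illustrative choices, and the call is guarded so that it only runs when the CSFA package is installed.

```r
# Cluster numbers to restrict the analysis to (illustrative choice).
within_subset  <- c(1, 2)  # Within-Cluster CS for clusters 1 and 2 only
between_subset <- 1        # only cluster 1 acts as query set between clusters

# The call needs the CSFA package, so it only runs when CSFA is installed.
if (requireNamespace("CSFA", quietly = TRUE)) {
  library(CSFA)
  data("dataSIM", package = "CSFA")
  set.seed(1)
  # Random labels, purely for illustration (as in the Examples section).
  clusterlabels <- sample(1:10, size = ncol(dataSIM), replace = TRUE)
  result <- CScluster(dataSIM, clusterlabels, type = "CSzhang",
                      Within = within_subset, Between = between_subset)
}
```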

WithinSave

Boolean value to save the Within object in the Save slot of the returned list (default=FALSE).

BetweenSave

Boolean value to save the Between object in the Save slot of the returned list (default=TRUE).

...

Additional parameters given to CSanalysis specific to a certain type of CS analysis.

Details

After applying cluster analysis on the additional data matrix, K clusters are obtained. Each cluster will be seen as a potential query set (for CSanalysis) for which two connectivity score metrics can be computed: the Within-Cluster CS and the Between-Cluster CS.

Within-Cluster CS
This metric answers the question of whether the kth cluster is connected on a gene expression level (in addition to the samples being similar based on the other data source). The Within-Cluster CS for a cluster is computed as follows:

  1. Repeatedly for the ith sample in the kth cluster, apply CSMFA with:

  2. The Within-Cluster CS for cluster k is now defined as the average of all retrieved CS.

The idea behind this metric is to investigate the connectivity of each compound with the rest of its cluster. The average of the 'leave-one-out' connectivity scores, the Within-Cluster CS, gives an indication of the gene expression connectivity of this cluster. A high Within-Cluster CS implies that the cluster is similar both on the external data source and on the gene expression level. A low score indicates that the cluster does not share a similar latent gene profile structure.
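The leave-one-out idea can be sketched in base R as follows. This is not the package's implementation: `toy_cs` is a hypothetical correlation-based stand-in for a CSMFA run, the data are simulated, and the query/reference split (left-out compound added to the reference set) is an assumption made for this illustration.

```r
# Stand-in for a CS analysis: scores each reference compound by its mean
# absolute correlation with the query compounds. This is NOT CSMFA,
# only a placeholder so that the loop can run.
toy_cs <- function(query, ref) {
  colMeans(abs(cor(query, ref)))
}

# Simulated expression matrix: 50 genes x 12 compounds in 3 clusters.
set.seed(1)
expr <- matrix(rnorm(50 * 12), nrow = 50,
               dimnames = list(NULL, paste0("cmpd", 1:12)))
clusterlabels <- rep(1:3, each = 4)

within_cs <- function(expr, clusterlabels, k, score = toy_cs) {
  members <- which(clusterlabels == k)
  # A cluster with a single compound has no Within-CS.
  if (length(members) < 2) return(NA_real_)
  loo <- vapply(members, function(i) {
    # Query set: cluster k without compound i.
    query <- expr[, setdiff(members, i), drop = FALSE]
    # Assumption: compound i joins all other compounds as reference.
    ref <- expr[, c(i, which(clusterlabels != k)), drop = FALSE]
    score(query, ref)[[1]]  # score of the left-out compound i
  }, numeric(1))
  mean(loo)  # average of the leave-one-out scores
}

within_cs(expr, clusterlabels, k = 1)
```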

Between-Cluster CS
In this stage of the analysis, we focus on the lth cluster and use all compounds in this cluster as the query set. A CSMFA is performed in which all other clusters are the reference set. Next, the connectivity scores are calculated for all reference compounds and averaged per reference cluster (=the between connectivity score). A high Between-Cluster CS between the lth cluster and a reference cluster j implies that, while the two clusters are not similar based on the other data source, they do share a latent structure when considering the gene expression data.
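The per-cluster averaging can be sketched in base R as follows. As before, `toy_cs` is a hypothetical correlation-based stand-in for a CSMFA run and the data are simulated; only the averaging step is meant to mirror the description above.

```r
# Stand-in for a CS analysis (NOT CSMFA): mean absolute correlation of
# each reference compound with the query compounds.
toy_cs <- function(query, ref) {
  colMeans(abs(cor(query, ref)))
}

# Simulated expression matrix: 50 genes x 12 compounds in 3 clusters.
set.seed(1)
expr <- matrix(rnorm(50 * 12), nrow = 50,
               dimnames = list(NULL, paste0("cmpd", 1:12)))
clusterlabels <- rep(1:3, each = 4)

between_cs <- function(expr, clusterlabels, l, score = toy_cs) {
  in_query <- clusterlabels == l
  # One score per reference compound, with cluster l as the query set.
  cs <- score(expr[, in_query, drop = FALSE],
              expr[, !in_query, drop = FALSE])
  # Average the per-compound scores within each reference cluster.
  tapply(cs, clusterlabels[!in_query], mean)
}

# Between-Cluster CS of cluster 1 with clusters 2 and 3.
between_cs(expr, clusterlabels, l = 1)
```

Repeating this for every l would give one row per query cluster, which could be collected into a K x K matrix.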

Value

A list object with components:

Author(s)

Ewoud De Troyer

Examples

 

  # Example data set
  data("dataSIM", package = "CSFA")
  # Remove the first 250 no-connectivity compounds (column names containing "c-")
  nosignal <- grepl("c-", colnames(dataSIM))
  data <- dataSIM[, -which(nosignal)[1:250]]

  # Toy example with random cluster assignment (seed fixed for reproducibility)
  # Note: clusterlabels can be acquired through cutree(hclust(...))
  set.seed(1)
  clusterlabels <- sample(1:10, size = ncol(data), replace = TRUE)

  result1 <- CScluster(data, clusterlabels, type = "CSmfa")
  result2 <- CScluster(data, clusterlabels, type = "CSzhang")

  # Inspect the returned score matrices
  result1$CSmatrix
  result1$CSRankmatrix

  result2$CSmatrix


[Package CSFA version 1.2.0 Index]