r_cluster_data {eclust}    R Documentation

Cluster data using environmental exposure


This is one of the functions for real data analysis. It clusters the data twice: once accounting for the environment (exposure) and once ignoring it.


r_cluster_data(data, response, exposure, train_index, test_index,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore",
  "corScor", "diffcorr", "difftom"), measure_distance = c("euclidean",
  "maximum", "manhattan", "canberra", "binary", "minkowski"),
  minimum_cluster_size = 50, ...)
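
The character-vector defaults in the usage above list the accepted values for each argument. In R such defaults are conventionally resolved with match.arg(), which picks the first element when the argument is omitted. Whether r_cluster_data does this internally is an assumption; the helper pick_distance below is hypothetical and only illustrates the convention.

```r
# Hypothetical helper, not part of eclust: shows how a vector default
# is usually reduced to a single validated choice.
pick_distance <- function(measure_distance = c("euclidean", "maximum", "manhattan")) {
  measure_distance <- match.arg(measure_distance)
  measure_distance
}

pick_distance()             # argument omitted: first choice is used
pick_distance("manhattan")  # explicit choice is validated against the list
```

Under that convention, passing a value outside the listed choices raises an error rather than being silently ignored.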



data: n x p matrix of data. Rows are samples; columns are genes or CpG sites. Should not contain the environment variable.


response: numeric vector of length n.


exposure: binary (0,1) numeric vector of length n giving the exposure status of the n samples.


train_index: numeric vector indicating the indices of response and the rows of data that are in the training set.


test_index: numeric vector indicating the indices of response and the rows of data that are in the test set.


cluster_distance: character representing which matrix from the training set you want to use to cluster the genes. Must be one of the following:

  • corr, corr0, corr1, tom, tom0, tom1, diffcorr, difftom, corScor, tomScor, fisherScore


eclust_distance: character representing which matrix from the training set you want to use to cluster the genes based on the environment. See cluster_distance for available options. Should be different from cluster_distance, for example cluster_distance = "corr" and eclust_distance = "fisherScore". That is, one should be based on correlations ignoring the environment, and the other on correlations accounting for the environment. This function will always return this add-on.


measure_distance: one of "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski", passed to the dist function to calculate the distance for clusters based on the corr, corr1, corr0, tom, tom0 and tom1 matrices.


minimum_cluster_size: the minimum cluster size. Only applicable if cutMethod = 'dynamic'. This argument is passed to the cutreeDynamic function through the u_cluster_similarity function. Default is 50.


...: arguments passed to the u_cluster_similarity function.
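
The difference between a matrix that ignores the environment and one that accounts for it can be sketched with base R alone. The snippet below is a minimal illustration on simulated data, not the package's internal code. It assumes that "corr" is the correlation over all samples, that "corr0" and "corr1" are the correlations among unexposed and exposed samples, and that "diffcorr" is the absolute difference between the latter two. All object names are hypothetical.

```r
set.seed(1)
n <- 40; p <- 6
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
E <- rep(c(0, 1), each = n / 2)

# Make gene2 track gene1 among exposed samples only
X[E == 1, 2] <- X[E == 1, 1] + rnorm(n / 2, sd = 0.1)

corr_all  <- cor(X)               # ignores the exposure
corr0     <- cor(X[E == 0, ])     # unexposed samples only
corr1     <- cor(X[E == 1, ])     # exposed samples only
diff_corr <- abs(corr1 - corr0)   # assumed form of an exposure-aware matrix

# measure_distance corresponds to the 'method' argument of dist();
# here the rows of the similarity matrix are treated as gene profiles
d_all  <- dist(corr_all, method = "euclidean")
hc_all <- hclust(d_all, method = "average")
membership_all <- cutree(hc_all, k = 2)  # fixed cut used here for simplicity
```

The fixed cut with cutree() stands in for the dynamic tree cut mentioned under minimum_cluster_size, which needs an additional package.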


This function clusters the data. Its results should then be passed to the r_prepare_data function, which outputs the X and Y matrices in the format expected by regression packages such as mgcv, caret and glmnet.


a list of length 8:


clustering results that combine the clusters accounting for the environment with the clusters ignoring it. See u_cluster_similarity for details


clustering results ignoring the environment. See u_cluster_similarity for details


vector of the exposure variable for the training set


the similarity matrix based on the argument specified in cluster_distance


the similarity matrix based on the argument specified in eclust_distance


a data.frame and data.table of the clustering membership for the combined clustering results, i.e. both the clusters accounting for the environment and the clusters ignoring it. As a result, each gene will show up twice in this table


a data.frame and data.table of the clustering membership for clustering results based on all subjects i.e. ignoring the environment. Each gene will only show up once in this table


a data.frame and data.table of the clustering membership for clustering results accounting for the environment. Each gene will only show up once in this table
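
The relationship between the three membership tables can be pictured with a small base R sketch. The column names, cluster labels and object names below are hypothetical and chosen only for illustration; the actual tables returned by the function may be laid out differently.

```r
genes <- paste0("gene", 1:6)

# One row per gene: clusters found while ignoring the environment
membership_all <- data.frame(gene = genes,
                             cluster = c(1, 1, 2, 2, 3, 3),
                             source = "all")

# One row per gene: clusters found while accounting for the environment
membership_eclust <- data.frame(gene = genes,
                                cluster = c(4, 4, 4, 5, 5, 5),
                                source = "eclust")

# Stacking both gives the combined table, so every gene appears twice
membership_addon <- rbind(membership_all, membership_eclust)
table(membership_addon$gene)
```

This is why the combined table has twice as many rows as either of the other two.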

See Also



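library(data.table)  # the with = FALSE indexing below is data.table syntax
data("tcgaov", package = "eclust")
tcgaov[1:5, 1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])  # response
E <- tcgaov[["E"]]        # binary exposure
genes <- as.matrix(tcgaov[, -c("OS", "rn", "subtype", "E", "status"), with = FALSE])
# split the samples 50/50 into training and test sets
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_along(Y), trainIndex)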

## Not run: 
cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 60)

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance and eclust_distance arguments. This will always be larger
# than cluster_res$clustersAll$nclusters which is based on the similarity matrix
# specified in the cluster_distance argument

# the number of clusters determined by the similarity matrix specified
# in the cluster_distance argument only
cluster_res$clustersAll$nclusters

## End(Not run)

