r_cluster_data {eclust} R Documentation

## Cluster data using environmental exposure

### Description

One of the functions for real data analysis. It clusters the data twice: once accounting for the environment (exposure), and once ignoring it.

### Usage

r_cluster_data(data, response, exposure, train_index, test_index,
cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
"diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore",
"corScor", "diffcorr", "difftom"), measure_distance = c("euclidean",
"maximum", "manhattan", "canberra", "binary", "minkowski"),
minimum_cluster_size = 50, ...)
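
The character-vector defaults in the usage above follow a common R idiom in which the first element acts as the default. The sketch below shows that idiom with `match.arg()`. Whether r_cluster_data resolves its arguments this way is an assumption, and `pick_distance` is a hypothetical helper that is not part of eclust.

```r
# Hypothetical helper (not part of eclust) illustrating the match.arg() idiom
pick_distance <- function(cluster_distance = c("corr", "corr0", "corr1", "tom",
                                               "tom0", "tom1", "diffcorr",
                                               "difftom", "fisherScore")) {
  # match.arg() returns the first choice when the argument is omitted,
  # and raises an error when the supplied value is not among the choices
  match.arg(cluster_distance)
}

default_choice  <- pick_distance()        # argument omitted
explicit_choice <- pick_distance("tom")   # exact match wins over "tom0"/"tom1"

# An unsupported value is rejected; the error is caught here for illustration
invalid_choice <- tryCatch(pick_distance("pearson"),
                           error = function(e) NA_character_)
```

Under this idiom, passing a single string explicitly (as the example at the end of this page does) avoids relying on the default.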


### Arguments

data

n x p matrix of data. Rows are samples; columns are genes or CpG sites. Should not contain the environment variable

response

numeric vector of length n

exposure

binary (0,1) numeric vector of length n for the exposure status of the n samples

train_index

numeric vector indicating the indices of response and the rows of data that are in the training set

test_index

numeric vector indicating the indices of response and the rows of data that are in the test set

cluster_distance

character representing which matrix from the training set you want to use to cluster the genes. Must be one of the following: corr, corr0, corr1, tom, tom0, tom1, diffcorr, difftom, corScor, tomScor, fisherScore

eclust_distance

character representing which matrix from the training set you want to use to cluster the genes based on the environment. See cluster_distance for available options. Should be different from cluster_distance, for example cluster_distance = "corr" and eclust_distance = "fisherScore". That is, one should be based on correlations ignoring the environment, and the other on correlations accounting for the environment. This function will always return this add-on

measure_distance

one of "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski", passed to the dist function to calculate the distance for the clusters based on the corr, corr1, corr0, tom, tom0, tom1 matrices

minimum_cluster_size

The minimum cluster size. Only applicable if cutMethod = 'dynamic'. This argument is passed to the cutreeDynamic function through the u_cluster_similarity function. Default is 50.

...

arguments passed to the u_cluster_similarity function
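
The sketch below illustrates, on simulated data, how matrices of this kind could be formed and how a measure_distance value feeds into `dist`. It is not eclust's implementation. The definitions of corr0, corr1 and diffcorr used here (correlations among unexposed samples, among exposed samples, and their absolute difference) are assumptions, and `cutree` with a fixed `k` stands in for cutreeDynamic.

```r
set.seed(1)

# Simulated data: n samples, p genes, binary exposure (all values are made up)
n <- 40
p <- 6
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
e <- rep(c(0, 1), each = n / 2)

# Make gene2 track gene1 among exposed samples only
x[e == 1, 2] <- x[e == 1, 1] + rnorm(sum(e == 1), sd = 0.1)

# Correlations ignoring the environment
corr_all <- cor(x)

# Assumed definitions: exposure-specific correlations and their difference
corr0 <- cor(x[e == 0, ])
corr1 <- cor(x[e == 1, ])
diffcorr <- abs(corr1 - corr0)

# Distance between rows of the similarity matrix, then hierarchical clustering
d <- dist(corr_all, method = "euclidean")
hc <- hclust(d, method = "average")

# Fixed-k cut used here as a simple stand-in for a dynamic tree cut
membership <- cutree(hc, k = 2)
```

Here `membership` assigns each gene to one cluster. Repeating the last three steps on `diffcorr` would give a second, environment-aware clustering of the same genes.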

### Details

This function clusters the data. The results of this function should then be passed to the r_prepare_data function, which outputs the appropriate X and Y matrices in the right format for regression packages such as mgcv, caret and glmnet.

### Value

a list of length 8:

clustersAddon

clustering results both accounting for and ignoring the environment. See u_cluster_similarity for details

clustersAll

clustering results ignoring the environment. See u_cluster_similarity for details

etrain

vector of the exposure variable for the training set

cluster_distance_similarity

the similarity matrix based on the argument specified in cluster_distance

eclust_distance_similarity

the similarity matrix based on the argument specified in eclust_distance

clustersAddonMembership

a data.frame and data.table of the clustering membership for clustering results both accounting for and ignoring the environment. As a result, each gene will show up twice in this table

clustersAllMembership

a data.frame and data.table of the clustering membership for clustering results based on all subjects i.e. ignoring the environment. Each gene will only show up once in this table

clustersEclustMembership

a data.frame and data.table of the clustering membership for clustering results accounting for the environment. Each gene will only show up once in this table

### See Also

u_cluster_similarity

### Examples

data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)

## Not run:
cluster_res <- r_cluster_data(data = genes,
response = Y,
exposure = E,
train_index = trainIndex,
test_index = testIndex,
cluster_distance = "tom",
eclust_distance = "difftom",
measure_distance = "euclidean",
clustMethod = "hclust",
cutMethod = "dynamic",
method = "average",
nPC = 1,
minimum_cluster_size = 60)

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance and eclust_distance arguments. This will always be larger
# than cluster_res$clustersAll$nclusters which is based on the similarity matrix
# specified in the cluster_distance argument
cluster_res$clustersAddon$nclusters

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance argument only
cluster_res$clustersAll$nclusters

## End(Not run)


[Package eclust version 0.1.0 Index]