r_cluster_data {eclust}  R Documentation 
One of the functions for real data analysis. It clusters the data in two ways: based on the environment, and ignoring the environment.
r_cluster_data(data, response, exposure, train_index, test_index,
cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
"diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore",
"corScor", "diffcorr", "difftom"), measure_distance = c("euclidean",
"maximum", "manhattan", "canberra", "binary", "minkowski"),
minimum_cluster_size = 50, ...)
data 
n x p matrix of data. Rows are samples, columns are genes or CpG sites. Should not contain the environment variable
response 
numeric vector of length n 
exposure 
binary (0,1) numeric vector of length n for the exposure status of the n samples 
train_index 
numeric vector indicating the indices of the samples in the training set
test_index 
numeric vector indicating the indices of the samples in the test set
cluster_distance 
character representing which matrix from the training set you want to use to cluster the genes. Must be one of "corr", "corr0", "corr1", "tom", "tom0", "tom1", "diffcorr", "difftom", "fisherScore"

eclust_distance 
character representing which matrix from the training set you want to use to cluster the genes based on the environment. Must be one of "fisherScore", "corScor", "diffcorr", "difftom"

measure_distance 
one of "euclidean","maximum","manhattan",
"canberra", "binary","minkowski" to be passed to 
minimum_cluster_size 
The minimum cluster size. Only applicable if

... 
arguments passed to the 
This function clusters the data. The results of this function should then be passed to the r_prepare_data function, which outputs the appropriate X and Y matrices in the right format for regression packages such as mgcv, caret and glmnet.
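The clustering step described above can be illustrated with base R alone. This is only a minimal sketch on simulated data, not the package's implementation: the toy matrix, the helper `cluster_features`, the fixed number of clusters `k`, and the choice of an absolute difference of exposure-specific correlations as the environment-based similarity are all assumptions made for illustration.

```r
# Hypothetical toy data: 40 samples, 12 features (not from the package)
set.seed(1)
n <- 40
p <- 12
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
e <- rep(c(0, 1), each = n / 2)   # binary exposure
train <- seq(1, n, by = 2)        # illustrative training indices

x_tr <- x[train, ]
e_tr <- e[train]

# Similarity ignoring the environment: correlation over all training samples
sim_all <- cor(x_tr)

# Assumed environment-based similarity: absolute difference between the
# correlation matrices of exposed and unexposed training samples
sim_env <- abs(cor(x_tr[e_tr == 1, ]) - cor(x_tr[e_tr == 0, ]))

# Hypothetical helper: hierarchical clustering of the features, using the
# distance between rows of a similarity matrix
cluster_features <- function(sim, k, measure = "euclidean") {
  d <- dist(sim, method = measure)       # same method names as measure_distance
  tree <- hclust(d, method = "average")
  cutree(tree, k = k)                    # named vector of cluster labels
}

memb_all <- cluster_features(sim_all, k = 3)
memb_env <- cluster_features(sim_env, k = 3)
table(memb_all)
table(memb_env)
```

Here the tree is cut at a fixed `k` to keep the sketch short; the example call further down uses a dynamic cut with a minimum cluster size instead.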
a list of length 8:
clustering results based on both the environment and ignoring the environment. See u_cluster_similarity for details
clustering results ignoring the environment. See u_cluster_similarity for details
vector of the exposure variable for the training set
the similarity matrix based on the argument specified in cluster_distance
the similarity matrix based on the argument specified in eclust_distance
a data.frame and data.table of the clustering membership for clustering results based on both the environment and ignoring the environment. As a result, each gene will show up twice in this table
a data.frame and data.table of the clustering membership for clustering results based on all subjects, i.e. ignoring the environment. Each gene will only show up once in this table
a data.frame and data.table of the clustering membership for clustering results accounting for the environment. Each gene will only show up once in this table
data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
# drop the non-gene columns so that only gene expression remains
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)
## Not run:
cluster_res <- r_cluster_data(data = genes,
response = Y,
exposure = E,
train_index = trainIndex,
test_index = testIndex,
cluster_distance = "tom",
eclust_distance = "difftom",
measure_distance = "euclidean",
clustMethod = "hclust",
cutMethod = "dynamic",
method = "average",
nPC = 1,
minimum_cluster_size = 60)
# the number of clusters determined by the similarity matrices specified
# in the cluster_distance and eclust_distance arguments. This will always be larger
# than cluster_res$clustersAll$nclusters which is based on the similarity matrix
# specified in the cluster_distance argument
cluster_res$clustersAddon$nclusters
# the number of clusters determined by the similarity matrices specified
# in the cluster_distance argument only
cluster_res$clustersAll$nclusters
## End(Not run)
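If caret is not available, the train_index and test_index arguments can also be built with base R. The following is a minimal sketch with a hypothetical sample size `n`; unlike caret::createDataPartition, it does not stratify the split on the response.

```r
set.seed(123)
n <- 20   # hypothetical number of samples

# draw half of the sample indices at random for the training set
train_index <- sort(sample(seq_len(n), size = floor(0.5 * n)))

# the remaining indices form the test set
test_index <- setdiff(seq_len(n), train_index)
```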