r_cluster_data {eclust} | R Documentation |
Cluster data using environmental exposure
Description
This is one of the functions for real data analysis. It clusters the data in two ways: accounting for the environment and ignoring the environment.
Usage
r_cluster_data(data, response, exposure, train_index, test_index,
cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
"diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore",
"corScor", "diffcorr", "difftom"), measure_distance = c("euclidean",
"maximum", "manhattan", "canberra", "binary", "minkowski"),
minimum_cluster_size = 50, ...)
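The vector-valued defaults in the signature follow a common R idiom: the listed strings are the permitted choices, and the first one is used when the argument is omitted. Whether r_cluster_data resolves them with match.arg() is an assumption here. The sketch below uses a hypothetical function, pick_distance, which is not part of eclust, to show the idiom with base R only.

```r
# Hypothetical helper (not part of eclust) illustrating the choice-vector idiom.
pick_distance <- function(cluster_distance = c("corr", "corr0", "corr1", "tom")) {
  # match.arg() returns the first choice when the argument is omitted
  # and raises an error for a value outside the allowed set.
  match.arg(cluster_distance)
}

default_choice  <- pick_distance()        # argument omitted
explicit_choice <- pick_distance("tom")   # one allowed value supplied
bad_choice <- tryCatch(pick_distance("pearson"),
                       error = function(e) "error")
```

If the function does follow this idiom, passing a single string such as cluster_distance = "tom", as in the Examples, selects that option explicitly.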
Arguments
data |
n x p matrix of data. Rows are samples, columns are genes or CpG sites. Should not contain the environment variable |
response |
numeric vector of length n |
exposure |
binary (0,1) numeric vector of length n for the exposure status of the n samples |
train_index |
numeric vector indicating the indices of the samples in the training set |
test_index |
numeric vector indicating the indices of the samples in the test set |
cluster_distance |
character string specifying which similarity matrix, computed from the training set, to use for clustering the genes. Must be one of the following
|
eclust_distance |
character string specifying which similarity matrix, computed from the training
set, to use for clustering the genes based on the environment. See
|
measure_distance |
one of "euclidean","maximum","manhattan",
"canberra", "binary","minkowski" to be passed to |
minimum_cluster_size |
The minimum cluster size. Only applicable if
|
... |
arguments passed to the |
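To make the argument requirements concrete, the sketch below builds toy inputs with base R, checks the dimensions implied by the descriptions above, and creates disjoint training and test indices. All object names (toy_data, toy_train and so on) are illustrative and not part of eclust. The final call uses stats::dist because the measure_distance names coincide with the method values that stats::dist accepts; whether r_cluster_data forwards them to that function, and on which matrix, is an assumption.

```r
set.seed(1)
n <- 20
p <- 6

# Toy inputs: n x p data matrix, numeric response, binary exposure
toy_data <- matrix(rnorm(n * p), nrow = n, ncol = p,
                   dimnames = list(NULL, paste0("gene", seq_len(p))))
toy_response <- rnorm(n)
toy_exposure <- rep(c(0, 1), each = n / 2)

# Consistency checks implied by the argument descriptions
stopifnot(is.matrix(toy_data),
          length(toy_response) == nrow(toy_data),
          length(toy_exposure) == nrow(toy_data),
          all(toy_exposure %in% c(0, 1)))

# Disjoint index vectors that together cover all n samples
toy_train <- sort(sample(seq_len(n), size = n / 2))
toy_test  <- setdiff(seq_len(n), toy_train)

# Pairwise distances between the p columns, using one measure_distance name
gene_dist <- stats::dist(t(toy_data[toy_train, ]), method = "euclidean")
```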
Details
This function clusters the data. The results of this function should
then be passed to the r_prepare_data function, which outputs
the appropriate X and Y matrices in the right format for regression
packages such as mgcv, caret and glmnet.
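As a rough illustration of clustering with and without the environment, the sketch below uses base R only. It is not the eclust implementation. It assumes that "corr" refers to the correlation matrix over all subjects and that "diffcorr" refers to the absolute difference between the correlation matrices of exposed and unexposed subjects, and it swaps in plain hclust with cutree for whatever clustering options the package provides. All names are illustrative.

```r
set.seed(42)
n <- 100
p <- 8
E <- rep(c(0, 1), each = n / 2)
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))

# Genes 1 to 4 share a latent factor among exposed subjects only
latent <- rnorm(n)
X[E == 1, 1:4] <- X[E == 1, 1:4] + 2 * latent[E == 1]

# Similarity ignoring the environment (assumed meaning of "corr")
corr_all <- cor(X)
# Similarity driven by the environment (assumed meaning of "diffcorr")
diff_corr <- abs(cor(X[E == 1, ]) - cor(X[E == 0, ]))

# Hierarchical clustering of the rows of a similarity matrix into k groups
cluster_genes <- function(similarity, k) {
  tree <- hclust(dist(similarity, method = "euclidean"), method = "average")
  cutree(tree, k = k)
}

clusters_all <- cluster_genes(corr_all, k = 2)   # ignoring the environment
clusters_env <- cluster_genes(diff_corr, k = 2)  # based on the environment
```

Under these assumptions, stacking clusters_all and clusters_env gives a membership table in which each gene appears twice, which mirrors the description of clustersAddonMembership in the Value section.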
Value
a list of length 8:
- clustersAddon
clustering results that combine the clusters based on the environment with the clusters ignoring the environment. See u_cluster_similarity for details
- clustersAll
clustering results ignoring the environment. See u_cluster_similarity for details
- etrain
vector of the exposure variable for the training set
- cluster_distance_similarity
the similarity matrix based on the argument specified in cluster_distance
- eclust_distance_similarity
the similarity matrix based on the argument specified in eclust_distance
- clustersAddonMembership
a data.frame and data.table of the clustering membership for the combined clustering results, i.e. both those based on the environment and those ignoring it. As a result, each gene will show up twice in this table
- clustersAllMembership
a data.frame and data.table of the clustering membership for clustering results based on all subjects i.e. ignoring the environment. Each gene will only show up once in this table
- clustersEclustMembership
a data.frame and data.table of the clustering membership for clustering results accounting for the environment. Each gene will only show up once in this table
Examples
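library(eclust)
data("tcgaov")
# preview the first five rows and six columns
tcgaov[1:5, 1:6, with = FALSE]
# response is the log of the OS column; exposure is the E column
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
# keep only the gene columns as a numeric matrix
genes <- as.matrix(tcgaov[, -c("OS", "rn", "subtype", "E", "status"), with = FALSE])
# split the samples 50/50 into training and test indices
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_along(Y), trainIndex)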
## Not run:
cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 60)
# the number of clusters determined by the similarity matrices specified
# in the cluster_distance and eclust_distance arguments. This will always be larger
# than cluster_res$clustersAll$nclusters which is based on the similarity matrix
# specified in the cluster_distance argument
cluster_res$clustersAddon$nclusters
# the number of clusters determined by the similarity matrices specified
# in the cluster_distance argument only
cluster_res$clustersAll$nclusters
## End(Not run)