R: Cluster data using environmental exposure

r_cluster_data {eclust}

R Documentation

Cluster data using environmental exposure

Description

This is one of the functions for real data analysis. It clusters the data in two ways: accounting for the environment, and ignoring the environment.

Usage

r_cluster_data(data, response, exposure, train_index, test_index,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "fisherScore"), eclust_distance = c("fisherScore",
  "corScor", "diffcorr", "difftom"), measure_distance = c("euclidean",
  "maximum", "manhattan", "canberra", "binary", "minkowski"),
  minimum_cluster_size = 50, ...)

Arguments

`data`	n x p matrix of data. Rows are samples, columns are genes or CpG sites. Should not contain the environment variable
`response`	numeric vector of length n
`exposure`	binary (0,1) numeric vector of length n for the exposure status of the n samples
`train_index`	numeric vector indicating the indices of `response` and the rows of `data` that are in the training set
`test_index`	numeric vector indicating the indices of `response` and the rows of `data` that are in the test set
`cluster_distance`	character string specifying which similarity matrix from the training set to use for clustering the genes. Must be one of the following: corr, corr0, corr1, tom, tom0, tom1, diffcorr, difftom, corScor, tomScor, fisherScore
`eclust_distance`	character string specifying which similarity matrix from the training set to use for clustering the genes based on the environment. See `cluster_distance` for available options. Should be different from `cluster_distance`, for example `cluster_distance = "corr"` and `eclust_distance = "fisherScore"`. That is, one should be based on correlations ignoring the environment, and the other should be based on correlations accounting for the environment. The function always returns this add-on clustering
`measure_distance`	one of "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski", to be passed to the `dist` function for calculating the distance for the clusters based on the corr, corr0, corr1, tom, tom0, tom1 matrices
`minimum_cluster_size`	The minimum cluster size. Only applicable if `cutMethod='dynamic'`. This argument is passed to the `cutreeDynamic` function through the `u_cluster_similarity` function. Default is 50.
`...`	arguments passed to the `u_cluster_similarity` function
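
The shapes the arguments above expect can be checked before calling the function. The following is a minimal sketch using simulated stand-in data (not package data); the dimensions, the gene names, and the 50/50 split are assumptions made for illustration only.

```r
set.seed(1)
n <- 40  # number of samples (illustrative)
p <- 10  # number of genes (illustrative)

# n x p matrix: rows are samples, columns are genes; no exposure column
data <- matrix(rnorm(n * p), nrow = n, ncol = p,
               dimnames = list(NULL, paste0("gene", seq_len(p))))
response <- rnorm(n)                          # numeric vector of length n
exposure <- rbinom(n, size = 1, prob = 0.5)   # binary (0,1) vector of length n

# disjoint index vectors into the rows of `data` and the entries of `response`
train_index <- sort(sample(seq_len(n), size = n / 2))
test_index <- setdiff(seq_len(n), train_index)

# sanity checks mirroring the argument descriptions
stopifnot(
  is.matrix(data),
  nrow(data) == length(response),
  length(exposure) == nrow(data),
  all(exposure %in% c(0, 1)),
  length(intersect(train_index, test_index)) == 0,
  all(c(train_index, test_index) %in% seq_len(n))
)
```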

Details

This function clusters the data. The results of this function should then be passed to the r_prepare_data function, which outputs the X and Y matrices in the format expected by regression packages such as mgcv, caret and glmnet.
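
To make the idea of clustering with and without the environment concrete, here is a rough base R sketch on simulated data. It is not the package's implementation: it assumes that the exposure-aware similarity is the absolute difference between the exposed and unexposed correlation matrices, and it swaps in `cutree` with a fixed number of clusters in place of the dynamic tree cut. The helper `cluster_columns` is hypothetical.

```r
set.seed(123)
n <- 60
p <- 8
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
E <- rep(c(0, 1), each = n / 2)  # binary exposure

# make gene1 and gene2 correlated among exposed samples only
X[E == 1, 2] <- X[E == 1, 1] + rnorm(sum(E == 1), sd = 0.1)

corr_all <- cor(X)            # similarity ignoring the environment
corr0 <- cor(X[E == 0, ])     # unexposed samples only
corr1 <- cor(X[E == 1, ])     # exposed samples only
diff_corr <- abs(corr1 - corr0)  # assumed exposure-aware similarity

# hypothetical helper: distance between rows of the similarity matrix,
# average-linkage hierarchical clustering, then a fixed cut into k groups
cluster_columns <- function(similarity, k) {
  d <- dist(similarity, method = "euclidean")
  tree <- hclust(d, method = "average")
  cutree(tree, k = k)
}

clusters_all <- cluster_columns(corr_all, k = 2)   # ignoring the environment
clusters_env <- cluster_columns(diff_corr, k = 2)  # accounting for it
```

The two membership vectors are named by gene, so they can be compared directly, for example with `table(clusters_all, clusters_env)`.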

Value

a list of length 8:

clustersAddon: clustering results that combine the clusters accounting for the environment with the clusters ignoring the environment. See u_cluster_similarity for details
clustersAll: clustering results ignoring the environment. See u_cluster_similarity for details
etrain: vector of the exposure variable for the training set
cluster_distance_similarity: the similarity matrix based on the argument specified in cluster_distance
eclust_distance_similarity: the similarity matrix based on the argument specified in eclust_distance
clustersAddonMembership: a data.frame and data.table of the clustering membership for the combined clustering results, i.e. both accounting for and ignoring the environment. As a result, each gene will show up twice in this table
clustersAllMembership: a data.frame and data.table of the clustering membership for clustering results based on all subjects i.e. ignoring the environment. Each gene will only show up once in this table
clustersEclustMembership: a data.frame and data.table of the clustering membership for clustering results accounting for the environment. Each gene will only show up once in this table
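
The relationship between the three membership tables can be pictured with a toy example. This is only an illustration of why each gene appears twice in the combined table and once in the others; the column names, the cluster labels, and the label offset are hypothetical and are not taken from the package.

```r
genes <- paste0("gene", 1:6)

# toy membership ignoring the environment: each gene appears once
membership_all <- data.frame(gene = genes,
                             cluster = c(1, 1, 2, 2, 3, 3),
                             source = "all")

# toy membership accounting for the environment: each gene appears once
membership_eclust <- data.frame(gene = genes,
                                cluster = c(1, 2, 1, 2, 1, 2),
                                source = "eclust")

# offset the labels so the two sets of clusters do not collide (assumption)
membership_eclust$cluster <- membership_eclust$cluster +
  max(membership_all$cluster)

# combined table: every gene shows up twice
membership_addon <- rbind(membership_all, membership_eclust)

n_clusters_all <- length(unique(membership_all$cluster))
n_clusters_addon <- length(unique(membership_addon$cluster))
```

In this toy setup the combined table has more clusters than the table that ignores the environment, because it contains those clusters plus the exposure-based ones.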

Examples

data("tcgaov")
tcgaov[1:5,1:6, with = FALSE]
Y <- log(tcgaov[["OS"]])
E <- tcgaov[["E"]]
genes <- as.matrix(tcgaov[,-c("OS","rn","subtype","E","status"),with = FALSE])
trainIndex <- drop(caret::createDataPartition(Y, p = 0.5, list = FALSE, times = 1))
testIndex <- setdiff(seq_len(length(Y)),trainIndex)

## Not run: 
cluster_res <- r_cluster_data(data = genes,
                              response = Y,
                              exposure = E,
                              train_index = trainIndex,
                              test_index = testIndex,
                              cluster_distance = "tom",
                              eclust_distance = "difftom",
                              measure_distance = "euclidean",
                              clustMethod = "hclust",
                              cutMethod = "dynamic",
                              method = "average",
                              nPC = 1,
                              minimum_cluster_size = 60)

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance and eclust_distance arguments. This will always be larger
# than cluster_res$clustersAll$nclusters which is based on the similarity matrix
# specified in the cluster_distance argument
cluster_res$clustersAddon$nclusters

# the number of clusters determined by the similarity matrices specified
# in the cluster_distance argument only
cluster_res$clustersAll$nclusters

## End(Not run)

[Package eclust version 0.1.0 Index]