select_genes {EMMIXgene}    R Documentation

Selects genes using the EMMIXgene algorithm.

Description

Follows the gene selection methodology of G. J. McLachlan, R. W. Bean, and D. Peel, "A mixture model-based approach to the clustering of microarray expression data", Bioinformatics, Volume 18, Issue 3, 1 March 2002, Pages 413–422, https://doi.org/10.1093/bioinformatics/18.3.413

Usage

select_genes(
  dat,
  filename,
  random_starts = 4,
  max_it = 100,
  ll_thresh = 8,
  min_clust_size = 8,
  tol = 1e-04,
  start_method = "both",
  three = FALSE
)

Arguments

dat

A matrix or data frame containing gene expression data. Rows are genes and columns are samples. One of filename and dat must be supplied.
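As a minimal sketch of the expected layout, the base R snippet below builds a small genes-by-samples matrix. The data, dimensions, and names (expr, gene1, sample1, and so on) are hypothetical and only illustrate the orientation; data stored with samples in rows would need to be transposed first.

```r
# Hypothetical toy data: 5 genes (rows) measured on 12 samples (columns).
set.seed(1)
n_genes <- 5
n_samples <- 12
expr <- matrix(
  rnorm(n_genes * n_samples),
  nrow = n_genes,
  ncol = n_samples,
  dimnames = list(
    paste0("gene", seq_len(n_genes)),
    paste0("sample", seq_len(n_samples))
  )
)

# If the data arrive with samples in rows, transpose to get genes in rows.
samples_in_rows <- t(expr)
expr_fixed <- t(samples_in_rows)

dim(expr_fixed)
```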

filename

Name of a file containing gene expression data, either a .csv file or a space-separated .dat file. Rows are genes and columns are samples. One of filename and dat must be supplied.

random_starts

The number of random initializations used per gene when fitting mixtures of t-distributions. Initialization uses k-means by default. Default value is 4.

max_it

The maximum number of iterations per mixture fit. Default value is 100.

ll_thresh

The threshold on the likelihood ratio statistic -2 log lambda used to choose between g=1 and g=2 components for each gene. Default value is 8, which was chosen arbitrarily in the original paper.
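To illustrate how a threshold on -2 log lambda separates g=1 from g=2, the sketch below computes the statistic for a single simulated gene in base R. It is not the package's implementation: it swaps in normal mixtures for the t-mixtures used by EMMIXgene, the data are simulated, and the helper names (loglik_g1, loglik_g2) and the stopping rule (absolute change in log-likelihood) are assumptions made for illustration.

```r
set.seed(42)
# Hypothetical expression values for one gene across 60 samples,
# drawn from two well-separated groups.
x <- c(rnorm(30, mean = -2, sd = 0.5), rnorm(30, mean = 2, sd = 0.5))

# g = 1: log-likelihood of a single normal fitted by maximum likelihood.
loglik_g1 <- function(x) {
  sigma <- sqrt(mean((x - mean(x))^2))
  sum(dnorm(x, mean(x), sigma, log = TRUE))
}

# g = 2: log-likelihood of a two-component normal mixture fitted by a basic
# EM loop. max_it and tol only mirror the argument names in Usage.
loglik_g2 <- function(x, max_it = 100, tol = 1e-4) {
  z <- as.numeric(x > median(x))  # initial hard split at the median
  ll_old <- -Inf
  ll <- -Inf
  for (i in seq_len(max_it)) {
    # M-step: update mixing proportion, means, and standard deviations.
    p <- mean(z)
    mu1 <- sum((1 - z) * x) / sum(1 - z)
    mu2 <- sum(z * x) / sum(z)
    s1 <- sqrt(sum((1 - z) * (x - mu1)^2) / sum(1 - z))
    s2 <- sqrt(sum(z * (x - mu2)^2) / sum(z))
    # E-step: update posterior probabilities of component 2.
    d1 <- (1 - p) * dnorm(x, mu1, s1)
    d2 <- p * dnorm(x, mu2, s2)
    ll <- sum(log(d1 + d2))
    z <- d2 / (d1 + d2)
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  ll
}

# -2 log lambda, where lambda is the likelihood ratio of g = 1 to g = 2.
stat <- 2 * (loglik_g2(x) - loglik_g1(x))
ll_thresh <- 8
selected_g <- if (stat > ll_thresh) 2 else 1
print(c(stat = stat, g = selected_g))
```

Because the two simulated groups are far apart, the statistic here lands well above the default threshold of 8, so the two-component fit is preferred.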

min_clust_size

The minimum number of observations per cluster used when fitting mixtures of t-distributions for each gene. Default value is 8.
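One way to read this constraint is as a check on the size of the smallest cluster for a gene. The base R sketch below shows such a check on hypothetical hard cluster labels; how the package applies the limit internally is not described here, so this is only an illustration.

```r
# Hypothetical hard cluster labels for one gene across 20 samples.
labels <- c(rep(1, 15), rep(2, 5))
min_clust_size <- 8

# Count samples per cluster and test the smallest against the limit.
sizes <- table(labels)
meets_minimum <- all(sizes >= min_clust_size)
print(sizes)
print(meets_minimum)
```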

tol

Tolerance used to detect convergence of EMMIX fits. Default value is 1e-04.

start_method

The method used to initialize the fits. Default value is "both". Use "random" for purely random starts.

three

If TRUE, also test g=2 vs g=3 where appropriate. Default value is FALSE.

Value

An EMMIXgene object containing:

stat

The difference in log-likelihood for g=1 and g=2 for each gene (or for g=2 and g=3 where relevant).

g

The selected number of components for each gene.

it

The number of iterations for each gene's selected fit.

selected

An indicator of each gene's selected status.

ranks

Selected gene IDs ranked by stat.

genes

A data frame of selected genes.

all_genes

The input data: dat, or the contents of filename.

Examples

# Only run on the first 100 genes for speed
alon_sel <- select_genes(alon_data[seq_len(100), ])
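A further sketch of a call with non-default arguments, followed by a look at the result. The argument names come from Usage; the particular values are arbitrary, and the `$` access assumes the returned EMMIXgene object can be indexed like a list, which this page does not confirm. The call is guarded so that it only runs when the package is installed.

```r
if (requireNamespace("EMMIXgene", quietly = TRUE)) {
  library(EMMIXgene)

  # Fewer random starts and a stricter threshold, on the first 50 genes.
  sel <- select_genes(
    alon_data[seq_len(50), ],
    random_starts = 2,
    ll_thresh = 10
  )

  # Assumption: components listed under Value are reachable with `$`.
  print(table(sel$g))   # number of genes per selected component count
  print(head(sel$stat)) # statistic for the first few genes
}
```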


[Package EMMIXgene version 0.1.4 Index]