R: Subset an expression matrix based on probe's feature...

MiFracData {MiDA}

R Documentation

Subset an expression matrix based on probe's feature importance

Description

This function reduces the number of rows (probes) in a gene/transcript expression matrix, keeping only those with the highest feature importance for binary classification.

Usage

MiFracData(Matrix, importance.list, NRows)

Arguments

`Matrix`	numeric matrix of expression data where each row corresponds to a probe (gene, transcript) and each column corresponds to a specimen (patient). Probe IDs must be given as matrix row names.
`importance.list`	a list of data frames containing the result of binary classification: probe IDs in the first column and each probe's feature importance (relative influence) in the second column, ordered from the most to the least important for classification. Such a list is the `Importance` element of the `MiBiClassGBODT` output.
`NRows`	integer, the number of most important probes to keep from each data frame in `importance.list`.
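
To illustrate the input layout described above, the following sketch builds a toy expression matrix and a toy importance list by hand. All object names and values are invented for illustration. The column names `var` and `rel.inf` are an assumption: this page specifies only the column order, not the names.

```r
# Toy expression matrix: 5 probes (rows) x 4 specimens (columns).
# Probe IDs are stored as row names.
toy_matrix <- matrix(
  seq_len(20) / 10,
  nrow = 5,
  dimnames = list(paste0("probe", 1:5), paste0("patient", 1:4))
)

# Toy importance list: one data frame per cross-validation model,
# probe IDs in column 1, importance in column 2, in decreasing order.
# Column names are assumed, not taken from the package.
toy_importance <- list(
  data.frame(var = c("probe3", "probe1", "probe5"), rel.inf = c(60, 30, 10)),
  data.frame(var = c("probe1", "probe4", "probe3"), rel.inf = c(55, 25, 20))
)

# Basic checks on the layout: numeric matrix with row names,
# and importance values sorted from highest to lowest in every data frame
stopifnot(is.numeric(toy_matrix), !is.null(rownames(toy_matrix)))
stopifnot(all(vapply(
  toy_importance,
  function(df) !is.unsorted(rev(df[[2]])),
  logical(1)
)))
```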

Details

The function subsets a gene expression matrix according to each probe's feature importance for binary classification, i.e., it performs feature selection. Feature selection can improve classification and help identify significant genes, because "not important" genes are excluded from the analysis. Pairwise combinations of feature selection and classification methods are described by Pirooznia et al. (2008).
The function can use several sets of feature importance data at once to subset one expression matrix. If `importance.list` contains more than one data frame (i.e., the results of binary classification for more than one model created during cross-validation), the function selects the most important probes from each data frame and then removes the repeats. Thus, the output matrix may contain more than `NRows` probes.
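
The selection logic described above can be sketched in base R. This is not the package's source code: `select_top_probes` is a hypothetical helper written only to illustrate the described behaviour, which is to take the top `NRows` probes from each data frame, drop the repeats, and return the rows in alphabetical order. The toy data are invented.

```r
# Hypothetical helper mimicking the behaviour described in Details;
# not the actual MiFracData implementation.
select_top_probes <- function(Matrix, importance.list, NRows) {
  # Top NRows probe IDs from each data frame (rows are assumed pre-sorted)
  top_ids <- lapply(importance.list, function(df) {
    as.character(head(df[[1]], NRows))
  })
  # Merge across models, remove repeats, sort alphabetically
  keep <- sort(unique(unlist(top_ids)))
  # Keep only IDs present in the matrix; drop = FALSE preserves matrix shape
  keep <- keep[keep %in% rownames(Matrix)]
  Matrix[keep, , drop = FALSE]
}

# Toy data: 4 probes, 3 specimens, importance from 2 models
m <- matrix(1:12, nrow = 4,
            dimnames = list(c("d", "a", "c", "b"), c("s1", "s2", "s3")))
imp <- list(
  data.frame(id = c("d", "a", "c", "b"), inf = c(50, 30, 15, 5)),
  data.frame(id = c("c", "d", "b", "a"), inf = c(40, 35, 20, 5))
)

res <- select_top_probes(m, imp, NRows = 2)
rownames(res)  # "a" "c" "d": three probes although NRows is 2
```

The two models share probe "d" among their top two, so the union holds three probes rather than four, and still more than `NRows`.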

Value

An expression matrix containing only the selected probes as rows, in alphabetical order, and all specimens as columns.

Author(s)

Elena N. Filatova

References

Pirooznia M., Yang J.Y., Yang M.Q., Deng Y. (2008) A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics 9 (Suppl1), S13. https://doi.org/10.1186/1471-2164-9-S1-S13

Examples

# Get gene expression and specimen data
data("IMexpression")
data("IMspecimen")
# Sample the expression matrix and specimen data for binary classification;
# only "NORM" and "EBV" specimens are kept
SampleMatrix <- MiDataSample(IMexpression, IMspecimen$diagnosis, "norm", "ebv")
dim(SampleMatrix) # 100 probes
SampleSpecimen <- MiSpecimenSample(IMspecimen$diagnosis, "norm", "ebv")
# Fit the model; low tuning for faster running
ClassRes <- MiBiClassGBODT(SampleMatrix, SampleSpecimen, n.crossval = 3,
                           ntrees = 10, shrinkage = 1, intdepth = 2)
# The list of influence data frames for all 3 models built using cross-validation
# is the 2nd element of the MiBiClassGBODT result.
# Take the 10 most important probes from each model
Sample2Matrix <- MiFracData(SampleMatrix, importance.list = ClassRes[[2]], 10)
dim(Sample2Matrix) # fewer than 100 probes left

[Package MiDA version 0.1.2 Index]