MiFracData {MiDA} | R Documentation |
Subset an expression matrix based on probe feature importance
Description
This function reduces the number of rows (probes) in a gene/transcript expression matrix, keeping only those with the highest feature importance for binary classification.
Usage
MiFracData(Matrix, importance.list, NRows)
Arguments
Matrix |
numeric matrix of expression data where each row corresponds to a probe (gene, transcript) and each column corresponds to a specimen (patient). Probe IDs must be given as the matrix row names. |
importance.list |
a list of data frames containing the results of binary classification:
probe IDs in the first column and each probe's feature importance (relative influence) in the second column,
ordered from the most important to the least important for classification.
Such list is the |
NRows |
integer specifying how many probes are to be kept in the expression matrix. |
Details
The function subsets a gene expression matrix according to each probe's feature importance for binary
classification, i.e., it performs feature selection. Feature selection can improve classification and the
identification of significant genes, because "not important" genes are excluded from the analysis.
Pairwise combinations of feature selection and classification methods are
described by Pirooznia et al. (2008).
The function can use several sets of feature importance data at once to subset one expression matrix.
If importance.list contains more than one data frame (i.e., the results of binary classification
for more than one model created during cross-validation), the function selects the NRows most important probes
from each data frame and then removes the repeats.
Thus, the output matrix may contain more than NRows probes.
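The selection rule described above can be sketched in a few lines of base R. This is an illustration only, not the MiDA source code: the function name frac_data_sketch and the toy data are hypothetical, and the sketch assumes that every data frame is already sorted from the most to the least important probe.

```r
# Hypothetical re-implementation for illustration only; not the MiDA source.
frac_data_sketch <- function(Matrix, importance.list, NRows) {
  # Take the first NRows probe IDs from each data frame
  # (each data frame is assumed to be sorted by decreasing importance).
  top.ids <- lapply(importance.list, function(df) {
    as.character(head(df[[1]], NRows))
  })
  # Pool the IDs across models, drop repeats, and sort alphabetically.
  keep <- sort(unique(unlist(top.ids)))
  # Keep only the probes that are present in the matrix.
  keep <- keep[keep %in% rownames(Matrix)]
  Matrix[keep, , drop = FALSE]
}

# Toy data: 5 probes, 3 specimens, 2 importance data frames.
toy <- matrix(1:15, nrow = 5,
              dimnames = list(c("p1", "p2", "p3", "p4", "p5"),
                              c("s1", "s2", "s3")))
imp <- list(
  data.frame(probe = c("p3", "p1", "p2", "p4", "p5"),
             rel.inf = c(40, 30, 20, 10, 0)),
  data.frame(probe = c("p5", "p3", "p4", "p1", "p2"),
             rel.inf = c(50, 25, 15, 10, 0))
)
res <- frac_data_sketch(toy, imp, NRows = 2)
rownames(res)  # "p1" "p3" "p5"
```

With NRows = 2 the two toy models contribute the probes p3, p1 and p5, p3; after the repeat is removed, three probes remain, which is why the result can have more than NRows rows.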
Value
An expression matrix containing only the selected probes as rows, in alphabetical order, and all specimens as columns.
Author(s)
Elena N. Filatova
References
Pirooznia M., Yang J.Y., Yang M.Q., Deng Y. (2008) A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics 9 (Suppl1), S13. https://doi.org/10.1186/1471-2164-9-S1-S13
See Also
Examples
# Get gene expression and specimen data
data("IMexpression")
data("IMspecimen")
# Sample the expression matrix and specimen data for binary classification;
# only "NORM" and "EBV" specimens are kept
SampleMatrix<-MiDataSample(IMexpression, IMspecimen$diagnosis,"norm", "ebv")
dim(SampleMatrix) # 100 probes
SampleSpecimen<-MiSpecimenSample(IMspecimen$diagnosis, "norm", "ebv")
# Fit the models; low tuning values are used so that the example runs faster
ClassRes<-MiBiClassGBODT(SampleMatrix, SampleSpecimen, n.crossval = 3,
                         ntrees = 10, shrinkage = 1, intdepth = 2)
# The list of influence data frames for all 3 models built using cross-validation
# is the 2nd element of the MiBiClassGBODT results
# Take the 10 most important probes from each model
Sample2Matrix<-MiFracData(SampleMatrix, importance.list = ClassRes[[2]], 10)
dim(Sample2Matrix) # fewer than 100 probes left