train.doctor {FeaLect}

R Documentation

Fits various models based on a combination of penalized linear models and logistic regression.

Description

Various linear models are fitted to the training samples using the lars method. The models differ in the number of features they use, and each is validated on the validating samples. Each feature is also assigned a score based on how often LASSO tends to include that feature in the models.
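
To illustrate how the inclusion of features across models can be summarised, here is a minimal sketch in base R. It only counts how many models include each feature; it is not the scoring formula used by the package, and the names `selected_sets`, `all_features` and `inclusion_frequency` are hypothetical.

```r
# Hypothetical input: feature sets selected by models of increasing size.
selected_sets <- list(c("f1"), c("f1", "f3"), c("f1", "f3", "f4"))
all_features <- c("f1", "f2", "f3", "f4")

# For each feature, count the number of models that include it.
inclusion_frequency <- vapply(
  all_features,
  function(f) sum(vapply(selected_sets, function(s) f %in% s, logical(1))),
  integer(1)
)

print(inclusion_frequency)
```

A feature that LASSO picks early and keeps, such as `f1` here, ends up with the highest count, while a feature that is never selected gets zero.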

Usage

train.doctor(F_, L_, training.samples, validating.samples, considered.features, 
		 maximum.features.num, balance = TRUE, return_linear.models = TRUE, 
		 report.fitting.failure = FALSE)

Arguments

`F_`	The feature matrix; each column is a feature.
`L_`	The vector of labels, named according to the rows of `F_`.
`training.samples`	The names of the rows of `F_` that should be considered as training samples.
`validating.samples`	The names of the rows of `F_` that should be considered as validating samples.
`considered.features`	The names of the columns of `F_` that determine the features of interest.
`maximum.features.num`	At most this many features are allowed to contribute to each linear model.
`balance`	If TRUE, the cases are balanced by oversampling before the linear model is fitted, so that positives and negatives are equally represented.
`return_linear.models`	If TRUE, the fitted linear models are returned. The models are memory intensive, so if there are more than 1000 of them, it may be preferable to leave them out to avoid running out of memory.
`report.fitting.failure`	If TRUE, any failure in fitting the linear or logistic models is printed.
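
The balancing described for `balance` can be sketched in base R as follows. This is only an illustration of oversampling under stated assumptions, not the package's own code: the helper `balance_by_oversampling` is hypothetical, and the labels are assumed to be coded 1 for positive and 0 for negative cases.

```r
# Hypothetical helper (not part of FeaLect): balance classes by oversampling.
balance_by_oversampling <- function(labels) {
  # labels: named vector, assumed coded 1 (positive) and 0 (negative).
  pos <- names(labels)[labels == 1]
  neg <- names(labels)[labels == 0]
  # Nothing to balance if one of the classes is empty.
  if (length(pos) == 0 || length(neg) == 0) return(names(labels))
  n <- max(length(pos), length(neg))
  # Resample a class with replacement until it has n cases.
  top_up <- function(ids) {
    extra <- n - length(ids)
    c(ids, ids[sample.int(length(ids), extra, replace = TRUE)])
  }
  c(top_up(pos), top_up(neg))
}

set.seed(1)
labels <- c(s1 = 1, s2 = 0, s3 = 0, s4 = 0, s5 = 1, s6 = 0)
balanced <- balance_by_oversampling(labels)
print(table(labels[balanced]))
```

The returned sample names contain every original case once, plus repeated minority cases, so both classes end up with the same number of entries.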

Details

See the reference for more details.

Value

Returns a list of:

`linear.models`	The result of model fitting computed by lars().
`best.number.of.features`	The number of features in the model with the best accuracy.
`probabilities`	The best computed logistic score.
`accuracy`	The best F-measure.
`best.logistic.cof`	The logistic regression coefficients of the model with the best accuracy.
`contribution.to.feature.scores`	This vector should be added to the total feature scores.
`contribution.to.feature.scores.frequency`	This vector should be added to the total frequency of features.
`training.samples`	Input, the names of the rows of `F_` that should be considered as training samples.
`validating.samples`	Input, the names of the rows of `F_` that should be considered as validating samples.
`precision`	Ratio of the number of true positives to the number of predicted positives.
`recall`	Ratio of the number of true positives to the number of real positives.
`selected.features.sequence`	A list of the sets of features selected in the different models.
`global.errors`	A vector of the global errors of the linear fits.
`features.with.best.global.error`	A vector of the names of the features that perform well in terms of the global error of the linear fits.
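
The `precision`, `recall` and F-measure entries above can be illustrated with a small base R sketch. The helper `classification_scores` is hypothetical, labels are assumed to be coded 1 for positive and 0 for negative, and the F-measure is assumed to be the balanced harmonic mean of precision and recall; the package may compute it differently.

```r
# Hypothetical helper (not part of FeaLect).
classification_scores <- function(truth, predicted) {
  tp <- sum(predicted == 1 & truth == 1)   # true positives
  predicted_pos <- sum(predicted == 1)     # predicted positives
  real_pos <- sum(truth == 1)              # real positives
  precision <- if (predicted_pos > 0) tp / predicted_pos else NA_real_
  recall <- if (real_pos > 0) tp / real_pos else NA_real_
  # Balanced F-measure: harmonic mean of precision and recall (assumed form).
  f_measure <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  list(precision = precision, recall = recall, f_measure = f_measure)
}

truth     <- c(1, 1, 1, 0, 0, 0)
predicted <- c(1, 1, 0, 1, 0, 0)
scores <- classification_scores(truth, predicted)
print(scores)
```

In this toy example two of the three predicted positives are correct and two of the three real positives are found, so precision, recall and the F-measure all equal 2/3.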

Note

In addition to fitting the linear models, logistic regression is performed.

Author(s)

Habil Zare

References

"Statistical Analysis of Overfitting Features", manuscript in preparation.

Examples

library(FeaLect)
data(mcl_sll)
F <- as.matrix(mcl_sll[ ,-1])	# The Feature matrix
L <- as.numeric(mcl_sll[ ,1])	# The labels
names(L) <- rownames(F)
message(nrow(F), " samples and ", ncol(F), " features.")

all.samples <- rownames(F)
ts <- all.samples[5:10]		# Training samples
vs <- all.samples[c(1,22)]	# Validating samples

doctor <- train.doctor(F_=F, L_=L, training.samples=ts, validating.samples=vs,
       considered.features=colnames(F), maximum.features.num=10)

[Package FeaLect version 1.20 Index]