R: Computes the scores of the features.

FeaLect {FeaLect}

R Documentation

Computes the scores of the features.

Description

Several random subsets are sampled from the input data, and for each random subset, various linear models are fitted using the lars method. A score is assigned to each feature based on the tendency of LASSO to include that feature in the models. Finally, the average score and the models are returned as the output.

Usage

FeaLect(F, L, maximum.features.num = dim(F)[2], total.num.of.models, gamma = 3/4, 
	   persistence = 1000, talk = FALSE, minimum.class.size = 2, 
	   report.fitting.failure = FALSE, return_linear.models = TRUE, balance = TRUE,
	   replace = TRUE, plot.scores = TRUE)

Arguments

`F`	The feature matrix, each column is a feature.
`L`	The vector of labels, named according to the rows of F.
`maximum.features.num`	Up to this number of features are allowed to contribute to each linear model.
`total.num.of.models`	The total number of models that are fitted.
`gamma`	A value in the range 0-1 that determines the relative size of the sample subsets.
`persistence`	Maximum number of tries for randomly choosing samples. If we try this many times and the obtained labels are all the same, we give up (maybe all the labels are the same) with the error message: " Not enough variation in the labels...".
`talk`	If TRUE, some messages are printed during the computations.
`minimum.class.size`	The size of both positive and negative classes should be greater than this threshold after sampling.
`report.fitting.failure`	If TRUE, any failure in fitting the linear or logistic models will be printed.
`return_linear.models`	The models are memory intensive, so if there are more than 1000 of them, we may decide not to return them to avoid running out of memory.
`balance`	If TRUE, the cases will be balanced by oversampling, so that positives and negatives are equal in number before the linear model is fitted.
`replace`	If TRUE, the subsets are sampled with replacement.
`plot.scores`	If TRUE, the scores are plotted on a logarithmic scale after each iteration.
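
The interplay of `gamma`, `replace`, `persistence` and `minimum.class.size` can be illustrated with a minimal base-R sketch. This is not the package's actual code: the helper `sample_subset` is hypothetical, and it assumes that `gamma` is the fraction of rows drawn for each subset, which this page does not state explicitly.

```r
# Hypothetical helper, NOT part of FeaLect: draw the row indices of one
# random subset, retrying until both classes are large enough.
sample_subset <- function(L, gamma = 3/4, replace = TRUE,
                          persistence = 1000, minimum.class.size = 2) {
  n <- length(L)
  # Assumption: gamma is the fraction of rows drawn for each subset.
  size <- max(1, floor(gamma * n))
  for (attempt in seq_len(persistence)) {
    idx <- sample.int(n, size = size, replace = replace)
    class.sizes <- table(L[idx])
    # Accept the draw only if two classes are present and each one is
    # strictly larger than the threshold.
    if (length(class.sizes) >= 2 && all(class.sizes > minimum.class.size)) {
      return(idx)
    }
  }
  # Give up after 'persistence' unsuccessful tries.
  stop("Not enough variation in the labels...")
}

set.seed(1)
L <- rep(c(0, 1), each = 10)   # toy labels, 10 cases per class
idx <- sample_subset(L)
length(idx)                    # floor(0.75 * 20) rows are drawn
table(L[idx])                  # class counts in the sampled subset
```

If all labels are identical, no draw can ever be accepted, so the sketch stops with the error after `persistence` tries.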

Details

See the reference for more details.

Value

Returns a list of:

`log.scores`	A vector containing the logarithm of final scores.
`feature.matrix`	The input feature matrix.
`labels`	The input labels.
`total.num.of.models`	The total number of models that are fitted.
`maximum.features.num`	Up to this number of features are allowed to contribute to each linear model.
`feature.scores.history`	The matrix of history of feature scores where column i contains the scores after i runs.
`num.of.features.score`	A vector, entry i contains the number of times that i has been the best number of features.
`best.feature.num`	The i'th value of this vector is the best number of features for the i'th model.
`mislabeling.record`	A vector that keeps track of the frequency of mislabelling for each case.
`doctors`	List of all models which are created by train.doctor() function.
`best.features.intersection`	Best features are computed for each sampling and their intersection is reported as this vector of feature names.
`features.with.best.global.error`	A list containing the sets of features. Set i was the best for the i'th sampling.
`time.taken`	Total time used for executing this function.
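
As a rough illustration of how the returned list might be used, the sketch below ranks features by `log.scores`. The helper `top_features` and the mock list are hypothetical, and the code assumes that `log.scores` is named by feature, which this page does not guarantee.

```r
# Hypothetical helper, NOT part of FeaLect: return the n highest log-scores.
top_features <- function(result, n = 5) {
  scores <- result$log.scores
  # Assumption: the vector is named by feature; sort() keeps the names.
  head(sort(scores, decreasing = TRUE), n)
}

# Mock object standing in for the list returned by FeaLect().
mock.result <- list(log.scores = c(f1 = -2.3, f2 = 0.7, f3 = -0.1, f4 = 1.5))
top <- top_features(mock.result, n = 2)
names(top)   # "f4" "f2"
```

With a real result, the same call would be `top_features(FeaLect.result)`.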

Note

Logistic regression is also done on top of fitting the linear models.

Author(s)

Habil Zare

References

"Statistical Analysis of Overfitting Features", manuscript in preparation.

Examples

library(FeaLect)
data(mcl_sll)
F <- as.matrix(mcl_sll[ ,-1])	# The Feature matrix
L <- as.numeric(mcl_sll[ ,1])	# The labels
names(L) <- rownames(F)
message(dim(F)[1], " samples and ",dim(F)[2], " features.")

## For this data, total.num.of.models is suggested to be at least 100.
FeaLect.result <- FeaLect(F = F, L = L, maximum.features.num = 10,
                          total.num.of.models = 20, talk = TRUE)

[Package FeaLect version 1.20 Index]