R: Permutation test for feature selection

test_features {biogram}

R Documentation

Permutation test for feature selection

Description

Performs feature selection on positioned n-gram data using Fisher's permutation test.

Usage

test_features(
  target,
  features,
  criterion = "ig",
  adjust = "BH",
  threshold = 1,
  quick = TRUE,
  times = 1e+05
)

Arguments

`target`	`integer` vector with target information (e.g. class labels).
`features`	`integer` matrix of features with number of rows equal to the length of the target vector.
`criterion`	criterion used in the permutation test. See Details for the list of possible criteria.
`adjust`	name of p-value adjustment method. See `p.adjust` for the list of possible values. If `NULL`, p-values are not adjusted.
`threshold`	`integer`. Features that occur fewer than `threshold` times or more than `nrow(features) - threshold` times are discarded from the permutation test.
`quick`	`logical`. If `TRUE`, the Quick Permutation Test (QuiPT) is used. If `FALSE`, a standard permutation test is performed.
`times`	number of times the procedure should be repeated. Ignored if `quick` is `TRUE`.
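
The filtering controlled by `threshold` can be illustrated in base R. This is a minimal sketch that follows the wording of the argument description; the toy matrix and the names `occurrences`, `keep` and `filtered` are made up for illustration, and the package's exact handling of the boundary values may differ.

```r
# Hypothetical toy data: 6 observations, 3 binary features.
features <- cbind(
  rare     = c(0L, 0L, 0L, 0L, 0L, 0L),  # never occurs
  balanced = c(1L, 0L, 1L, 0L, 1L, 0L),  # occurs 3 times
  constant = c(1L, 1L, 1L, 1L, 1L, 1L)   # occurs in every row
)
threshold <- 1L

# Number of occurrences of each feature.
occurrences <- colSums(features)

# Keep features occurring at least `threshold` times and
# at most `nrow(features) - threshold` times.
keep <- occurrences >= threshold & occurrences <= nrow(features) - threshold
filtered <- features[, keep, drop = FALSE]
colnames(filtered)
```

With `threshold = 1`, only features that are all zeros or all ones are dropped, since such features cannot separate the classes.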

Details

Since the procedure involves multiple testing, it is advisable to use one of the available p-value adjustment methods. Such methods can be selected directly through the adjust parameter.
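
To show what such an adjustment does, the sketch below applies the Benjamini-Hochberg method with `p.adjust` from the stats package. The p-values are invented for illustration and are not output of `test_features`.

```r
# Hypothetical raw p-values, one per tested feature.
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.6)

# Benjamini-Hochberg adjustment, the method named by adjust = "BH".
p_bh <- p.adjust(p_raw, method = "BH")

# Compare raw and adjusted values side by side.
cbind(raw = p_raw, BH = p_bh)

# Features that remain significant at the 0.05 level after adjustment.
which(p_bh < 0.05)
```

Adjusted p-values are never smaller than the raw ones, so borderline features can lose significance once the number of tests is taken into account.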

Available criteria:

ig: Information Gain: calc_ig.
kl: Kullback-Leibler divergence: calc_kl.
cs: Chi-squared-based measure: calc_cs.
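
To make the idea of a criterion-based permutation test concrete, here is a base R sketch using information gain. It is an illustration under stated assumptions, not the code used by the package: `entropy`, `info_gain` and `perm_test` are hypothetical helpers, entropy is measured in nats, and the p-value is the plain fraction of permutations whose criterion reaches the observed value.

```r
# Shannon entropy (in nats) of a binary vector.
entropy <- function(x) {
  p <- mean(x)
  probs <- c(p, 1 - p)
  probs <- probs[probs > 0]  # treat 0 * log(0) as 0
  -sum(probs * log(probs))
}

# Information gain: entropy of the target minus its conditional
# entropy given the feature.
info_gain <- function(target, feature) {
  cond <- 0
  for (v in unique(feature)) {
    idx <- feature == v
    cond <- cond + mean(idx) * entropy(target[idx])
  }
  entropy(target) - cond
}

# Plain permutation test: shuffle the target and measure how often the
# permuted criterion is at least as large as the observed one.
perm_test <- function(target, feature, times = 1000) {
  observed <- info_gain(target, feature)
  permuted <- replicate(times, info_gain(sample(target), feature))
  mean(permuted >= observed)
}

set.seed(1)
target  <- rep(c(1L, 0L), each = 20)
related <- target                      # feature identical to the target
noise   <- rep(c(1L, 0L), times = 20)  # feature unrelated to the target
perm_test(target, related)
perm_test(target, noise)
```

The feature that copies the target yields a very small p-value, while the alternating feature does not. Swapping `info_gain` for another criterion changes only the statistic, not the permutation scheme.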

Value

an object of class feature_test.

Note

Both target and features must be binary, i.e. contain only 0 and 1 values.
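
A quick way to check this requirement before testing is sketched below in base R. The helper `is_binary` and the toy count matrix are hypothetical; the presence/absence conversion shown here is a stand-in for illustration and is not necessarily how the package's `binarize` is implemented.

```r
# Hypothetical count matrix, e.g. counts of unpositioned n-grams.
counts <- cbind(aa = c(0, 2, 1), ab = c(3, 0, 0))

# TRUE when every entry is 0 or 1.
is_binary <- function(x) all(x %in% c(0, 1))

is_binary(counts)

# Map any positive count to 1 (presence) and keep 0 as absence.
binary_counts <- (counts > 0) * 1L

is_binary(binary_counts)
```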

Features occurring too often or too rarely are considered uninformative and may be removed using the threshold parameter.

References

Radivojac P, Obradovic Z, Dunker AK, Vucetic S, Feature selection filters based on the permutation test in Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Springer, 2004.

Examples

# significant feature
tar_feat1 <- create_feature_target(10, 390, 0, 600) 
# significant feature
tar_feat2 <- create_feature_target(9, 391, 1, 599)
# insignificant feature
tar_feat3 <- create_feature_target(198, 202, 300, 300)
test_res <- test_features(tar_feat1[, 1], cbind(tar_feat1[, 2], tar_feat2[, 2], 
                          tar_feat3[, 2]))
summary(test_res)
cut(test_res)

# real data example
# we will analyze only a subsample of a dataset to make analysis quicker
ids <- c(1L:100, 701L:800)
deg_seqs <- degenerate(human_cleave[ids, 1L:9], 
                       list(`a` = c(1, 6, 8, 10, 11, 18), 
                            `b` = c(2, 5, 13, 14, 16, 17, 19, 20), 
                            `c` = c(3, 4, 7, 9, 12, 15)))

# positioned n-grams example
bigrams_pos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = TRUE)
test_features(human_cleave[ids, 10], bigrams_pos)

# unpositioned n-grams example, binarization required
bigrams_notpos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = FALSE)
test_features(human_cleave[ids, 10], binarize(bigrams_notpos))

[Package biogram version 1.6.3 Index]