stat_filter {nestedcv}    R Documentation

Univariate statistic filter with mixed predictor datatypes

Description

Univariate statistic filter for dataframes of predictors with mixed numeric and categorical datatypes. Different statistical tests are used depending on the data types of the response vector and the predictors:

Binary class response: bin_stat_filter()

t-test for continuous data, chi-squared test for categorical data

Multiclass response: class_stat_filter()

one-way ANOVA for continuous data, chi-squared test for categorical data

Continuous response: cor_stat_filter()

correlation (or linear regression) for continuous data and binary data, one-way ANOVA for categorical data
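
As a rough illustration of the binary-response case above, the following sketch applies a t-test to numeric columns and a chi-squared test to a factor column using base R only. It is an assumption-laden sketch of the general idea, not the code used inside bin_stat_filter(); the dataset (mtcars) and the choice of columns are picked purely for demonstration.

```r
# Illustrative only: base R tests applied per predictor, not nestedcv internals
y <- factor(mtcars$am)                      # binary response
x <- data.frame(mpg = mtcars$mpg,
                wt  = mtcars$wt,
                cyl = factor(mtcars$cyl))   # one categorical predictor

pvals <- sapply(x, function(col) {
  if (is.numeric(col)) {
    t.test(col ~ y)$p.value                 # continuous predictor: t-test
  } else {
    # categorical predictor: chi-squared test on the contingency table
    suppressWarnings(chisq.test(table(col, y))$p.value)
  }
})
sort(pvals)                                 # predictors ordered by p-value
```

Ranking by p-value in this way mirrors how the filter orders its output, although the package may compute the statistics differently.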

Usage

stat_filter(y, x, ...)

bin_stat_filter(
  y,
  x,
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  type = c("index", "names", "full", "list"),
  ...
)

class_stat_filter(
  y,
  x,
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  type = c("index", "names", "full", "list"),
  ...
)

cor_stat_filter(
  y,
  x,
  cor_method = c("pearson", "spearman", "lm"),
  force_vars = NULL,
  nfilter = NULL,
  p_cutoff = 0.05,
  rsq_cutoff = NULL,
  rsq_method = "pearson",
  type = c("index", "names", "full", "list"),
  ...
)

Arguments

y

Response vector

x

Matrix or dataframe of predictors

...

Optional arguments, e.g. rsq_method; see collinear().

force_vars

Vector of column names within x which are always retained in the model (i.e. not filtered). Default NULL means no predictors are forced in, so all predictors are subject to filtering.

nfilter

Number of predictors to return. If NULL, all predictors with p-values < p_cutoff are returned.

p_cutoff

p-value cut-off. Predictors with p-values below this threshold are retained. Default 0.05.

rsq_cutoff

r^2 cutoff for removing predictors due to collinearity. Default NULL means no collinearity filtering. Predictors are ranked based on t-test. If 2 or more predictors are collinear, the first ranked predictor by t-test is retained, while the other collinear predictors are removed. See collinear().

type

Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a dataframe of statistics, "list" returns a list of 2 matrices of statistics, one for continuous predictors, one for categorical predictors.

cor_method

For cor_stat_filter() only, either "pearson", "spearman" or "lm" controlling whether continuous predictors are filtered by correlation (faster) or regression (slower but allows inclusion of covariates via force_vars).

rsq_method

Character string indicating which correlation coefficient is to be computed when filtering for collinearity. One of "pearson" (default), "kendall", or "spearman". See collinear().
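
The collinearity rule described under rsq_cutoff (keep the best-ranked predictor, drop lower-ranked predictors that are collinear with it) can be sketched in base R as follows. The helper drop_collinear() is hypothetical and written only for illustration; it is not collinear(), and details such as the default cutoff of 0.9 and the use of a non-strict comparison are assumptions.

```r
# Hypothetical helper for illustration; this is not nestedcv::collinear()
drop_collinear <- function(x, ranked, rsq_cutoff = 0.9, rsq_method = "pearson") {
  # Pairwise squared correlations between the ranked predictors
  rsq <- cor(x[, ranked, drop = FALSE], method = rsq_method)^2
  keep <- character(0)
  for (v in ranked) {
    # Retain v only if it is not collinear with any better-ranked kept predictor
    if (all(rsq[v, keep] <= rsq_cutoff)) keep <- c(keep, v)
  }
  keep
}

x <- data.frame(a = 1:10,
                b = 2 * (1:10),                        # perfectly collinear with a
                c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))
drop_collinear(x, ranked = c("a", "b", "c"))           # "b" is dropped, "a" and "c" remain
```

Here ranked stands in for the order of predictors by test p-value, so the outcome depends on which of the collinear predictors is ranked first.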

Details

stat_filter() is a wrapper which calls bin_stat_filter(), class_stat_filter() or cor_stat_filter() depending on whether y is binary, multiclass or continuous respectively. Ordered factors are converted to numeric (integer) levels and analysed as if continuous.
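
The dispatch described above might look roughly like the sketch below. The helper pick_filter() is hypothetical and only returns the name of the filter that would be chosen; the exact rule nestedcv uses to decide whether y is binary, multiclass or continuous is an assumption here and may differ.

```r
# Hypothetical sketch of the dispatch logic; not the nestedcv source
pick_filter <- function(y) {
  # Ordered factors are converted to integer levels and treated as continuous
  if (is.ordered(y)) y <- as.integer(y)
  n_levels <- length(unique(y))
  if (n_levels == 2) {
    "bin_stat_filter"        # two distinct values: binary response
  } else if (is.numeric(y)) {
    "cor_stat_filter"        # numeric with more than two values: continuous
  } else {
    "class_stat_filter"      # unordered factor or character: multiclass
  }
}

pick_filter(factor(c("case", "control", "case")))
pick_filter(iris$Species)
pick_filter(c(1.2, 3.4, 5.6))
pick_filter(factor(c("lo", "mid", "hi"), levels = c("lo", "mid", "hi"), ordered = TRUE))
```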

Value

Integer vector of indices (type = "index") or character vector of names (type = "names") of the filtered predictors, in order of test p-value. If type is "full", a dataframe of statistical results is returned. If type is "list", the results are returned as a list of 2 matrices, one for continuous predictors and one for categorical predictors.

Examples

library(mlbench)
data(BostonHousing2)
dat <- BostonHousing2
y <- dat$cmedv  ## continuous outcome
x <- subset(dat, select = -c(cmedv, medv, town))

stat_filter(y, x, type = "full")
stat_filter(y, x, nfilter = 5, type = "names")
stat_filter(y, x)

data(iris)
y <- iris$Species  ## 3 class outcome
x <- subset(iris, select = -Species)
stat_filter(y, x, type = "full")


[Package nestedcv version 0.7.9 Index]