stylest2_select_vocab {stylest2}

R Documentation

Cross-validation based term selection

Description

Uses K-fold cross-validation to choose the best cutoff on the term frequency distribution. Terms whose frequency falls below the cutoff are dropped. Cutoffs are expressed as percentages.
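The two ingredients of this procedure, a frequency cutoff and a fold assignment, can be sketched in base R. This is an illustration only, not the package's internal code. It assumes that a cutoff of `p` means "keep terms whose total frequency is at or above the `p`-th percentile", and it uses a hypothetical toy count matrix and a hypothetical helper `keep_terms()`.

```r
# Toy document-term counts: 6 documents x 5 terms (hypothetical data).
counts <- matrix(
  c(5, 0, 1, 0, 9,
    4, 1, 0, 0, 8,
    6, 0, 0, 1, 7,
    5, 2, 0, 0, 9,
    3, 0, 1, 0, 8,
    4, 1, 0, 0, 6),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("the", "of", "whale", "moor", "and"))
)

# Total frequency of each term across all documents.
term_freq <- colSums(counts)

# Hypothetical helper: keep terms at or above the chosen percentile
# of the term frequency distribution (assumed cutoff semantics).
keep_terms <- function(freq, cutoff_pct) {
  threshold <- quantile(freq, probs = cutoff_pct / 100)
  names(freq)[freq >= threshold]
}

kept <- keep_terms(term_freq, cutoff_pct = 60)

# Assign each document to one of nfold folds, as evenly as possible.
nfold <- 3
set.seed(1)
fold_id <- sample(rep(seq_len(nfold), length.out = nrow(counts)))
```

Each fold would then be held out in turn, a model fitted on the remaining documents using only the kept terms, and the held-out speakers predicted.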

Usage

stylest2_select_vocab(
  dfm,
  smoothing = 0.5,
  cutoffs = c(50, 60, 70, 80, 90, 99),
  nfold = 5,
  terms = NULL,
  term_weights = NULL,
  fill = FALSE,
  fill_weight = NULL,
  suppress_warning = TRUE
)

Arguments

`dfm`	A quanteda `dfm` object.
`smoothing`	The smoothing parameter applied to the dfm. Should be a numeric scalar. Defaults to 0.5.
`cutoffs`	A numeric vector of candidate cutoffs, expressed as percentages.
`nfold`	Number of folds for the cross-validation. Defaults to 5.
`terms`	If not `NULL`, terms to be used in the model. If `NULL`, use all terms.
`term_weights`	Named vector of distances (or any other weights), one per term in the vocabulary. Names should match the terms.
`fill`	Should missing values in term weights be filled? Defaults to FALSE.
`fill_weight`	Numeric value to fill in as weight for any term which does not have a weight specified in `term_weights`.
`suppress_warning`	Logical. Should warnings from `stylest2_fit()` be suppressed? Defaults to TRUE.
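As a rough illustration of how `term_weights`, `fill` and `fill_weight` relate, the base-R sketch below builds a named weight vector and fills in the terms that lack a weight. The vocabulary and the weights are hypothetical, and the filling step is an assumption about the intended behaviour, not the package's own implementation.

```r
# Hypothetical vocabulary and a partial set of per-term weights.
vocab <- c("the", "of", "and", "whale")
term_weights <- c(the = 0.1, and = 0.3)

# Weight to use for any term without an entry in term_weights.
fill_weight <- 0.05

# Terms in the vocabulary that have no weight specified.
missing_terms <- setdiff(vocab, names(term_weights))

# Append the fill weight for the missing terms, then reorder to match vocab.
filled <- c(
  term_weights,
  setNames(rep(fill_weight, length(missing_terms)), missing_terms)
)
filled <- filled[vocab]
```

Here `filled` has one weight per term, named and ordered like `vocab`.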

Value

A list containing: the cutoff percentage with the best speaker classification rate; the cutoff percentages that were tested; a matrix of the mean percentage of incorrectly identified speakers for each cutoff percentage and fold; and the number of folds used for cross-validation.
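To show how a best cutoff can be derived from such a matrix, the sketch below averages the miss rate over folds and picks the cutoff with the lowest mean. The numbers are made up, and the orientation (folds in rows, cutoffs in columns) is an assumption for illustration. The matrix returned by the function may be laid out differently.

```r
# Candidate cutoffs (percentages) and a toy matrix of miss rates (percent).
cutoffs <- c(50, 60, 70)
miss_pct <- matrix(
  c(20, 10, 30,
    25, 15, 20,
    15, 20, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("fold", 1:3), paste0("cutoff_", cutoffs))
)

# Mean miss rate per cutoff, averaged across folds.
mean_miss <- colMeans(miss_pct)

# The best cutoff is the one with the lowest mean miss rate.
best_cutoff <- cutoffs[which.min(mean_miss)]
```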

Examples

data(novels_dfm)
stylest2_select_vocab(dfm = novels_dfm)
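A variation on the call above, using only the arguments documented in Usage. The particular cutoff values are arbitrary choices for illustration. The `set.seed()` call assumes that folds are assigned with R's random number generator, and `str()` is used to inspect the result because the names of the list elements are not given on this page.

```r
# Run only if the stylest2 package is installed.
if (requireNamespace("stylest2", quietly = TRUE)) {
  # Load the example dfm shipped with the package.
  data("novels_dfm", package = "stylest2")

  # Assumed to make the fold assignment reproducible.
  set.seed(123)

  # Try a smaller set of candidate cutoffs with 5-fold cross-validation.
  vocab_cv <- stylest2::stylest2_select_vocab(
    dfm = novels_dfm,
    smoothing = 0.5,
    cutoffs = c(50, 75, 90),
    nfold = 5
  )

  # Inspect the structure of the returned list.
  str(vocab_cv)
}
```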

[Package stylest2 version 0.1 Index]