stylest2_select_vocab {stylest2}    R Documentation

Cross-validation based term selection

Description

Uses K-fold cross-validation to select the optimal cutoff on the term frequency distribution; terms that fall below the cutoff are dropped.
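
As a rough illustration of what a cutoff on the term frequency distribution means, the base R sketch below drops every term whose frequency falls below a given percentile. The counts are made up, keep_terms() is a hypothetical helper, and the use of quantile() with ">=" at the threshold is an assumption; none of this is taken from the package's internal code.

```r
# Toy term-frequency vector (made-up counts, for illustration only).
term_freq <- c(the = 120, of = 95, and = 80, whale = 12, parlour = 7, moor = 3)

# Hypothetical helper: keep the terms whose frequency is at or above the
# given percentile of the frequency distribution, and drop the rest.
keep_terms <- function(freq, cutoff_pct) {
  threshold <- quantile(freq, probs = cutoff_pct / 100, names = FALSE)
  names(freq)[freq >= threshold]
}

# Terms that survive a 50th-percentile cutoff.
kept <- keep_terms(term_freq, 50)
print(kept)
```

A higher cutoff keeps fewer, more frequent terms; cross-validation is then used to choose among the candidate cutoffs.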

Usage

stylest2_select_vocab(
  dfm,
  smoothing = 0.5,
  cutoffs = c(50, 60, 70, 80, 90, 99),
  nfold = 5,
  terms = NULL,
  term_weights = NULL,
  fill = FALSE,
  fill_weight = NULL,
  suppress_warning = TRUE
)

Arguments

dfm

a quanteda dfm object.

smoothing

the smoothing parameter applied to the dfm. Should be a numeric scalar; defaults to 0.5.

cutoffs

a numeric vector of candidate cutoff percentages to test.

nfold

number of folds for the cross-validation.

terms

If not NULL, terms to be used in the model. If NULL, use all terms.

term_weights

Named vector of distances (or any other weights), one per term in the vocabulary. Names should match the terms.

fill

Should missing values in term weights be filled? Defaults to FALSE.

fill_weight

Numeric value to fill in as weight for any term which does not have a weight specified in term_weights.

suppress_warning

TRUE/FALSE, indicating whether to suppress warnings from stylest2_fit().

Value

A list containing: the cutoff percentage with the best speaker classification rate; the cutoff percentages that were tested; a matrix of the mean percentage of incorrectly identified speakers for each cutoff percentage and fold; and the number of folds used for cross-validation.
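
To make the relationship between the miss-rate matrix and the best cutoff concrete, the base R sketch below averages made-up miss rates across folds and picks the cutoff with the lowest mean. The numbers, the fold-by-cutoff orientation of the matrix, and the tie-breaking behaviour of which.min() are all assumptions for illustration, not the package's documented layout.

```r
# Candidate cutoffs (a shortened, illustrative set).
cutoffs <- c(50, 70, 90)

# Made-up miss rates (percent of speakers misidentified).
# Assumed layout: one row per fold, one column per cutoff.
miss <- matrix(
  c(20, 10, 30,
    22, 12, 28),
  nrow = 2, byrow = TRUE,
  dimnames = list(paste0("fold", 1:2), as.character(cutoffs))
)

# Average the miss rate over folds, then take the cutoff with the lowest mean.
mean_miss <- colMeans(miss)
best_cutoff <- cutoffs[which.min(mean_miss)]
print(best_cutoff)
```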

Examples

data(novels_dfm)
stylest2_select_vocab(dfm=novels_dfm)
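
A slightly fuller call might look like the sketch below, which sets the cutoffs and number of folds explicitly using only the arguments listed under Usage. The seed is there on the assumption that fold assignment is random, and str() is used to inspect the result rather than assuming the names of its elements.

```r
# Run only if the package is installed, so the sketch is safe to source.
if (requireNamespace("stylest2", quietly = TRUE)) {
  data("novels_dfm", package = "stylest2")

  # Assumption: folds are assigned at random, so fix the seed for repeatability.
  set.seed(123)

  cv <- stylest2::stylest2_select_vocab(
    dfm = novels_dfm,
    smoothing = 0.5,
    cutoffs = c(50, 75, 99),  # fewer candidates than the default
    nfold = 3
  )

  # Inspect the structure of the returned list.
  str(cv)
}
```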


[Package stylest2 version 0.1 Index]