stylest2_select_vocab {stylest2} | R Documentation |
Cross-validation based term selection
Description
Performs K-fold cross-validation to determine the optimal cutoff on the term frequency distribution below which terms are dropped.
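To make the idea of a percentile cutoff concrete, here is a minimal base R sketch on made-up counts. It is not the package's implementation: the toy matrix, the helper `keep_terms()`, and the rule of keeping terms at or above the percentile are all assumptions for illustration.

```r
# Toy document-term counts: 4 documents (rows) x 6 terms (columns).
# All values are invented for illustration.
counts <- matrix(
  c( 9, 4, 1, 0, 2, 7,
     8, 5, 0, 1, 3, 6,
     7, 3, 2, 0, 1, 9,
    10, 6, 0, 0, 2, 8),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("doc", 1:4),
    c("the", "of", "whale", "moor", "said", "and")
  )
)

# Total frequency of each term across all documents.
term_freq <- colSums(counts)

# Hypothetical helper (assumed rule): keep the terms whose total frequency
# is at or above the given percentile of the term frequency distribution.
keep_terms <- function(freq, cutoff_pct) {
  threshold <- quantile(freq, probs = cutoff_pct / 100, names = FALSE)
  names(freq)[freq >= threshold]
}

kept_50 <- keep_terms(term_freq, 50)
kept_90 <- keep_terms(term_freq, 90)

print(kept_50)
print(kept_90)
```

A higher cutoff keeps fewer, more frequent terms. Cross-validation then compares how well each candidate vocabulary identifies speakers on held-out documents.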
Usage
stylest2_select_vocab(
dfm,
smoothing = 0.5,
cutoffs = c(50, 60, 70, 80, 90, 99),
nfold = 5,
terms = NULL,
term_weights = NULL,
fill = FALSE,
fill_weight = NULL,
suppress_warning = TRUE
)
Arguments
dfm |
a quanteda dfm object. |
smoothing |
the smoothing parameter applied to the dfm. Must be a numeric scalar; defaults to 0.5. |
cutoffs |
a numeric vector of candidate cutoff percentages. Defaults to c(50, 60, 70, 80, 90, 99). |
nfold |
number of folds for the cross-validation. Defaults to 5. |
terms |
If not |
term_weights |
Named vector of distances (or any other weights), one per term in the vocabulary. Names should correspond to the terms. |
fill |
Should missing values in term weights be filled? Defaults to FALSE. |
fill_weight |
Numeric value to use as the weight for any term that does not have a weight specified in term_weights. |
suppress_warning |
TRUE/FALSE, indicate whether to suppress warnings from
|
Value
A list containing: the cutoff percentage with the best speaker classification rate; the cutoff percentages that were tested; a matrix of the mean percentage of incorrectly identified speakers for each cutoff percentage and fold; and the number of folds used for cross-validation.
Examples
data(novels_dfm)
stylest2_select_vocab(dfm = novels_dfm)
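The call above relies on the defaults. As a hedged variant, the sketch below passes a shorter list of cutoffs and fewer folds, using only arguments listed under Usage. The specific values, the use of `set.seed()` (on the assumption that fold assignment is random), and inspecting the result with `str()` rather than by element name are choices made for illustration. The `requireNamespace()` guard simply skips the call when stylest2 is not installed.

```r
# Candidate cutoffs and fold count chosen for illustration only.
my_cutoffs <- c(50, 75, 90)
my_nfold <- 3

if (requireNamespace("stylest2", quietly = TRUE)) {
  library(stylest2)
  data(novels_dfm)

  # Assumption: folds are assigned at random, so fix the seed for repeatability.
  set.seed(1)

  cv_result <- stylest2_select_vocab(
    dfm = novels_dfm,
    cutoffs = my_cutoffs,
    nfold = my_nfold
  )

  # Inspect the returned list without assuming its element names.
  str(cv_result)
}
```

Fewer folds and fewer candidate cutoffs should shorten the run, at the cost of a coarser search and a noisier estimate of the misclassification rate.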