| mlr_pipeops_textvectorizer {mlr3pipelines} | R Documentation |
Bag-of-word Representation of Character Features
Description
Computes a bag-of-word representation from a (set of) columns.
Columns of type character are split up into words.
Uses quanteda::dfm() and quanteda::dfm_trim() from the 'quanteda' package.
TF-IDF computation works similarly to quanteda::dfm_tfidf(),
but has been adjusted for train/test data splits using quanteda::docfreq()
and quanteda::dfm_weight().
In short:
- Per default, produces a bag-of-words representation.
- If n is set to values > 1, ngrams are computed.
- If df_trim parameters are set, the bag-of-words is trimmed.
- The scheme_tf parameter controls term-frequency (per-document, i.e. per-row) weighting.
- The scheme_df parameter controls the document-frequency (per token, i.e. per-column) weighting.
Parameters specify arguments to quanteda's dfm, dfm_trim, docfreq and dfm_weight.
Which function a parameter belongs to can be read from the parameter's tags; for example,
parameters tagged tokenizer are arguments passed on to quanteda::dfm().
Defaults to a bag-of-words representation with token counts as matrix entries.
In order to perform the default dfm_tfidf weighting, set the scheme_df parameter to "inverse".
The scheme_df parameter is initialized to "unary", which disables document frequency weighting.
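As a minimal sketch of the two settings described above, the following compares the default count output with the TF-IDF-like output obtained by setting scheme_df to "inverse". The toy texts are made up for illustration, and stopword filtering is switched off so that no token is removed.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical toy corpus: three distinct documents, repeated to match the 150 rows of iris.
dt = data.table(
  txt = rep(c("apple banana apple", "banana cherry", "cherry cherry date"), 50)
)
task = tsk("iris")$cbind(dt)

# Default: scheme_df = "unary", so matrix entries are plain token counts.
po_counts = po("textvectorizer", param_vals = list(stopwords_language = "none"))
counts = po_counts$train(list(task))[[1]]$data()

# scheme_df = "inverse" switches on inverse-document-frequency weighting.
po_tfidf = po("textvectorizer",
  param_vals = list(stopwords_language = "none", scheme_df = "inverse"))
tfidf = po_tfidf$train(list(task))[[1]]$data()

# Both results have the same columns; only the cell values differ.
head(counts, 3)
head(tfidf, 3)
```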
The pipeop works as follows:
- Words are tokenized using quanteda::tokens.
- Ngrams are computed using quanteda::tokens_ngrams.
- A document-feature matrix is computed using quanteda::dfm.
- The document-feature matrix is trimmed using quanteda::dfm_trim during train-time.
- The document-feature matrix is re-weighted (similar to quanteda::dfm_tfidf) if scheme_df is not set to "unary".
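The steps above can be sketched directly with quanteda. This is only an illustration on made-up documents, not the PipeOp's actual code: it uses quanteda::dfm_tfidf() for the last step, which computes document frequencies on the data it is given and therefore lacks the train/test adjustment described earlier.

```r
library("quanteda")

# Hypothetical example documents.
docs = c(d1 = "the cat sat on the mat", d2 = "the dog sat", d3 = "a cat and a dog")

toks = tokens(docs, what = "word")         # step 1: tokenize
toks = tokens_ngrams(toks, n = 1:2)        # step 2: unigrams and bigrams
mat = dfm(toks)                            # step 3: build the matrix
trimmed = dfm_trim(mat, min_termfreq = 2)  # step 4: drop tokens occurring fewer than 2 times

# step 5: re-weight; dfm_tfidf() is the single-data-set analogue of the
# docfreq() / dfm_weight() combination used by the PipeOp.
weighted = dfm_tfidf(trimmed, scheme_tf = "count", scheme_df = "inverse")

weighted
```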
Format
R6Class object inheriting from PipeOpTaskPreproc/PipeOp.
Construction
PipeOpTextVectorizer$new(id = "textvectorizer", param_vals = list())
- id :: character(1)
Identifier of resulting object, default "textvectorizer".
- param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
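A short sketch of both construction routes follows. The id "bow" and the chosen parameter values are arbitrary and only serve as an illustration.

```r
library("mlr3pipelines")

# Direct construction via the R6 generator ...
op1 = PipeOpTextVectorizer$new(
  id = "bow",
  param_vals = list(stopwords_language = "en", remove_punct = TRUE)
)

# ... or via the po() shortcut, which forwards id and param_vals to the constructor.
op2 = po("textvectorizer", id = "bow",
  param_vals = list(stopwords_language = "en", remove_punct = TRUE))

# Hyperparameters can still be changed after construction.
op1$param_set$values$scheme_df = "inverse"
op1$param_set$values[c("stopwords_language", "remove_punct", "scheme_df")]
```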
Input and Output Channels
Input and output channels are inherited from PipeOpTaskPreproc.
The output is the input Task with all affected features converted to a bag-of-words
representation.
State
The $state is a list with element 'cols': A vector of extracted columns.
Parameters
The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:
- return_type :: character(1)
Whether to return an integer representation ("integer_sequence"), a factor representation ("factor_sequence") or a bag-of-words ("bow"). If set to "integer_sequence", tokens are replaced by an integer and padded/truncated to sequence_length. If set to "factor_sequence", tokens are replaced by a factor and padded/truncated to sequence_length. If set to "bow", a possibly weighted bag-of-words matrix is returned. Defaults to "bow".
- stopwords_language :: character(1)
Language to use for stopword filtering. Needs to be either "none", a language identifier listed in stopwords::stopwords_getlanguages("snowball") ("de", "en", ...) or "smart". "none" disables language-specific stopwords. "smart" corresponds to stopwords::stopwords(source = "smart"), which contains English stopwords and also removes one-character strings. Initialized to "smart".
- extra_stopwords :: character
Extra stopwords to remove. Must be a character vector containing individual tokens to remove. Initialized to character(0). When n is set to values greater than 1, this can also contain stop-ngrams.
- tolower :: logical(1)
Convert to lower case? See quanteda::dfm. Default: TRUE.
- stem :: logical(1)
Perform stemming? See quanteda::dfm. Default: FALSE.
- what :: character(1)
Tokenization splitter. See quanteda::tokens. Default: "word".
- remove_punct :: logical(1)
See quanteda::tokens. Default: FALSE.
- remove_url :: logical(1)
See quanteda::tokens. Default: FALSE.
- remove_symbols :: logical(1)
See quanteda::tokens. Default: FALSE.
- remove_numbers :: logical(1)
See quanteda::tokens. Default: FALSE.
- remove_separators :: logical(1)
See quanteda::tokens. Default: TRUE.
- split_hyphens :: logical(1)
See quanteda::tokens. Default: FALSE.
- n :: integer
Vector of ngram lengths. See quanteda::tokens_ngrams. Initialized to 1, deviating from the base function's default. Note that this can be a vector of multiple values, to construct ngrams of multiple orders.
- skip :: integer
Vector of skips. See quanteda::tokens_ngrams. Default: 0. Note that this can be a vector of multiple values.
- sparsity :: numeric(1)
Desired sparsity of the 'tfm' matrix. See quanteda::dfm_trim. Default: NULL.
- max_termfreq :: numeric(1)
Maximum term frequency in the 'tfm' matrix. See quanteda::dfm_trim. Default: NULL.
- min_termfreq :: numeric(1)
Minimum term frequency in the 'tfm' matrix. See quanteda::dfm_trim. Default: NULL.
- termfreq_type :: character(1)
How to assess term frequency. See quanteda::dfm_trim. Default: "count".
- scheme_df :: character(1)
Weighting scheme for document frequency. See quanteda::docfreq. Initialized to "unary" (1 for each document, deviating from the base function's default).
- smoothing_df :: numeric(1)
See quanteda::docfreq. Default: 0.
- k_df :: numeric(1)
k parameter given to quanteda::docfreq (see there). Default: 0.
- threshold_df :: numeric(1)
See quanteda::docfreq. Default: 0. Only considered for scheme_df = "count".
- base_df :: numeric(1)
The base for logarithms in quanteda::docfreq (see there). Default: 10.
- scheme_tf :: character(1)
Weighting scheme for term frequency. See quanteda::dfm_weight. Default: "count".
- k_tf :: numeric(1)
k parameter given to quanteda::dfm_weight (see there). Default: 0.5.
- base_tf :: numeric(1)
The base for logarithms in quanteda::dfm_weight (see there). Default: 10.
- sequence_length :: integer(1)
The length of the integer sequence. Defaults to Inf, i.e. all texts are padded to the length of the longest text. Only relevant for return_type = "integer_sequence".
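To illustrate the n parameter, the following sketch trains one PipeOp with unigrams only and one with unigrams and bigrams, then lists the features that only the second one produces. The texts are made up, and stopword filtering is disabled so that all tokens survive.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical toy corpus with 150 rows to match iris.
dt = data.table(txt = rep(c("red green blue", "green blue red"), 75))
task = tsk("iris")$cbind(dt)

# n is initialized to 1 (unigrams only); n = 1:2 adds bigrams.
uni = po("textvectorizer", param_vals = list(stopwords_language = "none"))
uni_bi = po("textvectorizer", param_vals = list(stopwords_language = "none", n = 1:2))

feats_uni = uni$train(list(task))[[1]]$feature_names
feats_uni_bi = uni_bi$train(list(task))[[1]]$feature_names

# Features that exist only because bigrams were requested.
setdiff(feats_uni_bi, feats_uni)
```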
Internals
See Description. Internally uses the quanteda package. Calls quanteda::tokens, quanteda::tokens_ngrams and quanteda::dfm. During training,
quanteda::dfm_trim is also called. Tokens not seen during training are dropped during prediction.
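The handling of unseen tokens can be checked with a small sketch. The texts are hypothetical: the token "zebra" occurs only in the rows used for prediction, so it should not show up as a feature in the prediction output.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Rows 1 to 100 contain "banana"; rows 101 to 150 contain "zebra" instead.
dt = data.table(txt = c(rep("apple banana", 100), rep("apple zebra", 50)))
task = tsk("iris")$cbind(dt)

# Clone before filtering, because $filter() modifies a task in place.
train_task = task$clone()$filter(1:100)
test_task = task$clone()$filter(101:150)

op = po("textvectorizer", param_vals = list(stopwords_language = "none"))
train_out = op$train(list(train_task))[[1]]
test_out = op$predict(list(test_task))[[1]]

# The prediction output has the training-time features; "zebra" is dropped.
train_out$feature_names
test_out$feature_names
```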
Methods
Only methods inherited from PipeOpTaskPreproc/PipeOp.
See Also
https://mlr-org.com/pipeops.html
Other PipeOps:
PipeOp,
PipeOpEnsemble,
PipeOpImpute,
PipeOpTargetTrafo,
PipeOpTaskPreproc,
PipeOpTaskPreprocSimple,
mlr_pipeops,
mlr_pipeops_boxcox,
mlr_pipeops_branch,
mlr_pipeops_chunk,
mlr_pipeops_classbalancing,
mlr_pipeops_classifavg,
mlr_pipeops_classweights,
mlr_pipeops_colapply,
mlr_pipeops_collapsefactors,
mlr_pipeops_colroles,
mlr_pipeops_copy,
mlr_pipeops_datefeatures,
mlr_pipeops_encode,
mlr_pipeops_encodeimpact,
mlr_pipeops_encodelmer,
mlr_pipeops_featureunion,
mlr_pipeops_filter,
mlr_pipeops_fixfactors,
mlr_pipeops_histbin,
mlr_pipeops_ica,
mlr_pipeops_imputeconstant,
mlr_pipeops_imputehist,
mlr_pipeops_imputelearner,
mlr_pipeops_imputemean,
mlr_pipeops_imputemedian,
mlr_pipeops_imputemode,
mlr_pipeops_imputeoor,
mlr_pipeops_imputesample,
mlr_pipeops_kernelpca,
mlr_pipeops_learner,
mlr_pipeops_missind,
mlr_pipeops_modelmatrix,
mlr_pipeops_multiplicityexply,
mlr_pipeops_multiplicityimply,
mlr_pipeops_mutate,
mlr_pipeops_nmf,
mlr_pipeops_nop,
mlr_pipeops_ovrsplit,
mlr_pipeops_ovrunite,
mlr_pipeops_pca,
mlr_pipeops_proxy,
mlr_pipeops_quantilebin,
mlr_pipeops_randomprojection,
mlr_pipeops_randomresponse,
mlr_pipeops_regravg,
mlr_pipeops_removeconstants,
mlr_pipeops_renamecolumns,
mlr_pipeops_replicate,
mlr_pipeops_scale,
mlr_pipeops_scalemaxabs,
mlr_pipeops_scalerange,
mlr_pipeops_select,
mlr_pipeops_smote,
mlr_pipeops_spatialsign,
mlr_pipeops_subsample,
mlr_pipeops_targetinvert,
mlr_pipeops_targetmutate,
mlr_pipeops_targettrafoscalerange,
mlr_pipeops_threshold,
mlr_pipeops_tunethreshold,
mlr_pipeops_unbranch,
mlr_pipeops_updatetarget,
mlr_pipeops_vtreat,
mlr_pipeops_yeojohnson
Examples
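library("mlr3")
library("mlr3pipelines")
library("data.table")

# create some text data: 150 documents of three random letters each
dt = data.table(
  txt = replicate(150, paste0(sample(letters, 3), collapse = " "))
)
task = tsk("iris")$cbind(dt)

# train: the character column is converted to bag-of-words columns
pos = po("textvectorizer", param_vals = list(stopwords_language = "en"))
pos$train(list(task))[[1]]$data()

# predict on a single row; note that $filter() modifies the task in place
one_line_of_iris = task$filter(13)
one_line_of_iris$data()
pos$predict(list(one_line_of_iris))[[1]]$data()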