mlr_pipeops_textvectorizer {mlr3pipelines}    R Documentation

Bag-of-Words Representation of Character Features

Description

Computes a bag-of-words representation from one or more columns. Columns of type character are split up into words. Uses quanteda::dfm() and quanteda::dfm_trim() from the 'quanteda' package. TF-IDF computation works similarly to quanteda::dfm_tfidf() but has been adjusted for the train/test data split using quanteda::docfreq() and quanteda::dfm_weight().

In short:

Parameters specify arguments to quanteda's dfm, dfm_trim, docfreq and dfm_weight. Which function a parameter belongs to can be read from its tags; for example, parameters tagged tokenizer are arguments passed on to quanteda::dfm(). The default is a bag-of-words representation with token counts as matrix entries.

In order to perform the default dfm_tfidf weighting, set the scheme_df parameter to "inverse". The scheme_df parameter is initialized to "unary", which disables document frequency weighting.
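As a minimal sketch of switching on inverse document frequency weighting, the snippet below builds a small regression task with a single character column and sets scheme_df to "inverse". The toy data, the column names y and txt, and the object names are assumptions made for illustration only.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical toy corpus: three short documents in one character column.
dt = data.table(
  y = c(1, 2, 3),
  txt = c("red apple", "green apple", "red red wine")
)
task = as_task_regr(dt, target = "y")

# scheme_df = "inverse" enables document frequency weighting;
# the initial value "unary" would leave the raw token counts untouched.
po_tfidf = po("textvectorizer", param_vals = list(scheme_df = "inverse"))

out = po_tfidf$train(list(task))[[1]]
out$data()
```

The weights are computed from the training data, so terms that occur in many training documents receive smaller values than terms that occur in few.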

The pipeop works as follows:

  1. Words are tokenized using quanteda::tokens.

  2. N-grams are computed using quanteda::tokens_ngrams.

  3. A document-feature matrix is computed using quanteda::dfm.

  4. The document-feature matrix is trimmed using quanteda::dfm_trim during train-time.

  5. The document-feature matrix is re-weighted (similar to quanteda::dfm_tfidf) if scheme_df is not set to "unary".
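The steps above can be sketched directly with quanteda. This is an illustration under stated assumptions, not the pipeop's actual implementation: the toy texts, the trimming threshold, and the use of quanteda::dfm_match to align new data with the training vocabulary are choices made for this example.

```r
library("quanteda")

# Hypothetical training documents.
train_txt = c(d1 = "red apple pie", d2 = "green apple", d3 = "red red wine")

# 1. Tokenize.
toks = tokens(train_txt)
# 2. Compute n-grams; n = 1 keeps plain unigrams.
toks = tokens_ngrams(toks, n = 1)
# 3. Build the document-feature matrix of token counts.
m = dfm(toks)
# 4. Trim rare features (training only); the threshold is arbitrary.
m = dfm_trim(m, min_termfreq = 2)
# 5. Re-weight counts by the inverse document frequency of the training data.
idf = docfreq(m, scheme = "inverse")
m_tfidf = dfm_weight(m, weights = idf)

# New data: restrict to the training vocabulary and reuse the training
# weights (assumed here as one way to respect the train/test split).
new_m = dfm_match(dfm(tokens(c(n1 = "red banana"))), featnames(m))
new_tfidf = dfm_weight(new_m, weights = idf)
```

Computing the document frequencies once on the training data and reusing them for new data keeps the feature scaling identical between training and prediction, which calling quanteda::dfm_tfidf separately on each data set would not.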

Format

R6Class object inheriting from PipeOpTaskPreproc/PipeOp.

Construction

PipeOpTextVectorizer$new(id = "textvectorizer", param_vals = list())

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc.

The output is the input Task with all affected features converted to a bag-of-words representation.

State

The $state is a list with element 'cols': A vector of extracted columns.

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

sequence_length :: integer(1)
The length of the integer sequence. Defaults to Inf, i.e. all texts are padded to the length of the longest text. Only relevant if return_type is "integer_sequence".

Internals

See Description. Internally uses the quanteda package. Calls quanteda::tokens, quanteda::tokens_ngrams and quanteda::dfm. During training, quanteda::dfm_trim is also called. Tokens not seen during training are dropped during prediction.
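The handling of unseen tokens can be checked with a small sketch. The toy data and the object names below are assumptions for illustration; the token "banana" appears only in the prediction data.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical training and prediction data with the same column layout.
train_dt = data.table(y = c(1, 2, 3), txt = c("red apple", "green apple", "red wine"))
new_dt = data.table(y = 4, txt = "red banana")

train_task = as_task_regr(train_dt, target = "y")
new_task = as_task_regr(new_dt, target = "y")

pov = po("textvectorizer")
train_out = pov$train(list(train_task))[[1]]
pred_out = pov$predict(list(new_task))[[1]]

# The prediction output uses the vocabulary learned during training,
# so no column is created for the unseen token "banana".
setequal(train_out$feature_names, pred_out$feature_names)
pred_out$data()
```

Keeping the feature columns identical between training and prediction is what allows a downstream learner to consume the prediction output without further alignment.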

Methods

Only methods inherited from PipeOpTaskPreproc/PipeOp.

See Also

https://mlr-org.com/pipeops.html

Other PipeOps: PipeOp, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_threshold, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson

Examples



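library("mlr3")
library("mlr3pipelines")
library("data.table")

# Create some text data: 150 rows, each with three random letters as "words".
dt = data.table(
  txt = replicate(150, paste0(sample(letters, 3), collapse = " "))
)
task = tsk("iris")$cbind(dt)

# Vectorize the text column, removing English stopwords.
pos = po("textvectorizer", param_vals = list(stopwords_language = "en"))

# Training learns the vocabulary and returns the bag-of-words features.
pos$train(list(task))[[1]]$data()

# Clone before filtering, since $filter() modifies the task in place.
one_line_of_iris = task$clone()$filter(13)

one_line_of_iris$data()

# Prediction maps the new row onto the vocabulary learned during training.
pos$predict(list(one_line_of_iris))[[1]]$data()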



[Package mlr3pipelines version 0.5.2 Index]