types {koRpus} | R Documentation |
Get types and tokens of a given text
Description
These methods return character vectors containing all types or tokens of a given text,
where the text can be a character vector itself, a previously tokenized/tagged koRpus object,
or an object of class kRp.TTR.
Usage
types(txt, ...)
tokens(txt, ...)
## S4 method for signature 'kRp.TTR'
types(txt, stats = FALSE)
## S4 method for signature 'kRp.TTR'
tokens(txt)
## S4 method for signature 'kRp.text'
types(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  stats = FALSE
)
## S4 method for signature 'kRp.text'
tokens(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c()
)
## S4 method for signature 'character'
types(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  stats = FALSE,
  lang = NULL
)
## S4 method for signature 'character'
tokens(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  lang = NULL
)
Arguments
txt |
An object of either class kRp.text or kRp.TTR, or a character vector. |
... |
Only used for the method generic. |
stats |
Logical, whether statistics on the length in characters and frequency of types in the text should also be returned. |
case.sens |
Logical, whether types should be counted case sensitive. This option is available for tagged text and character input only. |
lemmatize |
Logical, whether analysis should be carried out on the lemmatized tokens rather than all running word forms. This option is available for tagged text and character input only. |
corp.rm.class |
A character vector with word classes which should be dropped. The default value is "nonpunct". |
corp.rm.tag |
A character vector with POS tags which should be dropped. This option is available for tagged text and character input only. |
lang |
Set the language of a text,
see the |
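The effect of the case.sens argument can be checked with a small comparison. This is only a sketch: it assumes that koRpus and koRpus.lang.en are installed, and it reuses the sample file from the Examples section. The object names types.default and types.case are chosen here for illustration.

```r
# sketch only: runs if the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(txt=sample_file, lang="en")
  # default: types are collected without regard to case
  types.default <- types(tokenized.obj)
  # case sensitive: differently capitalized forms count as separate types
  types.case <- types(tokenized.obj, case.sens=TRUE)
  # compare the number of types found by each call
  print(length(types.default))
  print(length(types.case))
} else {}
```

If the text contains words in more than one capitalization, the case sensitive call should report at least as many types as the default call.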
Value
A character vector. For types and stats=TRUE, a data.frame containing all types,
their length (characters) and frequency. The types result is always sorted by frequency,
with more frequent types coming first.
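The difference between the two return values of types can be inspected as follows. This is a minimal sketch, assuming koRpus and koRpus.lang.en are installed; the column names of the data.frame are not assumed here, so the code only prints its structure. The object names type.vec and type.stats are illustrative.

```r
# sketch only: runs if the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(txt=sample_file, lang="en")
  # without stats: a plain character vector of types
  type.vec <- types(tokenized.obj)
  # with stats: a data.frame with length and frequency for each type
  type.stats <- types(tokenized.obj, stats=TRUE)
  # inspect the structure and the most frequent types
  str(type.stats)
  print(head(type.stats))
} else {}
```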
Note
If the input is of class kRp.TTR, the result will only be useful if lex.div or
the respective wrapper function was called with keep.tokens=TRUE. Similarly,
lemmatize can only work properly if the input is a tagged text object with lemmata
or you've properly set up the environment via set.kRp.env.
Calling these methods on kRp.TTR objects simply returns the respective part of its tt slot.
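The kRp.TTR case described in this note might look like the sketch below. It assumes koRpus and koRpus.lang.en are installed. Only keep.tokens=TRUE is taken from this page; the other lex.div arguments used here (measure, char, quiet) are assumptions about its signature and should be checked against the lex.div documentation. The object name ttr.obj is illustrative.

```r
# sketch only: runs if the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(txt=sample_file, lang="en")
  # keep.tokens=TRUE asks lex.div to store types and tokens in the result;
  # measure, char and quiet are assumed arguments, see ?lex.div
  ttr.obj <- lex.div(
    tokenized.obj,
    measure="TTR",
    char=c(),
    keep.tokens=TRUE,
    quiet=TRUE
  )
  # both calls read from the tt slot of the kRp.TTR object
  print(head(tokens(ttr.obj)))
  print(head(types(ttr.obj)))
} else {}
```

Without keep.tokens=TRUE, the same two calls would have nothing useful to return, which is the situation the note warns about.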
See Also
kRp.POS.tags, kRp.text, kRp.TTR, lex.div
Examples
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # tokenize the sample text shipped with koRpus
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )

  # all types and all tokens of the tokenized text
  types(tokenized.obj)
  tokens(tokenized.obj)
} else {}