convert.lemma {wordspace} | R Documentation
Transform CWB/Penn-Style Lemmas into Other Notation Formats (wordspace)
Description
Transform POS-disambiguated lemma strings in CWB/Penn format (see Details) into several other notation formats.
Usage
convert.lemma(lemma, format=c("CWB", "BNC", "DM", "HW", "HWLC"), hw.tolower=FALSE)
Arguments
lemma
a character vector specifying one or more POS-disambiguated lemmas in CWB/Penn notation
format
the notation format to be generated (see Details)
hw.tolower
if TRUE, convert the headword part to lowercase, regardless of output format
Details
Input strings must be POS-disambiguated lemmas in CWB/Penn notation, i.e. in the form
<headword>_<P>
where <headword> is a dictionary headword (which may be case-sensitive) and <P> is a one-letter code specifying the simple part of speech. Standard POS codes are
N ... nouns
Z ... proper nouns
V ... lexical and auxiliary verbs
J ... adjectives
R ... adverbs
I ... prepositions (including all uses of "to")
D ... determiners
. ... punctuation
For other parts of speech, the first character of the corresponding Penn tag may be used. Note that these codes are not standardised and are only useful for distinguishing between content words and function words.
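The input notation can be illustrated with a small base R sketch that splits lemma strings into headword and POS code. The helper split_cwb_lemma, the regular expression and the choice of N, Z, V, J and R as content-word codes are assumptions made for illustration; none of them are part of wordspace.

```r
# Hypothetical helper (not part of wordspace): split lemmas of the form
# <headword>_<P> into their headword and one-character POS code.
split_cwb_lemma <- function(lemma) {
  lemma <- as.character(lemma)
  # Assumed pattern: non-empty headword, underscore, exactly one final character
  ok <- grepl("^.+_.$", lemma)
  if (!all(ok)) {
    stop("not in <headword>_<P> form: ", paste(lemma[!ok], collapse = ", "))
  }
  n <- nchar(lemma)
  data.frame(
    headword = substr(lemma, 1, n - 2),  # everything before the final "_<P>"
    pos = substr(lemma, n, n),           # the last character
    stringsAsFactors = FALSE
  )
}

parts <- split_cwb_lemma(c("light_N", "Paris_Z", "of_I", "the_D"))

# Assumed set of content-word codes; all other codes count as function words
content_codes <- c("N", "Z", "V", "J", "R")
parts$is_content <- parts$pos %in% content_codes
print(parts)
```

Because only the last two characters are stripped, headwords that themselves contain an underscore are kept intact in this sketch.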
The following output formats are supported:
CWB - returns input strings without modifications, but validates that they are in CWB/Penn format
BNC - BNC-style POS-disambiguated lemmas based on the simplified CLAWS tagset. The headword part of the lemma is unconditionally converted to lowercase. The standard POS codes listed above are translated into SUBST (nouns and proper nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), ART (determiners), PREP (prepositions), and STOP (punctuation). Other POS codes have no direct CLAWS equivalents and are mapped to UNC (unclassified), so the transformation should only be used for the categories listed above.
DM - POS-disambiguated lemmas in the format used by Distributional Memory (Baroni & Lenci 2010), viz. <headword>-<p> with POS code in lowercase and headword in its original capitalisation. For example, light_N will be mapped to light-n.
HW - just the undisambiguated headword
HWLC - undisambiguated headword mapped to lowercase (same as HW with hw.tolower=TRUE)
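The format rules above can be summarised in a short base R sketch. This is an illustrative re-implementation, not the wordspace source code: the function name sketch_convert, the validity check and the underscore separator used for the BNC output are assumptions, since the text above does not specify them.

```r
# Illustrative sketch only; NOT the implementation used by wordspace.
sketch_convert <- function(lemma,
                           format = c("CWB", "BNC", "DM", "HW", "HWLC"),
                           hw.tolower = FALSE) {
  format <- match.arg(format)
  lemma <- as.character(lemma)
  stopifnot(all(grepl("^.+_.$", lemma)))  # assumed check for <headword>_<P>
  n <- nchar(lemma)
  hw <- substr(lemma, 1, n - 2)
  pos <- substr(lemma, n, n)
  # BNC and HWLC always lowercase the headword; other formats only on request
  if (hw.tolower || format %in% c("BNC", "HWLC")) hw <- tolower(hw)
  if (format == "BNC") {
    # Mapping of the standard POS codes to simplified CLAWS tags
    claws <- c(N = "SUBST", Z = "SUBST", V = "VERB", J = "ADJ", R = "ADV",
               D = "ART", I = "PREP", "." = "STOP")
    tag <- unname(claws[pos])
    tag[is.na(tag)] <- "UNC"  # all other codes are unclassified
  }
  switch(format,
    CWB = paste(hw, pos, sep = "_"),
    BNC = paste(hw, tag, sep = "_"),  # separator is an assumption
    DM = paste(hw, tolower(pos), sep = "-"),
    HW = hw,
    HWLC = hw
  )
}

x <- c("Light_N", "walk_V", "of_I", "oh_U")  # "U" has no CLAWS equivalent
sketch_convert(x, "DM")
sketch_convert(x, "BNC")
sketch_convert(x, "HW", hw.tolower = TRUE)
```

In this sketch the last call gives the same result as format "HWLC", mirroring the equivalence stated in the list above.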
Value
A character vector of the same length as lemma, containing the transformed lemmas.
See Details above for the different output formats.
Author(s)
Stephanie Evert (https://purl.org/stephanie.evert)
References
Baroni, Marco and Lenci, Alessandro (2010). Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–712.
Examples
convert.lemma(RG65$word1, "CWB") # original format
convert.lemma(RG65$word1, "BNC") # BNC-style (simple CLAWS tags)
convert.lemma(RG65$word1, "DM") # as in Distributional Memory
convert.lemma(RG65$word1, "HW") # just the headword