VSS {corpora}                                                R Documentation

A small corpus of very short stories with linguistic annotations


This data set contains a small corpus (8043 tokens) of short stories from the collection Very Short Stories (VSS, see http://www.schtepf.de/History/pages/stories.html). The text was automatically segmented (tokenised) and annotated with part-of-speech tags (from the Penn tagset) and lemmas (base forms), using the IMS TreeTagger (Schmid 1994) and a custom lemmatizer.




A data set with 8043 rows corresponding to tokens and the following columns:


the word form (or surface form) of the token


the part-of-speech tag of the token (Penn tagset)


the lemma (or base form) of the token


number of the sentence in which the token occurs (integer)


title of the story to which the token belongs (factor)
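As a quick check of the format described above, the following sketch loads the data set and inspects its structure. It assumes that the corpora package is installed and that the object is stored as a data frame; no column names are assumed, they are simply printed.

```r
# Load the data set from the corpora package (assumed to be installed)
data("VSS", package = "corpora")

# Dimensions: the description above states 8043 rows (one per token)
dim(VSS)

# Column names and storage types as they appear in the object
str(VSS)

# First few tokens together with their annotations
head(VSS)
```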


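The tags follow the Penn tagset in the variant used by the TreeTagger, which writes NP, NPS, PP and PP$ where the original Penn Treebank tagset has NNP, NNPS, PRP and PRP$: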

CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NP    Proper noun, singular
NPS   Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PP    Personal pronoun
PP$   Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb
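To see how these tags are distributed in the corpus, a frequency table can be built from the part-of-speech column. The sketch below is only an illustration: it assumes that the corpora package is installed and that the columns of the data frame are stored in the order listed above, so that the part-of-speech tag is the second column. The names tags and tag_freq are introduced here for the example.

```r
# Load the data set from the corpora package (assumed to be installed)
data("VSS", package = "corpora")

# Assumption: the part-of-speech tag is the second column, as in the listing
# above; selecting by position avoids relying on a particular column name
tags <- VSS[[2]]

# Frequency of each tag, most frequent first
tag_freq <- sort(table(tags), decreasing = TRUE)
head(tag_freq, 10)

# Proportion of tokens tagged as a common or proper noun
mean(tags %in% c("NN", "NNS", "NP", "NPS"))
```

If the columns turn out to be stored in a different order, names(VSS) shows which one holds the tags.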


Stephanie Evert (https://purl.org/stephanie.evert)


Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), pages 44-49.

[Package corpora version 0.5-1 Index]