data_corpus_EPcoaldebate {quanteda.textmodels} | R Documentation |
Crowd-labelled sentence corpus from a 2010 EP debate on coal subsidies
Description
A multilingual text corpus of speeches from a European Parliament debate on coal subsidies in 2010, with individual crowd codings as the unit of observation. The sentences are drawn from officially translated speeches from a European Parliament debate on a Commission report proposing an extension to a regulation permitting state aid to uncompetitive coal mines.
Each speech is available in six languages: English, German, Greek, Italian, Polish and Spanish. The unit of observation is the individual crowd coding of each natural sentence. For more information on the coding approach see Benoit et al. (2016).
Usage
data_corpus_EPcoaldebate
Format
The corpus consists of 16,806 documents (i.e. codings of a sentence) and includes the following document-level variables:
- sentence_id
character; a unique identifier for each sentence
- crowd_subsidy_label
factor; whether a coder labelled the sentence as "Pro-Subsidy", "Anti-Subsidy" or "Neutral or inapplicable"
- language
factor; the language (translation) of the speech
- name_last
character; speaker's last name
- name_first
character; speaker's first name
- ep_group
factor; abbreviation of the EP party group of the speaker
- country
factor; the speaker's country of origin
- vote
factor; the speaker's vote on the proposal (For/Against/Abstain/NA)
- coder_id
character; a unique identifier for each crowd coder
- coder_trust
numeric; the "trust score" from the CrowdFlower platform used to code the sentences, which can theoretically range between 0 and 1. Only coders with trust scores above 0.8 are included in the corpus.
A corpus object.
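Because each document is a single crowd coding rather than a sentence, most analyses first need to collapse the codings to one label per sentence. The following is a minimal sketch of one way to do that with a simple majority vote, using the document-level variables listed above. The helper names majority_label() and sentence_majority() are hypothetical and not part of the package. The sketch groups by both sentence_id and language on the assumption that a sentence identifier may recur across translations. The majority rule is only an illustration and is not necessarily the aggregation used by Benoit et al. (2016).

```r
# Hypothetical helper: most frequent label among a set of crowd codings.
# Ties are resolved in favour of the first factor level.
majority_label <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Hypothetical helper: one row per sentence and language, with the
# majority label. `dv` is a data frame holding the document-level
# variables sentence_id, language and crowd_subsidy_label.
sentence_majority <- function(dv) {
  aggregate(crowd_subsidy_label ~ sentence_id + language,
            data = dv, FUN = majority_label)
}

# Apply the helpers to the corpus when the packages are installed.
if (requireNamespace("quanteda", quietly = TRUE) &&
    requireNamespace("quanteda.textmodels", quietly = TRUE)) {
  corp <- quanteda.textmodels::data_corpus_EPcoaldebate
  dv <- quanteda::docvars(corp)

  print(table(dv$language))            # number of codings per language
  print(head(sentence_majority(dv)))   # first few sentence-level labels
}
```

Rows with a missing crowd_subsidy_label are dropped by aggregate() before the vote. To weight codings by coder_trust instead of counting them equally, the tabulation step in majority_label() would need to be replaced.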
References
Benoit, K., Conway, D., Lauderdale, B.E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data. American Political Science Review, 110(2), 278–295. doi:10.1017/S0003055416000058