conText {conText} | R Documentation |
Embedding regression
Description
Estimates an embedding regression model, with options to bootstrap confidence intervals and to run a permutation test for inference (see https://github.com/prodriguezsosa/conText for details).
Usage
conText(
formula,
data,
pre_trained,
transform = TRUE,
transform_matrix,
bootstrap = TRUE,
num_bootstraps = 100,
confidence_level = 0.95,
stratify = FALSE,
permute = TRUE,
num_permutations = 100,
window = 6L,
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
hard_cut = FALSE,
verbose = TRUE
)
Arguments
formula |
a symbolic description of the model to be fitted with a target word as a DV e.g.
|
data |
a quanteda |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
transform |
(logical) if TRUE (default), apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings. |
transform_matrix |
(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings. |
bootstrap |
(logical) if TRUE, use bootstrapping – sample from texts with replacement and re-run regression on each sample. Required to get std. errors. |
num_bootstraps |
(numeric) number of bootstraps to use (at least 100) |
confidence_level |
(numeric in (0,1)) confidence level e.g. 0.95 |
stratify |
(logical) if TRUE, stratify by discrete covariates when bootstrapping. |
permute |
(logical) if TRUE, compute empirical p-values using permutation test |
num_permutations |
(numeric) number of permutations to use |
window |
the number of context words to be displayed around the keyword |
valuetype |
the type of pattern matching: |
case_insensitive |
logical; if |
hard_cut |
(logical) - if TRUE then a context must have |
verbose |
(logical) - if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided. |
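The roles of pre_trained, transform and transform_matrix can be illustrated with a minimal base R sketch. It does not call the package and is not its implementation: the embedding values, the feature names, and the diagonal transformation matrix are all made up for illustration, and the sketch simply assumes that the 'a la carte' step multiplies the averaged context embedding by the D x D matrix.

```r
# Toy pre-trained embeddings: F = 4 features, D = 3 dimensions (made-up values)
pre_trained <- matrix(
  c(1, 0, 0,
    0, 1, 0,
    0, 0, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("border", "policy", "reform", "law"), NULL)
)

# Hypothetical D x D transformation matrix (a real one would be estimated)
transform_matrix <- diag(c(2, 1, 0.5))

# Context words around one instance of the target word;
# "visa" has no pre-trained embedding in this toy example
context <- c("border", "policy", "visa")

# Keep only the context words that have a pre-trained embedding
known <- intersect(context, rownames(pre_trained))

# Analogue of transform = FALSE: plain average of the context embeddings
avg_embedding <- colMeans(pre_trained[known, , drop = FALSE])

# Analogue of transform = TRUE: apply the transformation to the average
alc_embedding <- as.vector(transform_matrix %*% avg_embedding)
```

A context whose words all lack pre-trained embeddings would leave `known` empty, which is presumably the situation the verbose option reports on.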
Value
a conText-class object - a D x M matrix with D = dimensions of the pre-trained feature embeddings provided and M = number of covariates including the intercept. These represent the estimated regression coefficients. These can be combined to compute ALC embeddings for different combinations of covariates. The object also includes various informative attributes, importantly a data.frame with the following columns:
coefficient
(character) name of (covariate) coefficient.
value
(numeric) norm of the corresponding beta coefficient.
std.error
(numeric) (if bootstrap = TRUE) std. error of the norm of the beta coefficient.
lower.ci
(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci
(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
p.value
(numeric) (if permute = TRUE) empirical p-value of the norm of the coefficient.
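How the value, std.error, lower.ci, upper.ci and p.value columns relate to one another can be sketched in base R on simulated data. This is an illustration under stated assumptions, not the package's code: the data are random, the helper beta_norm is hypothetical, the interval shown is a simple percentile interval (the package may construct its interval differently), and the permutation step assumes that the rows of the outcome matrix are shuffled relative to the covariate.

```r
set.seed(1)
n <- 200                                  # number of target-word instances
D <- 5                                    # embedding dimensions
x <- rbinom(n, 1, 0.5)                    # toy binary covariate
Y <- matrix(rnorm(n * D), n, D)           # toy instance embeddings
Y[x == 1, 1] <- Y[x == 1, 1] + 1          # shift group 1 along dimension 1

# Hypothetical helper: Euclidean norm of the covariate's coefficient vector
# from a multivariate OLS fit of the embeddings on the covariate
beta_norm <- function(Y, x) {
  B <- coef(lm(Y ~ x))                    # 2 x D matrix: "(Intercept)" and "x"
  sqrt(sum(B["x", ]^2))
}

observed <- beta_norm(Y, x)               # analogue of the 'value' column

# Bootstrap: resample instances with replacement and refit each time
boot <- replicate(100, {
  idx <- sample(n, replace = TRUE)
  beta_norm(Y[idx, , drop = FALSE], x[idx])
})
std_error <- sd(boot)                     # analogue of 'std.error'
ci <- quantile(boot, c(0.025, 0.975))     # percentile interval (assumption)

# Permutation test: break the link between embeddings and covariate
perm <- replicate(100, beta_norm(Y[sample(n), , drop = FALSE], x))
p_value <- mean(perm >= observed)         # analogue of 'p.value'
```

Because the norm is non-negative by construction, the permutation test compares the observed norm against the norms obtained when no relationship exists, rather than against zero.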
Examples
library(quanteda)
library(conText)  # provides conText() and the cr_* example data
# tokenize corpus
toks <- tokens(cr_sample_corpus)
## given the target word "immigration"
set.seed(2021L)
model1 <- conText(formula = immigration ~ party + gender,
                  data = toks,
                  pre_trained = cr_glove_subset,
                  transform = TRUE, transform_matrix = cr_transform,
                  bootstrap = TRUE,
                  num_bootstraps = 100,
                  confidence_level = 0.95,
                  stratify = FALSE,
                  permute = TRUE, num_permutations = 10,
                  window = 6, case_insensitive = TRUE,
                  verbose = FALSE)
# notice, character/factor covariates are automatically "dummified"
rownames(model1)
# the beta coefficient 'partyR' in this case captures the shift in the alc
# embedding of "immigration" for Republican party speeches relative to the
# baseline group; add it to the intercept to get that group's alc embedding
# (normed) coefficient table
model1@normed_coefficients
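The Value section notes that coefficients can be combined to compute ALC embeddings for different combinations of covariates. The sketch below shows the arithmetic on a toy stand-in rather than on model1 itself: the matrix coefs, its values, and the row name "genderM" are made up, and it assumes one row per coefficient, as rownames(model1) suggests.

```r
# Toy stand-in for a fitted coefficient matrix: one row per coefficient,
# D = 3 columns. Values and the row name "genderM" are assumptions.
coefs <- rbind(
  "(Intercept)" = c(0.5, 0.1, -0.2),
  "partyR"      = c(0.2, -0.3, 0.1),
  "genderM"     = c(-0.1, 0.0, 0.4)
)

# Baseline group (all indicator variables equal to 0): the intercept alone
baseline <- coefs["(Intercept)", ]

# Republican, baseline gender: intercept plus the partyR shift
republican <- coefs["(Intercept)", ] + coefs["partyR", ]

# Republican and male: add both shifts to the intercept
republican_male <- republican + coefs["genderM", ]

# Cosine similarity between two group embeddings
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cosine(baseline, republican)
```

With a real fitted model, the same indexing would be applied to the model's own row names, which depend on the factor levels in the data.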