test_WEAT {PsychWordVec}

R Documentation

Word Embedding Association Test (WEAT) and Single-Category WEAT.

Description

Tabulate data (cosine similarity and standardized effect size) and conduct the permutation test of significance for the Word Embedding Association Test (WEAT) and Single-Category Word Embedding Association Test (SC-WEAT).

For WEAT, a two-sample permutation test is conducted (i.e., rearrangements of data).
For SC-WEAT, a one-sample permutation test is conducted (i.e., rearrangements of +/- signs of data).
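
The two kinds of permutation test described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's internal code: the input values are made-up per-word scores (each standing for a target word's mean cosine similarity to A1 minus its mean cosine similarity to A2), and the helper names `perm_two_sample()` and `perm_one_sample()` are hypothetical.

```r
# Hypothetical per-word scores (NOT real embedding results).
set.seed(1)
diff_T1 <- c(0.12, 0.09, 0.15, 0.11)
diff_T2 <- c(-0.05, 0.01, -0.08, -0.02)

# Two-sample permutation test (WEAT): rearrange which scores belong to
# which target group and recompute the difference in means each time.
perm_two_sample <- function(x, y, nsim = 10000) {
  obs <- mean(x) - mean(y)
  pooled <- c(x, y)
  sims <- replicate(nsim, {
    idx <- sample(length(pooled), length(x))
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  list(stat = obs,
       p.one = mean(sims > obs),            # one-sided
       p.two = mean(abs(sims) > abs(obs)))  # two-sided
}

# One-sample permutation test (SC-WEAT): randomly flip the +/- sign of
# each score and recompute the mean each time.
perm_one_sample <- function(x, nsim = 10000) {
  obs <- mean(x)
  sims <- replicate(nsim, {
    signs <- sample(c(-1, 1), length(x), replace = TRUE)
    mean(x * signs)
  })
  list(stat = obs,
       p.one = mean(sims > obs),
       p.two = mean(abs(sims) > abs(obs)))
}

res_weat <- perm_two_sample(diff_T1, diff_T2)
res_sc   <- perm_one_sample(diff_T1)

# Number of distinct ways to split 8 scores into two groups of 4.
n_perm <- choose(length(diff_T1) + length(diff_T2), length(diff_T1))
```

With only 70 possible regroupings in this toy case, an exact test would enumerate them all; the sketch resamples instead to keep the code short.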

Usage

test_WEAT(
  data,
  T1,
  T2,
  A1,
  A2,
  use.pattern = FALSE,
  labels = list(),
  p.perm = TRUE,
  p.nsim = 10000,
  p.side = 2,
  seed = NULL,
  pooled.sd = "Caliskan"
)

Arguments

`data`	A `wordvec` (data.table) or `embed` (matrix), see `data_wordvec_load`.
`T1`, `T2`	Target words (a vector of words or a regular expression pattern). If only `T1` is specified, data are tabulated for the single-category WEAT (SC-WEAT).
`A1`, `A2`	Attribute words (a vector of words or a regular expression pattern). Both must be specified.
`use.pattern`	Defaults to `FALSE` (using a vector of words). If you use regular expressions in `T1`, `T2`, `A1`, and `A2`, set this argument to `TRUE`.
`labels`	Labels for target and attribute concepts (a named `list`), such as (the default) `list(T1="Target1", T2="Target2", A1="Attrib1", A2="Attrib2")`.
`p.perm`	Whether to conduct a permutation test to obtain the exact or approximate p value of the overall effect. Defaults to `TRUE`. See also the `sweater` package.
`p.nsim`	Number of samples for resampling in the permutation test. Defaults to `10000`. If `p.nsim` is larger than the number of all possible permutations (rearrangements of data), then it will be ignored and an exact permutation test will be conducted. Otherwise (in most cases for real data and always for SC-WEAT), a resampling test is performed, which takes much less computation time and produces an approximate p value (comparable to the exact one).
`p.side`	One-sided (`1`) or two-sided (`2`) p value. Defaults to `2`. Caliskan et al. (2017) reported one-sided p values for WEAT. Here, I suggest reporting the two-sided p value as a more conservative estimate. Users take full responsibility for the choice. The one-sided p value is calculated as the proportion of sampled permutations where the difference in means is greater than the test statistic. The two-sided p value is calculated as the proportion of sampled permutations where the absolute difference is greater than the test statistic.
`seed`	Random seed for reproducible results of the permutation test. Defaults to `NULL`.
`pooled.sd`	Method used to calculate the pooled SD for the effect size estimate in WEAT. Defaults to `"Caliskan"`: `sd(data.diff$cos_sim_diff)`, which is strongly recommended and identical to Caliskan et al.'s (2017) original approach. If specified otherwise, the pooled SD is calculated as: `\sqrt{[(n_1 - 1) * \sigma_1^2 + (n_2 - 1) * \sigma_2^2] / (n_1 + n_2 - 2)}`. This is NOT recommended because it may overestimate the effect size, especially when there are only a few T1 and T2 words that have small variances.
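
To make the two `pooled.sd` options concrete, here is a small base-R sketch comparing them. It is an illustration under assumptions rather than the package's own code: the scores are made-up values standing in for `data.diff$cos_sim_diff`, assumed here to hold one differential cosine similarity per target word, and all variable names are hypothetical.

```r
# Hypothetical differential cosine similarities (A1 minus A2),
# one value per target word. NOT real embedding results.
diff_T1 <- c(0.12, 0.09, 0.15, 0.11)    # T1 words
diff_T2 <- c(-0.05, 0.01, -0.08, -0.02) # T2 words

# Raw effect: difference between the two target groups' means.
raw_effect <- mean(diff_T1) - mean(diff_T2)

# "Caliskan" approach: SD across all target words taken together.
sd_caliskan <- sd(c(diff_T1, diff_T2))

# Alternative: pooled within-group SD, following the formula above.
n1 <- length(diff_T1)
n2 <- length(diff_T2)
sd_pooled <- sqrt(((n1 - 1) * var(diff_T1) + (n2 - 1) * var(diff_T2)) /
                    (n1 + n2 - 2))

# Standardized effect sizes under each denominator.
d_caliskan <- raw_effect / sd_caliskan
d_pooled   <- raw_effect / sd_pooled

print(c(d_caliskan = d_caliskan, d_pooled = d_pooled))
```

In this toy case the two groups are well separated and each has little internal spread, so the within-group pooled SD is much smaller than the SD across all words and the resulting effect size is considerably larger. This is the overestimation the argument description warns about.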

Value

A list object of the new class `weat`:

words.valid: Valid (actually matched) words
words.not.found: Words not found
data.raw: A data.table of cosine similarities between all word pairs
data.mean: A data.table of mean cosine similarities across all attribute words
data.diff: A data.table of differential mean cosine similarities between the two attribute concepts
eff.label: Description for the difference between the two attribute concepts
eff.type: Effect type: WEAT or SC-WEAT
eff: Raw effect, standardized effect size, and p value (if p.perm=TRUE)

Download

Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf

References

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.

Examples

## cc() is more convenient than c()!

weat = test_WEAT(
  demodata,
  labels=list(T1="King", T2="Queen", A1="Male", A2="Female"),
  T1=cc("king, King"),
  T2=cc("queen, Queen"),
  A1=cc("male, man, boy, brother, he, him, his, son"),
  A2=cc("female, woman, girl, sister, she, her, hers, daughter"),
  seed=1)
weat

sc_weat = test_WEAT(
  demodata,
  labels=list(T1="Occupation", A1="Male", A2="Female"),
  T1=cc("
    architect, boss, leader, engineer, CEO, officer, manager,
    lawyer, scientist, doctor, psychologist, investigator,
    consultant, programmer, teacher, clerk, counselor,
    salesperson, therapist, psychotherapist, nurse"),
  A1=cc("male, man, boy, brother, he, him, his, son"),
  A2=cc("female, woman, girl, sister, she, her, hers, daughter"),
  seed=1)
sc_weat

## Not run: 

## the same as the first example, but using regular expression
weat = test_WEAT(
  demodata,
  labels=list(T1="King", T2="Queen", A1="Male", A2="Female"),
  use.pattern=TRUE,  # use regular expression below
  T1="^[kK]ing$",
  T2="^[qQ]ueen$",
  A1="^male$|^man$|^boy$|^brother$|^he$|^him$|^his$|^son$",
  A2="^female$|^woman$|^girl$|^sister$|^she$|^her$|^hers$|^daughter$",
  seed=1)
weat

## replicating Caliskan et al.'s (2017) results
## WEAT7 (Table 1): d = 1.06, p = .018
## (requiring installation of the `sweater` package)
Caliskan.WEAT7 = test_WEAT(
  as_wordvec(sweater::glove_math),
  labels=list(T1="Math", T2="Arts", A1="Male", A2="Female"),
  T1=cc("math, algebra, geometry, calculus, equations, computation, numbers, addition"),
  T2=cc("poetry, art, dance, literature, novel, symphony, drama, sculpture"),
  A1=cc("male, man, boy, brother, he, him, his, son"),
  A2=cc("female, woman, girl, sister, she, her, hers, daughter"),
  p.side=1, seed=1234)
Caliskan.WEAT7
# d = 1.055, p = .0173 (= 173 counts / 10000 permutation samples)

## replicating Caliskan et al.'s (2017) supplemental results
## WEAT7 (Table S1): d = 0.97, p = .027
Caliskan.WEAT7.supp = test_WEAT(
  demodata,
  labels=list(T1="Math", T2="Arts", A1="Male", A2="Female"),
  T1=cc("math, algebra, geometry, calculus, equations, computation, numbers, addition"),
  T2=cc("poetry, art, dance, literature, novel, symphony, drama, sculpture"),
  A1=cc("male, man, boy, brother, he, him, his, son"),
  A2=cc("female, woman, girl, sister, she, her, hers, daughter"),
  p.side=1, seed=1234)
Caliskan.WEAT7.supp
# d = 0.966, p = .0221 (= 221 counts / 10000 permutation samples)

## End(Not run)

[Package PsychWordVec version 2023.9 Index]