R: Find the Top-N most similar words.

most_similar {PsychWordVec}

R Documentation

Find the Top-N most similar words.

Description

Find the Top-N most similar words, replicating the results of the `most_similar()` function in the Python gensim module. (Exact replication of gensim requires the same word vectors data, not the `demodata` used in the examples here.)
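
As a rough illustration of what "most similar" means here, the sketch below ranks words by cosine similarity using base R only. The toy matrix `emb` and the helper `top_similar()` are assumptions made for this example; they are not part of PsychWordVec and may differ from how the package computes its results.

```r
# Toy embedding matrix: one row per word (values are made up for illustration)
emb <- rbind(
  king  = c(0.9, 0.8, 0.1),
  queen = c(0.8, 0.9, 0.2),
  man   = c(0.9, 0.1, 0.1),
  woman = c(0.8, 0.2, 0.2),
  apple = c(0.1, 0.1, 0.9)
)

# Scale each row to unit length so that dot products equal cosine similarities
unit <- emb / sqrt(rowSums(emb^2))

# Hypothetical helper (not part of PsychWordVec): Top-N neighbours of one word
top_similar <- function(unit, word, topn = 10) {
  sims <- drop(unit %*% unit[word, ])   # cosine similarity to every word
  sims <- sims[names(sims) != word]     # drop the query word itself
  head(sort(sims, decreasing = TRUE), topn)
}

top_similar(unit, "king", topn = 2)
```

With these toy values, `queen` and `woman` rank as the two nearest neighbours of `king`.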

Usage

most_similar(
  data,
  x = NULL,
  topn = 10,
  above = NULL,
  keep = FALSE,
  row.id = TRUE,
  verbose = TRUE
)

Arguments

`data`	A `wordvec` (data.table) or `embed` (matrix), see `data_wordvec_load`.
`x`	Can be:
  - `NULL`: use the sum of all word vectors in `data`
  - a single word: `"China"`
  - a list of words: `c("king", "queen")` or `cc(" king , queen ; man | woman")`
  - an R formula (`~ xxx`) specifying words that positively and negatively contribute to the similarity (for word analogy): `~ boy - he + she`, `~ king - man + woman`, `~ Beijing - China + Japan`
`topn`	Top-N most similar words. Defaults to `10`.
`above`	Defaults to `NULL`. Can be:
  - a threshold value, to find all words with cosine similarities higher than this value
  - a critical word, to find all words with cosine similarities higher than that of this critical word
  If both `topn` and `above` are specified, `above` wins.
`keep`	Keep words specified in `x` in results? Defaults to `FALSE`.
`row.id`	Return the row number of each word? Defaults to `TRUE`, which may help determine the relative word frequency in some cases.
`verbose`	Print information to the console? Defaults to `TRUE`.
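
The formula form of `x` can be read as a signed list of words. The base R sketch below shows one way such a formula could be split into positive and negative words and combined into a single target vector. The helper `signed_words()` and the toy matrix `emb` are hypothetical and only illustrate the idea; how PsychWordVec parses formulas and weights vectors internally is not stated on this page.

```r
# Hypothetical helper (not part of PsychWordVec): walk the formula's call tree
# and record each word together with the sign it carries
signed_words <- function(f) {
  walk <- function(e, sign) {
    if (is.name(e)) return(setNames(sign, as.character(e)))
    op <- as.character(e[[1]])
    if (op == "(") return(walk(e[[2]], sign))
    stopifnot(op %in% c("+", "-"))
    if (length(e) == 2) {                 # unary + or -
      return(walk(e[[2]], if (op == "-") -sign else sign))
    }
    c(walk(e[[2]], sign), walk(e[[3]], if (op == "-") -sign else sign))
  }
  walk(f[[2]], 1)   # f[[2]] is the right-hand side of a one-sided formula
}

w <- signed_words(~ king - man + woman)
w   # king = 1, man = -1, woman = 1

# Toy vectors (made up): dimensions stand for royal, male, female
emb <- rbind(
  king  = c(1, 1, 0),
  queen = c(1, 0, 1),
  man   = c(0, 1, 0),
  woman = c(0, 0, 1)
)

# Add the positive words and subtract the negative ones
target <- colSums(emb[names(w), , drop = FALSE] * w)
target
```

In this toy setup the combined vector coincides with the row for `queen`, which is the intuition behind the word-analogy examples.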

Value

A data.table with the most similar words and their cosine similarities.
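
Which rows end up in the result depends on `topn` and `above` as described under Arguments. The base R sketch below mirrors those documented rules on a made-up similarity vector. The helper `select_similar()` and the similarity values are assumptions for illustration, not the package's implementation or real output.

```r
# Made-up cosine similarities to a query word, for illustration only
sims <- c(Beijing = 0.82, Shanghai = 0.74, Japan = 0.71, Tokyo = 0.66, apple = 0.12)

# Hypothetical helper (not part of PsychWordVec) mirroring the documented rules
select_similar <- function(sims, topn = 10, above = NULL) {
  sims <- sort(sims, decreasing = TRUE)
  if (is.null(above)) return(head(sims, topn))   # plain Top-N
  # A critical word is translated into its own similarity value
  cutoff <- if (is.character(above)) sims[[above]] else above
  sims[sims > cutoff]                            # `above` overrides `topn`
}

select_similar(sims, topn = 2)            # Beijing, Shanghai
select_similar(sims, above = 0.7)         # Beijing, Shanghai, Japan
select_similar(sims, above = "Shanghai")  # Beijing
```

The comparison is strict here, following the wording "higher than" in the `above` argument.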

Download

Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf

Examples

d = as_embed(demodata, normalize=TRUE)

most_similar(d)
most_similar(d, "China")
most_similar(d, c("king", "queen"))
most_similar(d, cc(" king , queen ; man | woman "))

# the same as above:
most_similar(d, ~ China)
most_similar(d, ~ king + queen)
most_similar(d, ~ king + queen + man + woman)

most_similar(d, ~ boy - he + she)
most_similar(d, ~ Jack - he + she)
most_similar(d, ~ Rose - she + he)

most_similar(d, ~ king - man + woman)
most_similar(d, ~ Tokyo - Japan + China)
most_similar(d, ~ Beijing - China + Japan)

most_similar(d, "China", above=0.7)
most_similar(d, "China", above="Shanghai")

# automatically normalized for more accurate results
ms = most_similar(demodata, ~ king - man + woman)
ms
str(ms)

[Package PsychWordVec version 2023.9 Index]