most_similar {PsychWordVec}    R Documentation

Find the Top-N most similar words.

Description

Find the Top-N most similar words, replicating the results of the most_similar() function in the Python gensim module. (Exact replication of gensim requires the same word vectors data, not the demodata used in the examples here.)

Usage

most_similar(
  data,
  x = NULL,
  topn = 10,
  above = NULL,
  keep = FALSE,
  row.id = TRUE,
  verbose = TRUE
)

Arguments

data

A wordvec (data.table) or embed (matrix), see data_wordvec_load.

x

Can be:

  • NULL: use the sum of all word vectors in data

  • a single word:

    "China"

  • a character vector of words:

    c("king", "queen")

    cc(" king , queen ; man | woman")

  • an R formula (~ xxx) specifying words that positively and negatively contribute to the similarity (for word analogy):

    ~ boy - he + she

    ~ king - man + woman

    ~ Beijing - China + Japan
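As a rough illustration of the formula notation, the sketch below works through the idea behind ~ king - man + woman in base R. The 3-dimensional vectors are made up, and the steps (scale to unit length, add and subtract, rank by cosine similarity, drop the query words) are an assumption about the general technique, not the package's actual implementation.

```r
# Toy 3-dimensional embedding (made-up numbers, for illustration only)
emb <- rbind(
  king  = c(0.9, 0.8, 0.1),
  queen = c(0.9, 0.1, 0.8),
  man   = c(0.1, 0.9, 0.1),
  woman = c(0.1, 0.1, 0.9),
  apple = c(0.5, 0.5, 0.5)
)

# Scale each row to unit length so that dot products equal cosine similarities
unit <- emb / sqrt(rowSums(emb^2))

# Words after "+" are added, words after "-" are subtracted: ~ king - man + woman
target <- unit["king", ] - unit["man", ] + unit["woman", ]
target <- target / sqrt(sum(target^2))

# Cosine similarity of every word with the target vector
sims <- drop(unit %*% target)

# Drop the query words (analogous to keep = FALSE) and rank the rest
candidates <- sims[!names(sims) %in% c("king", "man", "woman")]
result <- sort(candidates, decreasing = TRUE)
print(result)
```

With these toy numbers, "queen" ranks above "apple". Real embeddings have hundreds of dimensions, but the arithmetic is the same.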

topn

Number of most similar words to return (Top-N). Defaults to 10.

above

Defaults to NULL. Can be:

  • a threshold value to find all words with cosine similarities higher than this value

  • a critical word to find all words whose cosine similarity is higher than that of this critical word

If both topn and above are specified, above takes precedence.
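The difference between the two selection modes can be sketched in base R. The similarity values below are hypothetical, and the filtering logic is only an assumed reading of the descriptions above, not the package's code.

```r
# Hypothetical cosine similarities to some query word (made-up values)
sims <- c(Shanghai = 0.82, Beijing = 0.88, Tokyo = 0.71,
          apple = 0.15, Japan = 0.74)
ranked <- sort(sims, decreasing = TRUE)

# Top-N selection (analogous to topn = 3)
top3 <- head(ranked, 3)

# Numeric threshold (analogous to above = 0.7): keep all strictly higher values
above_value <- ranked[ranked > 0.7]

# Critical word (analogous to above = "Shanghai"):
# use the similarity of that word as the threshold
above_word <- ranked[ranked > ranked[["Shanghai"]]]

print(top3)
print(above_value)
print(above_word)
```

Note that a threshold can return more or fewer words than topn would, which is why only one of the two rules can apply at a time.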

keep

Keep words specified in x in results? Defaults to FALSE.

row.id

Return the row number of each word? Defaults to TRUE, which may help determine the relative word frequency in some cases.

verbose

Print information to the console? Defaults to TRUE.

Value

A data.table with the most similar words and their cosine similarities.
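For readers unfamiliar with the measure, cosine similarity can be sketched in base R as follows. The vectors and the helper names cosine() and unit() are made up for illustration and are not part of the package. The example also shows why normalizing vectors first matters: raw dot products depend on vector length, whereas cosine similarity depends only on direction.

```r
# Two made-up vectors pointing in the same direction but with different lengths
a <- c(1, 2, 3)
b <- c(10, 20, 30)
q <- c(1, 2, 2)  # hypothetical query vector

# Cosine similarity: dot product divided by the product of the vector lengths
cosine <- function(u, v) sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

# Raw dot products differ by a factor of 10 ...
raw <- c(a = sum(a * q), b = sum(b * q))

# ... but the cosine similarities are the same, because only direction matters
cos_sim <- c(a = cosine(a, q), b = cosine(b, q))

# After scaling to unit length, the plain dot product equals the cosine
unit <- function(u) u / sqrt(sum(u^2))
dot_unit <- sum(unit(a) * unit(q))

print(raw)
print(cos_sim)
print(dot_unit)
```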

Download

Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf

See Also

sum_wordvec

dict_expand

dict_reliability

cosine_similarity

pair_similarity

plot_similarity

tab_similarity

Examples

library(PsychWordVec)

# convert the demo data to a normalized embedding matrix
d = as_embed(demodata, normalize=TRUE)

most_similar(d)
most_similar(d, "China")
most_similar(d, c("king", "queen"))
most_similar(d, cc(" king , queen ; man | woman "))

# the same as above:
most_similar(d, ~ China)
most_similar(d, ~ king + queen)
most_similar(d, ~ king + queen + man + woman)

most_similar(d, ~ boy - he + she)
most_similar(d, ~ Jack - he + she)
most_similar(d, ~ Rose - she + he)

most_similar(d, ~ king - man + woman)
most_similar(d, ~ Tokyo - Japan + China)
most_similar(d, ~ Beijing - China + Japan)

most_similar(d, "China", above=0.7)
most_similar(d, "China", above="Shanghai")

# raw (non-normalized) data are automatically normalized for more accurate results
ms = most_similar(demodata, ~ king - man + woman)
ms
str(ms)


[Package PsychWordVec version 2023.9 Index]