data_wordvec_subset {PsychWordVec}    R Documentation

Extract a subset of word vectors data (with S3 methods).

Description

Extract a subset of word vectors data (with S3 methods). You may specify either a wordvec or embed object (loaded by data_wordvec_load) or an .RData file (transformed by data_transform).

Usage

data_wordvec_subset(
  x,
  words = NULL,
  pattern = NULL,
  as = c("wordvec", "embed"),
  file.save,
  compress = "bzip2",
  compress.level = 9,
  verbose = TRUE
)

## S3 method for class 'wordvec'
subset(x, ...)

## S3 method for class 'embed'
subset(x, ...)

Arguments

x

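Can be either a wordvec or embed object (loaded by data_wordvec_load) or an .RData file (transformed by data_transform).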

words

[Option 1] Character string(s).

pattern

[Option 2] Regular expression (see str_subset). If neither words nor pattern is specified (i.e., both are NULL), then all words in the data will be extracted.
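As a rough illustration of the two selection modes, the sketch below uses base R on a made-up vocabulary (the `vocab` vector is hypothetical and not taken from the package data). It uses `grep()` as a stand-in for `str_subset`; the two should agree on simple patterns like this one, although regex flavors can differ on more advanced syntax. It is not the package's implementation.

```r
# Hypothetical vocabulary, for illustration only
vocab <- c("China", "Chinese", "Chile", "Japan", "Korea", "people")

# Option 1: exact words; a word absent from the vocabulary is dropped
words <- c("China", "Japan", "Atlantis")
by_words <- intersect(words, vocab)

# Option 2: a regular expression matched against every word
by_pattern <- grep("Chin[ae]|Japan|Korea", vocab, value = TRUE)

print(by_words)
print(by_pattern)
```

Note that "Chile" is not selected by the pattern, because `Chin[ae]` requires "Chin" followed by "a" or "e".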

as

Reshape to wordvec (data.table) or embed (matrix). Defaults to the original class of x.

file.save

File name for saving the extracted subset as R data (must be .RData).

compress

Compression method for the saved file. Defaults to "bzip2".

Options include:

  • 1 or "gzip": modest file size (fastest)

  • 2 or "bzip2": small file size (fast)

  • 3 or "xz": minimized file size (slow)

compress.level

Compression level from 0 (none) to 9 (maximal compression for minimal file size). Defaults to 9.
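The compression options above resemble those of base R's `save()`. The following sketch assumes that correspondence (the source does not state how the file is written) and compares file sizes for the three methods on a toy matrix. The matrix `m` and the temporary file names are made up for illustration.

```r
# Toy matrix standing in for word vectors (random values, illustration only)
set.seed(1)
m <- matrix(round(rnorm(1000 * 50), 2), nrow = 1000)

# Save the same object with each compression method at level 9
sizes <- c(gzip = NA_real_, bzip2 = NA_real_, xz = NA_real_)
for (method in names(sizes)) {
  f <- tempfile(fileext = ".RData")
  save(m, file = f, compress = method, compression_level = 9)
  sizes[method] <- file.size(f)  # size in bytes
  unlink(f)                      # clean up the temporary file
}

print(sizes)
```

Actual sizes and timings depend on the data, so it may be worth running a comparison like this on a sample of your own vectors before choosing a method.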

verbose

Print information to the console? Defaults to TRUE.

...

Parameters passed to data_wordvec_subset when using the S3 method subset.
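To show how an S3 `subset` method can forward its `...` arguments to a worker function, here is a minimal sketch with a toy class. The class `toyvec` and the function `toy_subset` are hypothetical names invented for this example; they only mimic the calling pattern and are not the package's code.

```r
# Hypothetical worker function (a stand-in, not data_wordvec_subset itself)
toy_subset <- function(x, words = NULL, pattern = NULL) {
  keep <- names(x)
  if (!is.null(words)) keep <- intersect(words, keep)
  if (!is.null(pattern)) keep <- grep(pattern, keep, value = TRUE)
  structure(unclass(x)[keep], class = "toyvec")
}

# S3 method: subset() dispatches here for objects of class "toyvec"
# and passes every extra argument on through `...`
subset.toyvec <- function(x, ...) toy_subset(x, ...)

x <- structure(list(China = 1, Japan = 2, Korea = 3), class = "toyvec")

res_pattern <- subset(x, pattern = "^J")        # named argument
res_words <- subset(x, c("Korea", "China"))     # positional: matched to `words`

print(names(res_pattern))
print(names(res_words))
```

With this pattern, arguments can be given by name or by position, as in the calls under Examples.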

Value

A wordvec or embed object containing only the valid (available) words, that is, the requested words that exist in the data.

Download

Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf

See Also

as_wordvec / as_embed

load_wordvec / load_embed

get_wordvec

data_transform

Examples

## directly use `embed[i, j]` (3x faster than `wordvec`):
d = as_embed(demodata)
d[1:5]
d["people"]
d[c("China", "Japan", "Korea")]

## specify `x` as a `wordvec` or `embed` object:
subset(demodata, c("China", "Japan", "Korea"))
subset(d, pattern="^Chi")

## specify `x` and `pattern`, and save with `file.save`:
subset(demodata, pattern="Chin[ae]|Japan|Korea",
       file.save="subset.RData")

## load the subset:
d.subset = load_wordvec("subset.RData")
d.subset

## specify `x` as an .RData file and save with `file.save`:
data_wordvec_subset("subset.RData",
                    words=c("China", "Chinese"),
                    file.save="new.subset.RData")
d.new.subset = load_embed("new.subset.RData")
d.new.subset

unlink("subset.RData")  # delete file for code check
unlink("new.subset.RData")  # delete file for code check


[Package PsychWordVec version 2023.9 Index]