data_wordvec_subset {PsychWordVec}

R Documentation

Extract a subset of word vectors data (with S3 methods).

Description

Extract a subset of word vectors data (with S3 methods). You may specify either a wordvec or embed object (loaded by data_wordvec_load) or an .RData file (transformed by data_transform).

Usage

data_wordvec_subset(
  x,
  words = NULL,
  pattern = NULL,
  as = c("wordvec", "embed"),
  file.save,
  compress = "bzip2",
  compress.level = 9,
  verbose = TRUE
)

## S3 method for class 'wordvec'
subset(x, ...)

## S3 method for class 'embed'
subset(x, ...)

Arguments

`x`	Can be: (1) a `wordvec` or `embed` object loaded by `data_wordvec_load`; (2) an .RData file transformed by `data_transform`.
`words`	[Option 1] Character string(s).
`pattern`	[Option 2] Regular expression (see `str_subset`). If neither `words` nor `pattern` is specified (i.e., both are `NULL`), then all words in the data will be extracted.
`as`	Reshape to `wordvec` (data.table) or `embed` (matrix). Defaults to the original class of `x`.
`file.save`	File name of to-be-saved R data (must be .RData).
`compress`	Compression method for the saved file. Defaults to `"bzip2"`. Options include: `1` or `"gzip"`: modest file size (fastest); `2` or `"bzip2"`: small file size (fast); `3` or `"xz"`: minimized file size (slow).
`compress.level`	Compression level from `0` (none) to `9` (maximal compression for minimal file size). Defaults to `9`.
`verbose`	Print information to the console? Defaults to `TRUE`.
`...`	Parameters passed to `data_wordvec_subset` when using the S3 method `subset`.
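
The `compress` and `compress.level` arguments name the same methods that base R's `save()` accepts through its `compress` and `compression_level` arguments. The sketch below uses only base R on a made-up toy matrix to show how one might compare the three methods. It assumes, without confirmation from this page, that the package saves files in a comparable way; `toy` and `sizes` are hypothetical names.

```r
# Toy stand-in for word vectors: 200 "words" x 50 dimensions (hypothetical data).
set.seed(1)
toy <- matrix(round(rnorm(200 * 50), 3), nrow = 200,
              dimnames = list(paste0("word", 1:200), NULL))

# Save the same object with each compression method and record the file size.
sizes <- numeric(0)
for (m in c("gzip", "bzip2", "xz")) {
  f <- tempfile(fileext = ".RData")
  save(toy, file = f, compress = m, compression_level = 9)
  sizes[m] <- file.size(f)
  unlink(f)  # remove the temporary file
}
print(sizes)  # sizes in bytes, named by method

# Round trip: a compressed .RData file restores the object unchanged.
f <- tempfile(fileext = ".RData")
save(toy, file = f, compress = "bzip2", compression_level = 9)
env <- new.env()
load(f, envir = env)
roundtrip_ok <- identical(env$toy, toy)
unlink(f)
```

The actual sizes and timings depend on the data, so it may be worth running a comparison like this on a small subset before saving a large file.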

Value

A wordvec or embed object containing only the requested words that are valid (available) in the data.
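
To illustrate what "valid (available) words" means, here is a minimal base-R sketch of selecting rows from a toy matrix either by an exact word list or by a regular expression. `subset_rows` and `emb` are hypothetical and not part of PsychWordVec. Giving `words` priority over `pattern`, keeping the requested word order, and using base `grep()` rather than `str_subset` are all assumptions of this sketch, not documented behavior.

```r
# Toy embedding matrix: rows are words, columns are dimensions (hypothetical data).
emb <- matrix(seq_len(5 * 3), nrow = 5,
              dimnames = list(c("China", "Chinese", "Japan", "Korea", "people"),
                              paste0("dim", 1:3)))

# Hypothetical helper, not the package's implementation.
subset_rows <- function(emb, words = NULL, pattern = NULL) {
  vocab <- rownames(emb)
  if (!is.null(words)) {
    keep <- words[words %in% vocab]             # silently drop unavailable words
  } else if (!is.null(pattern)) {
    keep <- grep(pattern, vocab, value = TRUE)  # regular-expression match
  } else {
    keep <- vocab                               # neither given: keep all words
  }
  emb[keep, , drop = FALSE]
}

by_words   <- subset_rows(emb, words = c("Korea", "China", "Mars"))
by_pattern <- subset_rows(emb, pattern = "Chin[ae]|Japan|Korea")
rownames(by_words)    # "Korea" "China" ("Mars" is not in the data)
rownames(by_pattern)  # "China" "Chinese" "Japan" "Korea"
```

Note that `str_subset` (stringr) uses a different regular-expression engine than base `grep()`, so more advanced patterns may behave differently between the two.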

Download

Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf

Examples

## directly use `embed[i, j]` (3x faster than `wordvec`):
d = as_embed(demodata)
d[1:5]
d["people"]
d[c("China", "Japan", "Korea")]

## specify `x` as a `wordvec` or `embed` object:
subset(demodata, c("China", "Japan", "Korea"))
subset(d, pattern="^Chi")

## specify `x` and `pattern`, and save with `file.save`:
subset(demodata, pattern="Chin[ae]|Japan|Korea",
       file.save="subset.RData")

## load the subset:
d.subset = load_wordvec("subset.RData")
d.subset

## specify `x` as an .RData file and save with `file.save`:
data_wordvec_subset("subset.RData",
                    words=c("China", "Chinese"),
                    file.save="new.subset.RData")
d.new.subset = load_embed("new.subset.RData")
d.new.subset

unlink("subset.RData")  # delete file for code check
unlink("new.subset.RData")  # delete file for code check

[Package PsychWordVec version 2023.9 Index]