standardize.lspace {lingmatch} | R Documentation |
Standardize a Latent Semantic Space
Description
Reformat a .rda file that contains a matrix with terms as row names, or a plain-text embeddings file that has a term at the start of each line and a consistent delimiting character. Plain-text files are processed line by line, so large spaces can be reformatted without loading the whole file into memory.
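The line-by-line idea can be sketched roughly as follows. This is not the package's implementation: the sample file, the simplified term pattern, the rounding to 2 digits, and the output layout (one file of values, one file of terms) are all assumptions made for illustration.

```r
# Hypothetical illustration only; not lingmatch's actual code.
# Build a tiny plain-text "embeddings" file: a term, then space-separated values.
infile <- tempfile(fileext = ".txt")
writeLines(c("apple 0.1234567891 -0.5", "123 0.2 0.3", "pear 1 2"), infile)

outfile <- tempfile(fileext = ".dat")  # assumed layout: values only
termfile <- tempfile(fileext = ".txt") # assumed layout: one term per line
term_check <- "^[a-zA-Z]+$"            # simplified stand-in for the default pattern

con <- file(infile, open = "r")
out <- file(outfile, open = "w")
terms <- character()
# Read one line at a time so the full file never has to sit in memory.
while (length(line <- readLines(con, n = 1L)) > 0L) {
  parts <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (!grepl(term_check, parts[1])) next      # skip lines whose term fails the check
  values <- round(as.numeric(parts[-1]), 2)   # round values (2 digits here, for brevity)
  writeLines(paste(values, collapse = " "), out)
  terms <- c(terms, parts[1])
}
close(con)
close(out)
writeLines(terms, termfile)
```

Only one line is held at a time, which is why this style of processing scales to spaces that are larger than available RAM.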
Usage
standardize.lspace(infile, name, sep = " ", digits = 9,
dir = getOption("lingmatch.lspace.dir"), outdir = dir, remove = "",
term_check = "^[a-zA-Z]+$|^['a-zA-Z][a-zA-Z.'\\/-]*[a-zA-Z.]$",
verbose = FALSE)
Arguments
infile |
Name of the .rda or plain-text file relative to |
name |
Base name of the reformatted file and term file; e.g., "glove" would result in
|
sep |
Delimiting character between values in each line, e.g., |
digits |
Number of digits to round values to; default is 9. |
dir |
Path to folder containing |
outdir |
Path to folder in which to save standardized files; default is |
remove |
A string with a regex pattern to be removed from term names |
term_check |
A string with a regex pattern by which to filter terms; i.e., only lines with fully
matched terms are written to the reformatted file. The default attempts to retain only regular words, including
those with dashes, forward slashes, and periods. Set to an empty string ( |
verbose |
Logical: if |
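To get a feel for what the default term_check pattern retains, it can be applied to a few terms with grepl. The pattern below is copied from Usage; the sample terms are hypothetical, and the function's internal matching may differ in detail from this plain grepl call.

```r
# Default pattern copied from the Usage section.
term_check <- "^[a-zA-Z]+$|^['a-zA-Z][a-zA-Z.'\\/-]*[a-zA-Z.]$"

# Hypothetical terms, chosen only for illustration.
terms <- c("word", "well-known", "and/or", "u.s.", "don't",
           "123", "hello!", "-dash")

keep <- grepl(term_check, terms)
kept <- terms[keep]      # terms that fully match the pattern
dropped <- terms[!keep]  # terms with digits, other punctuation, or a leading dash
```

Here the words with internal dashes, slashes, periods, or apostrophes are kept, while the purely numeric term, the term ending in an exclamation mark, and the term starting with a dash are dropped.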
Value
Path to the standardized [1] data file and [2] terms file if applicable.
See Also
Other Latent Semantic Space functions: download.lspace(), lma_lspace(), select.lspace()
Examples
## Not run:
# from https://sites.google.com/site/fritzgntr/software-resources/semantic_spaces
standardize.lspace("EN_100k_lsa.rda", "100k_lsa")
# from https://fasttext.cc/docs/en/english-vectors.html
standardize.lspace("crawl-300d-2M.vec", "facebook_crawl")
# Standardized versions of these spaces can also be downloaded with download.lspace.
## End(Not run)