costring {LSAfun} | R Documentation |
Sentence Comparison
Description
Computes cosine values between sentences and/or documents
Usage
costring(x,y,tvectors=tvectors,split=" ",remove.punctuation=TRUE,
stopwords = NULL, method ="Add")
Arguments
x | a character vector
y | a character vector
tvectors | the semantic space in which the computation is to be done (a numeric matrix where every row is a word vector)
split | a character vector defining the character used to split the documents into words (white space by default)
remove.punctuation | removes punctuation from x and y (TRUE by default)
stopwords | a character vector defining a list of words that are not used to compute the document/sentence vector for x and y
method | the compositional model to compute the document vector from its word vectors. The default option, method="Add", computes the document vector as the sum of its word vectors (see Details)
Details
This function computes the cosine between two documents (or sentences) or the cosine between a single word and a document (or sentence).
In the traditional LSA approach, the vector D for a document (or a sentence) consisting of the words (t_1, ..., t_n) is computed as
D = \sum\limits_{i=1}^n t_i
This is the default method (method="Add") for this function. Alternatively, this function provides the possibility of computing the document vector from its word vectors using element-wise multiplication (see Mitchell & Lapata, 2010, and compose).
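The two composition models can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the toy matrix space, its values, the helper names doc_vector and cosine_sim, and the "Multiply" label used for the second branch are all made up for this example.

```r
# Toy semantic space (hypothetical values): one row per word vector
space <- rbind(
  alice  = c(1, 2, 0),
  rabbit = c(2, 1, 1),
  clock  = c(0, 1, 3)
)

# Compose a document vector from the rows named in `words`
# (hypothetical helper, not part of LSAfun)
doc_vector <- function(words, space, method = "Add") {
  vecs <- space[words, , drop = FALSE]
  if (method == "Add") {
    colSums(vecs)            # D = t_1 + ... + t_n
  } else {
    apply(vecs, 2, prod)     # element-wise product of the word vectors
  }
}

# Cosine between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

d_add  <- doc_vector(c("alice", "rabbit"), space)              # 3 3 1
d_mult <- doc_vector(c("alice", "rabbit"), space, "Multiply")  # 2 2 0
cosine_sim(d_add, space["clock", ])
```

Note that with element-wise multiplication a single zero component in any word vector sets that component of the document vector to zero, which is why the two models can give quite different cosines for the same input.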
The format of x (or y) can be of the kind x <- "word1 word2 word3", but also of the kind x <- c("word1", "word2", "word3"). This allows for simple copy-and-paste insertion of text, but also for using character vectors, e.g. the output of neighbors().
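To illustrate why both input formats are interchangeable, the following base R sketch splits each form on the split character and arrives at the same word list. The helper to_words is hypothetical and only mimics the splitting step; it is not the package's internal code.

```r
# Hypothetical helper: split every element of `x` on `split` and flatten
to_words <- function(x, split = " ") {
  unlist(strsplit(x, split = split, fixed = TRUE))
}

x1 <- "word1 word2 word3"
x2 <- c("word1", "word2", "word3")

# Both forms yield the same character vector of words
identical(to_words(x1), to_words(x2))
```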
To import a document Document.txt from a directory for comparisons, set your working directory to that directory using setwd(). Then use the following commands (shown here for a file named Alice_in_Wonderland.txt):
# Read the whole file into a single character string
fileName1 <- "Alice_in_Wonderland.txt"
x <- readChar(fileName1, file.info(fileName1)$size)
A note will be displayed whenever not all words of one input string are found in the semantic space. Caution: In that case, the function will still produce a result, by omitting the words not found in the semantic space. Depending on the specific requirements of a task, this may compromise the results. Please check your input when you receive this message.
A warning message will be displayed whenever no word of one input string is found in the semantic space.
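One way to check an input before calling the function is to compare its words against the words known to the semantic space. The sketch below assumes that the words are stored as the row names of the matrix; the toy matrix and the sentence are made up for illustration.

```r
# Toy semantic space (hypothetical): row names are the known words
space <- matrix(1:6, nrow = 3,
                dimnames = list(c("alice", "white", "rabbit"), NULL))

sentence <- "alice saw a white rabbit"
words <- unlist(strsplit(sentence, " ", fixed = TRUE))

# Words that would be omitted because they have no row in the space
missing_words <- setdiff(words, rownames(space))
missing_words

# TRUE would correspond to the warning case: no word found at all
!any(words %in% rownames(space))
```

If missing_words is non-empty, it may be worth deciding whether the omitted words matter for the comparison before interpreting the resulting cosine.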
Value
A numeric giving the cosine between the input documents/sentences
Author(s)
Fritz Guenther
References
Landauer, T.K., & Dumais, S.T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.
Dennis, S. (2007). How to use the LSA Web Site. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 35-56). Mahwah, NJ: Erlbaum.
Mitchell, J., & Lapata, M. (2010). Composition in Distributional Models of Semantics. Cognitive Science, 34, 1388-1429.
See Also
cosine, Cosine, multicos, multidocs, multicostring
Examples
data(wonderland)
costring("alice was beginning to get very tired.",
"a white rabbit with a clock ran close to her.",
tvectors=wonderland)