dtm_align {udpipe} | R Documentation |
Reorder a Document-Term-Matrix alongside a vector or data.frame
Description
This utility function aligns a Document-Term-Matrix with information in a data.frame or a vector to predict, such that both the predictive information and the target are available in the same order.
Matching is done based on the identifiers in the rownames of x and either the names of the y vector or the first column of y in case it is a data.frame.
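The matching described above can be sketched in base R. This is only an illustration under stated assumptions, not the udpipe implementation: the helper name align_sketch is hypothetical, it handles only a named vector y, and it assumes x has rownames.

```r
# Hypothetical helper, NOT the udpipe implementation:
# reorders the rows of x so that they follow the names of y.
align_sketch <- function(x, y) {
  keep <- names(y) %in% rownames(x)    # drop targets without a matching row in x
  y    <- y[keep]
  idx  <- match(names(y), rownames(x)) # row position in x for each remaining target
  list(x = x[idx, , drop = FALSE], y = y)
}

x   <- matrix(1:9, nrow = 3, dimnames = list(c("a", "b", "c")))
out <- align_sketch(x, c(b = 1, a = 2, c = 6, d = 6))
rownames(out$x)
names(out$y)
```

Because match() returns one row position per element of y, an identifier that occurs several times in y would repeat the corresponding row of x in this sketch.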
Usage
dtm_align(x, y, FUN, ...)
Arguments
x | a Document-Term-Matrix of class dgCMatrix (which can be an object returned by |
y | either a vector or data.frame containing something to align with |
FUN | a function to be applied on |
... | further arguments passed on to FUN |
Value
a list with elements x and y containing the document term matrix x in the same order as y.
If in y a vector was passed, the returned y element will be a vector.
If in y a data.frame was passed with more than 2 columns, the returned y element will be a data.frame.
If in y a data.frame was passed with exactly 2 columns, the returned y element will be a vector.
Only returns data of x with overlapping identifiers in y.
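The rule for the shape of the returned y element can be illustrated with a small base R sketch. The helper shape_y_sketch and the toy listings data are hypothetical, and keeping the identifiers as names on the resulting vector is an assumption of this sketch rather than documented behaviour of dtm_align.

```r
# Hypothetical helper illustrating the documented return shapes; not udpipe code.
shape_y_sketch <- function(y) {
  if (is.data.frame(y) && ncol(y) == 2) {
    # Assumption: column 1 holds the identifiers, column 2 the values to predict.
    # Keeping the identifiers as names is a choice made for this sketch.
    return(setNames(y[[2]], y[[1]]))
  }
  y  # vectors and data.frames with more than 2 columns are returned unchanged
}

# Toy data, invented for illustration only
listings <- data.frame(id = c("a", "b"), price = c(10, 20), type = c("x", "y"),
                       stringsAsFactors = FALSE)
two_col <- shape_y_sketch(listings[, c("id", "price")])
wide    <- shape_y_sketch(listings)
is.data.frame(two_col)
is.data.frame(wide)
```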
See Also
Examples
x <- matrix(1:9, nrow = 3, dimnames = list(c("a", "b", "c")))
x
dtm_align(x = x,
          y = c(b = 1, a = 2, c = 6, d = 6))
dtm_align(x = x,
          y = c(b = 1, a = 2, c = 6, d = 6, d = 7, a = -1))
data(brussels_reviews)
data(brussels_listings)
x <- brussels_reviews
x <- strsplit.data.frame(x, term = "feedback", group = "listing_id")
x <- document_term_frequencies(x)
x <- document_term_matrix(x)
y <- brussels_listings$price
names(y) <- brussels_listings$listing_id
## align a matrix of predictors with a vector to predict
trainset <- dtm_align(x = x, y = y)
trainset <- dtm_align(x = x, y = y, FUN = function(dtm){
  dtm <- dtm_remove_lowfreq(dtm, minfreq = 5)
  dtm <- dtm_sample(dtm)
  dtm
})
head(names(y))
head(rownames(x))
head(names(trainset$y))
head(rownames(trainset$x))
## align a matrix of predictors with a data.frame
trainset <- dtm_align(x = x, y = brussels_listings[, c("listing_id", "price")])
trainset <- dtm_align(x = x,
                      y = brussels_listings[, c("listing_id", "price", "room_type")])
head(trainset$y$listing_id)
head(rownames(trainset$x))
## example with duplicate data in case of data balancing
dtm_align(x = matrix(1:30, nrow = 3, dimnames = list(c("a", "b", "c"))),
          y = c(a = 1, a = 2, b = 3, d = 6, b = 6))
target <- subset(brussels_listings, listing_id %in% brussels_reviews$listing_id)
target <- rbind(target[1:3, ], target[c(2, 3), ], target[c(1, 4), ])
trainset <- dtm_align(x = x, y = target[, c("listing_id", "price")])
trainset <- dtm_align(x = x, y = setNames(target$price, target$listing_id))
names(trainset$y)
rownames(trainset$x)