imputeX {cmfrec} | R Documentation |
Impute missing entries in 'X' data
Description
Replace 'NA'/'NaN' values in new 'X' data according to the model predictions, given that same 'X' data and optionally 'U' data.
Note: this function does not perform any internal re-indexing of the data. If the 'X' to which the model was fit was a 'data.frame', the internal numbering of the items is stored under 'model$info$item_mapping'. The function predict_new can be used instead to let the model do the appropriate re-indexing.
Usage
imputeX(
model,
X,
weight = NULL,
U = NULL,
U_bin = NULL,
nthreads = model$info$nthreads
)
Arguments
model |
A collective matrix factorization model as output by function CMF. This functionality is not available for the other model classes. |
X |
New 'X' data with missing values which will be imputed. Must be passed as a dense matrix from base R (class 'matrix'). |
weight |
Associated observation weights for entries in 'X'. If passed, must have the same shape as 'X'. |
U |
New 'U' data, with rows matching to those of 'X'. Can be passed in the following formats:
|
U_bin |
New binary columns of 'U' (rows matching to those of 'X'). Must be passed as a dense matrix from base R or as a 'data.frame'. |
nthreads |
Number of parallel threads to use. |
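As a rough illustration of the 'U' argument, the sketch below fits a model with dense side information and then passes the same row-aligned 'U' when imputing. The data is synthetic, and the shapes, the seed, and the choice of 'k' are assumptions made only for this example.

```r
library(cmfrec)

set.seed(123)
n_rows <- 20
X <- matrix(rnorm(n_rows * 10), nrow = n_rows)  # main data, 20 x 10
U <- matrix(rnorm(n_rows * 3), nrow = n_rows)   # side information, 20 x 3

# Hide 30 entries of X at random
X_na <- X
X_na[sample(length(X_na), 30)] <- NaN

# Fit with side information, then impute passing a 'U' whose rows
# match the rows of the 'X' being imputed
model <- CMF(X_na, U = U, k = 2, verbose = FALSE, nthreads = 1L)
X_imputed <- imputeX(model, X_na, U = U, nthreads = 1L)
```

Row i of 'U' is taken to describe row i of 'X', so both matrices need the same number of rows in the same order.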
Details
If using the matrix factorization model as a general missing-value imputer, it's recommended to:
Fit a model without user biases.
Set a lower regularization for the item biases than for the matrices.
Tune the regularization parameter(s) very well.
In general, matrix factorization works better for imputing selected entries of sparse-and-wide matrices. For dense matrices, the method is unlikely to give better results than mean/median imputation; it is nevertheless provided for experimentation purposes.
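One possible way to follow the recommendations above is to tune the regularization on held-out entries. The sketch below is illustrative only: the synthetic low-rank data, the candidate grid, and the 10% holdout are assumptions. It also assumes the six-element 'lambda' form used in the Examples section of this page, with the second element taken to be the item-bias regularization (here set to a tenth of the value used for the matrices).

```r
library(cmfrec)

set.seed(1)
# Synthetic low-rank data plus noise, 200 x 30
A <- matrix(rnorm(200 * 3), nrow = 200)
B <- matrix(rnorm(30 * 3), nrow = 30)
X <- A %*% t(B) + matrix(rnorm(200 * 30, sd = 0.1), nrow = 200)

# Entries treated as genuinely missing
X_na <- X
X_na[matrix(runif(length(X)) < 0.3, nrow = nrow(X))] <- NaN

# Hold out a further 10% of the observed entries for validation
observed <- which(!is.na(X_na))
holdout <- sample(observed, size = round(0.1 * length(observed)))
X_train <- X_na
X_train[holdout] <- NaN

# Fit one model per candidate value and score it on the held-out entries
candidates <- c(0.01, 0.1, 1, 10)
rmse <- sapply(candidates, function(lam) {
    m <- CMF(X_train, k = 3,
             lambda = c(0, lam / 10, lam, lam, lam, lam),
             user_bias = FALSE,
             verbose = FALSE, nthreads = 1L)
    pred <- imputeX(m, X_train, nthreads = 1L)
    sqrt(mean((X_na[holdout] - pred[holdout])^2))
})
best_lambda <- candidates[which.min(rmse)]
```

Once a value has been chosen, the model would typically be refit on 'X_na' (with the held-out entries restored) before producing the final imputations.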
Value
The 'X' matrix with its missing values imputed according to the model predictions.
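A few quick checks can help confirm that the returned matrix looks as described. The sketch below reuses the small sequential matrix from the Examples section. The checks reflect what one would expect from the description above (same shape, no remaining missing values, observed entries carried over) rather than guaranteed behaviour.

```r
library(cmfrec)

X <- matrix(as.numeric(1:50), nrow = 10)
X[2, 1] <- NaN
X[8, 3] <- NaN

m <- CMF(X, k = 1, lambda = 1e-10, nthreads = 1L, verbose = FALSE)
X_imp <- imputeX(m, X)

was_missing <- is.na(X)
dim(X_imp)                                       # expected to match dim(X)
anyNA(X_imp)                                     # expected FALSE once gaps are filled
all.equal(X_imp[!was_missing], X[!was_missing])  # observed entries expected unchanged
X_imp[was_missing]                               # the two model-based imputations
```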
Examples
library(cmfrec)
### Simplest example
SeqMat <- matrix(1:50, nrow=10)
SeqMat[2,1] <- NaN
SeqMat[8,3] <- NaN
m <- CMF(SeqMat, k=1, lambda=1e-10, nthreads=1L, verbose=FALSE)
imputeX(m, SeqMat)
### Better example with multivariate normal data
if (require("MASS")) {
    ### Generate random data, set some values as NA
    set.seed(1)
    n_rows <- 1000
    n_cols <- 5
    mu <- rnorm(n_cols)
    S <- matrix(rnorm(n_cols^2), nrow = n_cols)
    S <- t(S) %*% S
    X <- MASS::mvrnorm(n_rows, mu, S)
    X_na <- X
    values_NA <- matrix(runif(n_rows*n_cols) < .15, nrow=n_rows)
    X_na[values_NA] <- NaN

    ### In the event that any column is fully missing,
    ### drop it from all three matrices so their shapes stay aligned
    cols_remove <- colSums(is.na(X_na)) == n_rows
    if (any(cols_remove)) {
        X <- X[, !cols_remove, drop=FALSE]
        X_na <- X_na[, !cols_remove, drop=FALSE]
        values_NA <- values_NA[, !cols_remove, drop=FALSE]
    }

    ### Impute missing values with model
    model <- CMF(X_na, k=3, lambda=c(0,0,1,1,1,1),
                 user_bias=FALSE,
                 verbose=FALSE, nthreads=1L)
    X_imputed <- imputeX(model, X_na)
    cat(sprintf("RMSE for imputed values w/model: %f\n",
                sqrt(mean((X[values_NA] - X_imputed[values_NA])^2))))

    ### Compare against simple mean imputation
    X_means <- apply(X_na, 2, mean, na.rm=TRUE)
    X_imp_mean <- X_na
    ### Loop over the columns actually kept, not the original count
    for (cl in seq_len(ncol(X_na)))
        X_imp_mean[values_NA[,cl], cl] <- X_means[cl]
    cat(sprintf("RMSE for imputed values w/means: %f\n",
                sqrt(mean((X[values_NA] - X_imp_mean[values_NA])^2))))
}