imputeglm.predict {BlockMissingData}

R Documentation

Imputation using generalized linear models for missing values

Description

Imputes missing values in a dataset using generalized linear models (GLMs). A separate model is fitted for each specified response variable, using the specified predictor variables, and the function returns the estimated coefficients and the predicted values for each response. The supported distribution families are Gaussian, binomial, and ordinal.

Usage

imputeglm.predict(X, ind_y, ind_x = -ind_y, miss, newdata, family = "gaussian")

Arguments

`X`	Data matrix containing all the variables that may contain missing values.
`ind_y`	A vector specifying the indices of response variables in the dataset.
`ind_x`	A vector specifying the indices of predictor variables in the dataset. By default, it is set to -ind_y, which means all variables other than the response variables are considered as predictors.
`miss`	A logical matrix indicating the missing values in the dataset.
`newdata`	Data matrix for which imputed values are required. It should have the same column names as the original dataset.
`family`	A character string specifying the distribution family of the GLM. Possible values are "gaussian" (default), "binomial", and "ordinal".
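
As a small illustration of the `miss` and `ind_x` arguments, the base R sketch below builds a logical missingness matrix with `is.na()` and shows which columns standard negative indexing keeps when the response columns are dropped. The toy matrix and the name `ind_x_default` are made up for this illustration, and the sketch assumes the package resolves `-ind_y` the way ordinary R indexing does.

```r
# Toy 4 x 5 matrix with one missing entry (illustrative data only)
X <- matrix(as.numeric(1:20), nrow = 4, ncol = 5)
X[2, 3] <- NA

# Logical matrix of the same shape as X, TRUE where a value is missing
miss <- is.na(X)

# Response columns, and the predictor columns a negative index keeps
# (ind_x_default is a hypothetical name used only in this sketch)
ind_y <- c(3, 4)
ind_x_default <- seq_len(ncol(X))[-ind_y]

print(ind_x_default)
print(dim(X[, -ind_y, drop = FALSE]))
```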

Value

A list containing the estimated coefficients and the imputed values for each response variable, with the following components:

`B`	A matrix of estimated coefficients, where each column contains the coefficients for one response variable and each row corresponds to a predictor variable (including the intercept term).
`PRED`	A matrix of predicted values (or imputations), where each column contains the predicted values for one response variable and each row corresponds to an observation in `newdata` (if provided).
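
To make the shapes of `B` and `PRED` concrete, the following base R sketch fits one Gaussian GLM per response column on the fully observed rows and predicts the rows that need imputation. It is only an illustration under stated assumptions: it uses `stats::glm` on simulated data, the variable names and toy data are made up, and it is not the implementation used by `imputeglm.predict`.

```r
# Illustrative sketch only: NOT the BlockMissingData implementation.
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 4), n, 4)
colnames(X) <- paste0("V", 1:4)
X[, 3] <- X[, 1] + 0.5 * X[, 2] + rnorm(n, sd = 0.1)
X[81:100, 3:4] <- NA                  # block of missing responses

ind_y <- 3:4                          # response columns
ind_x <- 1:2                          # predictor columns
obs <- complete.cases(X)              # rows with every variable observed
newdata <- X[!obs, , drop = FALSE]    # rows that need imputation

# One Gaussian GLM per response, fitted on the observed rows
fits <- lapply(ind_y, function(j) {
  d <- data.frame(y = X[obs, j], X[obs, ind_x, drop = FALSE])
  glm(y ~ ., data = d, family = gaussian())
})

# Coefficients: one column per response, rows = intercept + predictors
B <- sapply(fits, coef)

# Predictions: one column per response, rows = observations in newdata
nd <- data.frame(newdata[, ind_x, drop = FALSE])
PRED <- sapply(fits, predict, newdata = nd, type = "response")

print(dim(B))
print(dim(PRED))
```

With two predictors and two responses, `B` has one row for the intercept plus one per predictor, and `PRED` has one row per observation in `newdata`, matching the layout described above.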

Author(s)

Fei Xue and Annie Qu

Examples


library(MASS)

# Number of subjects
n <- 700

# Number of total covariates
p <- 40

# Number of missing groups of subjects
ngroup <- 4

# Number of data sources
nsource <- 4

# Starting indexes of covariates in data sources
cov_index <- c(1, 13, 25, 37)

# Starting indexes of subjects in missing groups
sub_index <- c(1, 31, 251, 471)

# Indexes of missing data sources in missing groups, respectively ('NULL' represents no missing)
miss_source <- list(NULL, 3, 2, 1)

# Create a design matrix
set.seed(1)
sigma <- diag(1 - 0.4, p, p) + matrix(0.4, p, p)
X <- mvrnorm(n, rep(0, p), sigma)

# Introduce some block-wise missing
for (i in 1:ngroup) {
  src <- miss_source[[i]]
  if (!is.null(src)) {
    # Last subject of this missing group and last covariate of the missing source
    row_end <- if (i == ngroup) n else sub_index[i + 1] - 1
    col_end <- if (src == nsource) p else cov_index[src + 1] - 1
    X[sub_index[i]:row_end, cov_index[src]:col_end] <- NA
  }
}

# Define missing data pattern
miss <- is.na(X)
# Choose response and predictor variables
ind_y <- 25:36
ind_x <- 13:24
# Data that need imputation
newdata <- X[31:250,]
# Use the function
result <- imputeglm.predict(X = X, ind_y = ind_y, ind_x = ind_x, miss = miss, newdata = newdata)

[Package BlockMissingData version 0.1.0 Index]