delete_MCAR {missMethods} | R Documentation |
Create MCAR values
Description
Create missing completely at random (MCAR) values in a data frame or a matrix
Usage
delete_MCAR(
  ds,
  p,
  cols_mis = seq_len(ncol(ds)),
  n_mis_stochastic = FALSE,
  p_overall = FALSE,
  miss_cols,
  stochastic
)
Arguments
ds |
A data frame or matrix in which missing values will be created. |
p |
A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing. |
cols_mis |
A vector of column names or indices of columns in which missing values will be created. |
n_mis_stochastic |
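Logical; should the number of missing values be stochastic? If TRUE, each value in column cols_mis[i] is missing independently with probability p[i]; if FALSE, the number of missing values is fixed (see Details). |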
p_overall |
Logical; see details. |
miss_cols |
Deprecated; use cols_mis instead. |
stochastic |
Deprecated; use n_mis_stochastic instead. |
Details
This function creates missing completely at random (MCAR) values in the dataset ds. The proportion of missing values is specified with p. The columns in which missing values are created can be set via cols_mis. If cols_mis is not specified, then missing values are created in all columns of ds.
The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally, p will be replicated to a vector of the same length as cols_mis, so all p[i] in the following paragraphs will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i].
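The difference between a single p and a per-column p can be checked with a short example. This is a sketch that assumes the missMethods package is installed; the data frame and the seed are illustrative only, and the counts in the comments follow from the default (non-stochastic) rule stated in this section.

```r
# Assumes the missMethods package is installed.
library(missMethods)

set.seed(123)  # affects which rows become NA, not how many
ds <- data.frame(X = 1:20, Y = 101:120)

# Single p: applied to every column of cols_mis (here: all columns),
# so round(20 * 0.2) = 4 values are set to NA in each column.
res_scalar <- delete_MCAR(ds, p = 0.2)
colSums(is.na(res_scalar))

# Vector p: p[i] applies to cols_mis[i],
# so round(20 * 0.1) = 2 NAs in X and round(20 * 0.3) = 6 NAs in Y.
res_vector <- delete_MCAR(ds, p = c(0.1, 0.3), cols_mis = c("X", "Y"))
colSums(is.na(res_vector))
```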
If n_mis_stochastic = FALSE and p_overall = FALSE (the default), then exactly round(nrow(ds) * p[i]) values will be set to NA in column cols_mis[i]. If n_mis_stochastic = FALSE and p_overall = TRUE, then p must be of length one and exactly round(nrow(ds) * p * length(cols_mis)) values will be set to NA (over all columns in cols_mis). This can result in a proportion of missing values in an individual column of cols_mis that is unequal to p, but the proportion of missing values in all columns together will be close to p.
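The two deterministic rules can be illustrated in base R. The following is a minimal sketch of the counting logic only, not the missMethods implementation; in particular, drawing the cells uniformly at random for p_overall = TRUE is an assumption made for illustration, and the object names (per_col, overall, na_cells) are hypothetical.

```r
# Illustrative base-R sketch; NOT the missMethods implementation.
set.seed(1)
ds <- data.frame(X = 1:20, Y = 101:120)
cols_mis <- c("X", "Y")
p <- 0.2

# p_overall = FALSE: exactly round(nrow(ds) * p) NAs in each column.
per_col <- ds
for (col in cols_mis) {
  na_rows <- sample(nrow(ds), round(nrow(ds) * p))
  per_col[na_rows, col] <- NA
}
colSums(is.na(per_col))  # 4 NAs in X and 4 NAs in Y

# p_overall = TRUE: round(nrow(ds) * p * length(cols_mis)) NAs in total,
# spread over all cells of cols_mis (assumed uniform here).
overall <- ds
n_cells <- nrow(ds) * length(cols_mis)
na_cells <- sample(n_cells, round(nrow(ds) * p * length(cols_mis)))
na_row <- (na_cells - 1) %% nrow(ds) + 1               # row of each cell
na_col <- cols_mis[(na_cells - 1) %/% nrow(ds) + 1]    # column of each cell
for (k in seq_along(na_cells)) {
  overall[na_row[k], na_col[k]] <- NA
}
sum(is.na(overall))      # 8 NAs in total
colSums(is.na(overall))  # per-column counts need not both be 4
```

In the second case only the total is fixed, which is why the per-column proportions can deviate from p.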
If n_mis_stochastic = TRUE, then each value in column cols_mis[i] has probability p[i] of being missing (independently of all other values). Therefore, the number of missing values in cols_mis[i] is a random variable with a binomial distribution B(nrow(ds), p[i]). This can (and will most of the time) lead to more or fewer missing values than round(nrow(ds) * p[i]) in column cols_mis[i]. If n_mis_stochastic = TRUE, then the argument p_overall is ignored because it is superfluous.
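The stochastic case can be sketched in base R as well. This is an illustration of the binomial behaviour described above, not the package's code; the variable names and the use of runif() to draw the independent missingness indicators are assumptions.

```r
# Illustrative base-R sketch; NOT the missMethods implementation.
set.seed(42)
n <- 20    # plays the role of nrow(ds)
p <- 0.2   # plays the role of p[i]

# Each value is missing independently with probability p.
is_missing <- runif(n) < p
x <- 1:n
x[is_missing] <- NA
sum(is.na(x))  # realized count; anywhere from 0 to n

# The count has the same distribution as one draw from B(n, p).
n_mis <- rbinom(1, size = n, prob = p)

# Probability that the count equals round(n * p) = 4 exactly.
p_exact <- dbinom(round(n * p), size = n, prob = p)
p_exact  # about 0.218
```

With these numbers the count matches round(n * p) only about 22% of the time, which is what "most of the time" refers to.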
Value
An object of the same class as ds with missing values.
References
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667.
Examples
library(missMethods)
ds <- data.frame(X = 1:20, Y = 101:120)
# Default settings: round(20 * 0.2) = 4 values are set to NA in each column
delete_MCAR(ds, 0.2)