delete_MAR_one_group {missMethods} | R Documentation |
Create MAR values by deleting values in one of two groups
Description
Create missing at random (MAR) values by deleting values in one of two groups in a data frame or a matrix
Usage
delete_MAR_one_group(
ds,
p,
cols_mis,
cols_ctrl,
cutoff_fun = median,
prop = 0.5,
use_lpSolve = TRUE,
ordered_as_unordered = FALSE,
n_mis_stochastic = FALSE,
...,
miss_cols,
ctrl_cols,
stochastic
)
Arguments
ds |
A data frame or matrix in which missing values will be created. |
p |
A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing. |
cols_mis |
A vector of column names or indices of columns in which missing values will be created. |
cols_ctrl |
A vector of column names or indices of columns that
control the creation of missing values in cols_mis. |
cutoff_fun |
Function that calculates the cutoff values in the
cols_ctrl columns. |
prop |
Numeric of length one; (minimum) proportion of rows in group 1 (only used for unordered factors). |
use_lpSolve |
Logical; should lpSolve be used for the determination of
groups, if cols_ctrl[i] is an unordered factor? |
ordered_as_unordered |
Logical; should ordered factors be treated as unordered factors. |
n_mis_stochastic |
Logical, should the number of missing values be
stochastic? If |
... |
Further arguments passed to cutoff_fun. |
miss_cols |
Deprecated, use cols_mis instead. |
ctrl_cols |
Deprecated, use cols_ctrl instead. |
stochastic |
Deprecated, use n_mis_stochastic instead. |
Details
This function creates missing at random (MAR) values in the columns
specified by the argument cols_mis.
The probability for missing values is controlled by p.
If p is a single number, then the overall probability for a value to
be missing will be p in all columns of cols_mis.
(Internally, p will be replicated to a vector of the same length as
cols_mis. So, all p[i] in the following sections will be equal to the
given single number p.)
Otherwise, p must be of the same length as cols_mis.
In this case, the overall probability for a value to be missing will be
p[i] in the column cols_mis[i].
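The replication of p can be pictured with a few lines of base R. This is only an illustrative sketch and not the package's internal code; the helper name expand_p is hypothetical.

```r
# Hypothetical helper (not part of missMethods): make sure there is
# one probability per column named in cols_mis.
expand_p <- function(p, cols_mis) {
  if (length(p) == 1L) {
    rep(p, length(cols_mis))            # single p: reuse it for every column
  } else if (length(p) == length(cols_mis)) {
    p                                   # already one p per column
  } else {
    stop("p must have length one or the same length as cols_mis")
  }
}

expand_p(0.2, c("X", "Z"))          # same probability for both columns
expand_p(c(0.1, 0.3), c("X", "Z"))  # one probability per column
```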
The position of the missing values in cols_mis[i] is controlled by
cols_ctrl[i].
The following procedure is applied for each pair of cols_ctrl[i] and
cols_mis[i] to determine the positions of missing values:
First, the rows of ds are divided into two groups.
For this purpose, cutoff_fun calculates a cutoff value for
cols_ctrl[i] (via cutoff_fun(ds[, cols_ctrl[i]], ...)).
Group 1 consists of the rows whose values in cols_ctrl[i] are below the
calculated cutoff value.
If group 1 defined in this way is empty, the rows that are equal to the
cutoff value will be added to this group (otherwise, these rows will
belong to group 2).
Group 2 consists of the remaining rows, which are not part of group 1.
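The grouping step can be sketched in base R as follows. The helper split_by_cutoff is hypothetical and only mirrors the description above; it assumes a control column without missing values and is not the package's own code.

```r
# Hypothetical helper (not part of missMethods): split row indices into
# two groups by comparing a control column with a cutoff value.
split_by_cutoff <- function(x, cutoff_fun = median, ...) {
  cutoff <- cutoff_fun(x, ...)
  in_group1 <- x < cutoff
  # If no value lies strictly below the cutoff, the rows equal to the
  # cutoff form group 1 instead.
  if (!any(in_group1)) {
    in_group1 <- x == cutoff
  }
  list(cutoff = cutoff, g1 = which(in_group1), g2 = which(!in_group1))
}

split_by_cutoff(101:120)        # cutoff 110.5: rows 1 to 10 vs. rows 11 to 20
split_by_cutoff(c(1, 1, 1, 5))  # cutoff 1: nothing below it, so the ties form group 1
```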
Now one of these two groups is chosen randomly.
In the chosen group, values are deleted in cols_mis[i].
In the other group, no missing values will be created in cols_mis[i].
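A base R sketch of the random choice and the deletion might look like this. It assumes a deterministic number of missing values, round(nrow(ds) * p), and uniform sampling of rows inside the chosen group; both are assumptions made for illustration, not statements about the package internals.

```r
set.seed(1)  # only to make the sketch reproducible
ds <- data.frame(X = 1:20, Y = 101:120)
p <- 0.2

# Two groups from the control column Y (median cutoff, as in the default)
g1 <- which(ds$Y < median(ds$Y))
g2 <- setdiff(seq_len(nrow(ds)), g1)

# Choose one of the two groups at random
chosen <- if (runif(1) < 0.5) g1 else g2

# Delete values in X, but only in rows of the chosen group
n_mis <- round(nrow(ds) * p)
ds$X[chosen[sample.int(length(chosen), n_mis)]] <- NA
```

All missing values in X then fall into a single group, while the other group stays complete.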
If p is too high, the chosen group may not contain enough
rows to reach nrow(ds) * p missing values. In this case, p
is reduced to the maximum possible value (given the randomly chosen group
with missing data) and a warning is given. This case will occur
regularly if p > 0.5. Therefore, this function should normally not be
called with p > 0.5. However, it can also occur for smaller values
of p (depending on the grouping). The warning can be silenced by
setting the option missMethods.warn.too.high.p to FALSE.
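The arithmetic behind the reduction of p can be illustrated with made-up numbers. The group size below is hypothetical, and the computation is a sketch of the idea rather than the package's code; only the option name is taken from the text above.

```r
n_rows <- 20
group_size <- 6   # hypothetical size of the randomly chosen group
p <- 0.4          # would require 8 missing values, but only 6 rows exist

# At most every row of the chosen group can become missing
p_max <- group_size / n_rows
p_effective <- min(p, p_max)   # p is reduced from 0.4 to 0.3

# Silence the warning about a too high p
options(missMethods.warn.too.high.p = FALSE)
```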
Value
An object of the same class as ds with missing values.
Treatment of factors
If ds[, cols_ctrl[i]] is an unordered factor, then the concept of a
cutoff value is not meaningful and cannot be applied.
Instead, a combination of the levels of the unordered factor is searched
that guarantees that at least a proportion of prop rows are in group 1
and that minimizes the difference between prop and the proportion of rows
in group 1.
This can be seen as a binary search problem, which is solved by the solver
from the package lpSolve, if use_lpSolve = TRUE.
If use_lpSolve = FALSE
, a very simple heuristic is applied.
The heuristic only guarantees that at least a proportion of prop
rows
are in group 1.
The choice use_lpSolve = FALSE
is not recommend and should only be
considered, if the solver of lpSolve fails.
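For a small number of levels, the search for a suitable combination of levels can be illustrated by brute force. The helper find_group1_levels below is hypothetical; it is neither the lpSolve formulation nor the simple heuristic used by the package, only a sketch of the two criteria described above.

```r
# Hypothetical brute-force helper (not part of missMethods): try every
# non-empty proper subset of levels and keep the one that covers at least
# prop of the rows while staying as close to prop as possible.
find_group1_levels <- function(f, prop = 0.5) {
  counts <- table(f)            # number of rows per level
  lev <- names(counts)
  n <- length(f)
  best <- NULL
  best_count <- Inf
  for (k in seq_len(length(lev) - 1L)) {
    for (s in combn(lev, k, simplify = FALSE)) {
      count <- sum(counts[s])
      if (count >= prop * n && count < best_count) {
        best <- s
        best_count <- count
      }
    }
  }
  list(levels = best, share = best_count / n)
}

f <- factor(rep(c("a", "b", "c"), times = c(2, 4, 4)))
find_group1_levels(f, prop = 0.5)  # levels "a" and "b" together hold 60% of the rows
```

No single level reaches 50% of the rows here, so the smallest admissible group 1 combines two levels. Enumerating all subsets grows exponentially with the number of levels, which is one reason a solver is the more practical choice.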
If ordered_as_unordered = TRUE, then ordered factors will be treated
like unordered factors and the same binary search problem will be solved for
both types of factors.
If ordered_as_unordered = FALSE (the default), then ordered factors
will be grouped via cutoff_fun as described in Details.
References
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667
See Also
Other functions to create MAR:
delete_MAR_1_to_x(), delete_MAR_censoring(), delete_MAR_rank()
Examples
ds <- data.frame(X = 1:20, Y = 101:120)
delete_MAR_one_group(ds, 0.2, "X", "Y")
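
A further sketch with one probability per column, using only the arguments documented above. The data set and the column name Z are made up for illustration, and the call is guarded so that it only runs if missMethods is installed.

```r
if (requireNamespace("missMethods", quietly = TRUE)) {
  ds2 <- data.frame(X = 1:20, Y = 101:120, Z = 201:220)
  # Missing values in X with p = 0.2 and in Z with p = 0.1,
  # both controlled by the column Y
  res <- missMethods::delete_MAR_one_group(
    ds2,
    p = c(0.2, 0.1),
    cols_mis = c("X", "Z"),
    cols_ctrl = c("Y", "Y")
  )
  colSums(is.na(res))  # number of missing values per column
}
```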