
choosem {clusterMI}

R Documentation

Graphical investigation for the number of datasets generated by multiple imputation

Description

For an object generated by the function clusterMI, the choosem function goes through the sequence of contributory partitions and computes the consensus partition at each step. The Rand index between successive consensus partitions is then plotted.

Usage

choosem(output, graph = TRUE, nnodes = 1)

Arguments

`output`	an output from the clusterMI function
`graph`	a logical value indicating whether the graph is plotted. By default `graph = TRUE`.
`nnodes`	number of CPU cores for parallel computing. By default `nnodes = 1`.

Details

The number of imputed datasets (m) should be sufficiently large to improve the partition accuracy. The choosem function can be used to check whether this number is suitable. For each p from 1 to m, the function computes the consensus partition using only the first p imputed datasets, which yields a sequence of m consensus partitions. The Rand index between successive partitions is then computed and reported in a graph. The Rand index measures the proximity between two partitions. If the Rand index between the last two consensus partitions of the sequence reaches its maximum value (1), the last imputed dataset does not modify the consensus partition. Consequently, the number of imputed datasets can be considered sufficiently large.
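
To make the proximity measure concrete, the following is a minimal base R sketch of the Rand index between two partitions, assuming the plain (unadjusted) version of the index. The helper `rand_index` is hypothetical and is not part of clusterMI; the implementation used internally by choosem may differ.

```r
# rand_index: hypothetical helper, NOT part of clusterMI.
# Computes the unadjusted Rand index between two partitions given as
# label vectors of equal length: the proportion of pairs of individuals
# on which both partitions agree (same cluster in both, or different
# clusters in both).
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b), length(a) >= 2)
  same_a <- outer(a, a, "==")    # TRUE where a pair shares a cluster in a
  same_b <- outer(b, b, "==")    # TRUE where a pair shares a cluster in b
  pairs <- upper.tri(same_a)     # keep each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

p1 <- c(1, 1, 2, 2, 3, 3)
p2 <- c(2, 2, 1, 1, 3, 3)  # same grouping as p1, labels permuted
p3 <- c(1, 1, 1, 2, 3, 3)  # third individual moved to another cluster

rand_index(p1, p2)  # 1: the index ignores cluster labels
rand_index(p1, p3)  # 0.8: 12 of the 15 pairs agree
```

A value of 1 therefore means that the two partitions group the individuals identically, which is the situation choosem looks for at the end of the sequence.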

Value

A list of two objects

`part`	an `m`-column matrix whose column p contains the consensus partition obtained using only the first p imputed datasets
`rand`	a vector of length `m`-1 giving the Rand index between successive consensus partitions
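
The relationship between the two components can be sketched as follows. This example does not call clusterMI: `part` is a hand-made stand-in with the documented shape (one column per consensus partition), and `rand_index` is a hypothetical helper implementing the unadjusted Rand index, which is an assumption about the index used.

```r
# Hand-made stand-in for the `part` component (NOT real choosem output):
# 6 individuals, m = 4 successive consensus partitions.
part <- cbind(
  c(1, 1, 2, 2, 3, 3),
  c(1, 1, 1, 2, 3, 3),
  c(1, 1, 2, 2, 3, 3),
  c(1, 1, 2, 2, 3, 3)
)

# Hypothetical helper: unadjusted Rand index between two label vectors.
rand_index <- function(a, b) {
  pairs <- upper.tri(diag(length(a)))  # each unordered pair once
  mean(outer(a, a, "==")[pairs] == outer(b, b, "==")[pairs])
}

# Rand index between column p and column p + 1, for p = 1, ..., m - 1.
m <- ncol(part)
rand <- vapply(seq_len(m - 1),
               function(p) rand_index(part[, p], part[, p + 1]),
               numeric(1))

length(rand)  # m - 1 values, one per pair of successive partitions

# The last value compares the consensus partitions built from m - 1 and
# m imputed datasets; a value of 1 suggests m is large enough.
stable <- isTRUE(all.equal(rand[m - 1], 1))
```

With real output, the same check could be applied to the last element of the `rand` component instead of recomputing it.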

References

Audigier, V. and Niang, N. (2022). Clustering with missing data: which equivalent for Rubin's rules? Advances in Data Analysis and Classification. <doi:10.1007/s11634-022-00519-1>

Examples

library(clusterMI)
data(wine)

set.seed(123456)
ref <- wine$cult
nb.clust <- 3
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)

# imputation
m <- 5 # number of imputed datasets; should be larger in practice
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)

# pooling
nnodes <- 2 # number of CPU cores for parallel computing
res.pool <- clusterMI(res.imp, instability = FALSE, nnodes = nnodes)

# Rand index between successive consensus partitions (plotted since graph = TRUE)
res.choosem <- choosem(res.pool)

[Package clusterMI version 1.2.1 Index]