R: Bicluster data with non-random missing values

biclustermd {biclustermd}

R Documentation

Bicluster data with non-random missing values

Description

Simultaneously partitions the rows and columns of a data matrix that contains non-random missing values.

Usage

biclustermd(
  data,
  row_clusters = floor(sqrt(nrow(data))),
  col_clusters = floor(sqrt(ncol(data))),
  miss_val = mean(data, na.rm = TRUE),
  miss_val_sd = 1,
  similarity = "Rand",
  row_min_num = floor(nrow(data)/row_clusters),
  col_min_num = floor(ncol(data)/col_clusters),
  row_num_to_move = 1,
  col_num_to_move = 1,
  row_shuffles = 1,
  col_shuffles = 1,
  max.iter = 100,
  verbose = FALSE
)
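
The defaults in the usage above are plain R expressions evaluated on `data`. A minimal sketch of what they evaluate to, assuming a hypothetical 10 x 20 matrix named `toy` (the matrix and variable names are illustrative and not part of the package):

```r
# Hypothetical 10 x 20 matrix with a few missing cells.
set.seed(1)
toy <- matrix(rnorm(10 * 20), nrow = 10, ncol = 20)
toy[c(2, 5), c(1, 3)] <- NA

# The same expressions as the defaults in the usage block.
row_clusters <- floor(sqrt(nrow(toy)))           # 3
col_clusters <- floor(sqrt(ncol(toy)))           # 4
miss_val     <- mean(toy, na.rm = TRUE)          # mean of the observed cells only
row_min_num  <- floor(nrow(toy) / row_clusters)  # 3
col_min_num  <- floor(ncol(toy) / col_clusters)  # 5
```

Because `row_min_num` and `col_min_num` depend on the cluster counts, changing `row_clusters` or `col_clusters` also changes these two defaults unless they are set explicitly.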

Arguments

`data`	Dataset to bicluster. Must be a numeric matrix containing only numbers and missing values. It should have row names and column names.
`row_clusters`	The number of clusters to partition the rows into. The default is `floor(sqrt(nrow(data)))`.
`col_clusters`	The number of clusters to partition the columns into. The default is `floor(sqrt(ncol(data)))`.
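`miss_val`	Value or function used to fill empty cells of the prototype matrix. If a number, each iteration fills empty cells with a draw from a normal distribution with mean `miss_val` and standard deviation `miss_val_sd`. The default is `mean(data, na.rm = TRUE)`, the mean of the observed values of `data`.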
`miss_val_sd`	Standard deviation of the normal distribution used to fill empty cells when `miss_val` is a number. Default is 1.
`similarity`	The metric used to compare two successive clusterings. Can be "Rand" (default), "HA" for the Hubert and Arabie adjusted Rand index, or "Jaccard". See RRand for details.
`row_min_num`	Minimum row prototype size in order to be eligible to be chosen when filling an empty row prototype. Default is `floor(nrow(data) / row_clusters)`.
`col_min_num`	Minimum column prototype size in order to be eligible to be chosen when filling an empty column prototype. Default is `floor(ncol(data) / col_clusters)`.
`row_num_to_move`	Number of rows to remove from the sampled prototype to put in the empty row prototype. Default is 1.
`col_num_to_move`	Number of columns to remove from the sampled prototype to put in the empty column prototype. Default is 1.
`row_shuffles`	Number of times to shuffle rows in each iteration. Default is 1.
`col_shuffles`	Number of times to shuffle columns in each iteration. Default is 1.
`max.iter`	Maximum number of iterations the algorithm is allowed to run. Default is 100.
`verbose`	Logical. If TRUE, progress is reported. Default is FALSE.
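
The `similarity` argument compares the partitions produced by two successive iterations. As an illustration of the idea only, here is a base-R sketch of the plain Rand index. The function name `rand_index` and the label vectors are hypothetical; this documentation points to RRand for the measures actually used, and this sketch is not the package's code.

```r
# Rand index between two cluster label vectors of equal length:
# the share of item pairs on which the two clusterings agree
# (both put the pair together, or both keep it apart).
rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b), length(labels_a) >= 2)
  same_a <- outer(labels_a, labels_a, "==")  # TRUE where a pair shares a cluster in a
  same_b <- outer(labels_b, labels_b, "==")  # TRUE where a pair shares a cluster in b
  pairs <- upper.tri(same_a)                 # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

previous <- c(1, 1, 2, 2, 3, 3)  # hypothetical row labels after one iteration
current  <- c(1, 1, 2, 3, 3, 3)  # hypothetical row labels after the next

rand_index(previous, previous)   # 1: identical partitions
rand_index(previous, current)    # 0.8: 12 of the 15 pairs agree
```

The index depends only on which items are grouped together, so relabelling the clusters (for example swapping labels 1 and 2) leaves it unchanged.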

Value

A list of class biclustermd:

`params`	a list of all arguments passed to the function, including defaults.
`data`	the input two-way table of data.
`P0`	the initial column partition matrix.
`Q0`	the initial row partition matrix.
`InitialSSE`	the SSE of the original partitioning.
`P`	the final column partition matrix.
`Q`	the final row partition matrix.
`SSE`	a matrix of class biclustermd_sse detailing the SSE recorded at the end of each iteration.
`Similarities`	a data frame of class biclustermd_sim detailing the value of row and column similarity measures recorded at the end of each iteration. Contains information for all three similarity measures. This carries an attribute `"used"` which provides the similarity measure used as the stopping condition for the algorithm.
`iteration`	the number of iterations the algorithm ran for, whether `max.iter` was reached or convergence was achieved.
`A`	the final prototype matrix which gives the average of each bicluster.
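
To make the relationship between the partition matrices, the prototype matrix, and the SSE concrete, here is a base-R sketch on hypothetical data. It assumes that `Q` and `P` are 0/1 indicator matrices with one row per data row (respectively data column) and one column per cluster, and that the SSE is the sum of squared deviations of observed cells from their bicluster mean. Both are assumptions made for illustration; the package's internal layout and exact SSE definition may differ.

```r
# Hypothetical 4 x 6 data matrix with missing cells.
X <- matrix(c( 1,  2, NA, 10, 11, 12,
               2, NA,  3, 11, NA, 13,
              20, 21, 22, NA, 31, 32,
              21, 22, NA, 30, 32, NA),
            nrow = 4, byrow = TRUE)

# Assumed layout: one row per item, one column per cluster, a single 1 per row.
Q <- cbind(c(1, 1, 0, 0), c(0, 0, 1, 1))              # 4 rows -> 2 row clusters
P <- cbind(c(1, 1, 1, 0, 0, 0), c(0, 0, 0, 1, 1, 1))  # 6 columns -> 2 column clusters

observed <- (!is.na(X)) * 1   # 1 where a value is present, 0 where it is missing
X0 <- X
X0[is.na(X0)] <- 0            # zero-fill so missing cells drop out of the sums

cell_sums   <- t(Q) %*% X0 %*% P        # sum of observed values per bicluster
cell_counts <- t(Q) %*% observed %*% P  # number of observed values per bicluster
A <- cell_sums / cell_counts            # bicluster means; NaN if a bicluster has no data

# Expand the means back to the shape of X and sum squared residuals
# over the observed cells only.
fitted <- Q %*% A %*% t(P)
sse <- sum((X - fitted)^2, na.rm = TRUE)
```

Here `A[1, 1]` is 2, the mean of the four observed values in the first two rows and first three columns. A bicluster with no observed values gives `NaN` in this sketch, which is the kind of empty prototype cell that `miss_val` is documented to fill.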

References

Li, J., Reisner, J., Pham, H., Olafsson, S., and Vardeman, S. (2020) Biclustering with Missing Data. Information Sciences, 510, 304–316.

Examples

data("synthetic")
# default parameters
bc <- biclustermd(synthetic)
bc
autoplot(bc)

# providing the true number of row and column clusters
bc <- biclustermd(synthetic, col_clusters = 3, row_clusters = 2)
bc
autoplot(bc)

# an example with the nycflights13::flights dataset
library(nycflights13)
data("flights")

library(dplyr)
library(tidyr)  # provides spread(), used below
flights_bcd <- flights %>%
  select(month, dest, arr_delay)

flights_bcd <- flights_bcd %>%
  group_by(month, dest) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  spread(dest, mean_arr_delay) %>%
  as.data.frame()

rownames(flights_bcd) <- flights_bcd$month
flights_bcd <- as.matrix(flights_bcd[, -1])

flights_bc <- biclustermd(data = flights_bcd, col_clusters = 6, row_clusters = 4,
                  row_min_num = 3, col_min_num = 5,
                  max.iter = 20, verbose = TRUE)
flights_bc

[Package biclustermd version 0.2.3 Index]
