generate_mild_df {mildsvm}    R Documentation

Generate mild_df using multivariate t and normal distributions.

Description

This function samples multiple instance distributional data (a mild_df object) where each row corresponds to a sample from a given instance distribution. Instance distributions can be multivariate t and normal, with mean and variance parameters that can be fixed or sampled based on prior parameters. These instances are grouped into bags and the bag labels follow the standard MI assumption.

Usage

generate_mild_df(
  nbag = 50,
  ninst = 4,
  nsample = 50,
  ncov = 10,
  nimp_pos = 1:ncov,
  nimp_neg = 1:ncov,
  positive_prob = 0.2,
  dist = c("mvt", "mvnormal", "mvnormal"),
  mean = list(rep(0, length(nimp_pos)), rep(0, length(nimp_neg)), 0),
  sd_of_mean = c(0.5, 0.5, 0.5),
  cov = list(diag(1, nrow = length(nimp_pos)), diag(1, nrow = length(nimp_neg)), 1),
  sample_cov = FALSE,
  df_wishart_cov = c(length(nimp_pos), length(nimp_neg), ncov - length(nimp_pos)),
  degree = c(3, NA, NA),
  positive_bag_prob = NULL,
  n_noise_inst = NULL,
  ...
)

Arguments

nbag

The number of bags (default 50).

ninst

The number of instances for each bag (default 4).

nsample

The number of samples for each instance (default 50).

ncov

The number of total covariates (default 10).

nimp_pos

An index of important covariates for positive instances (default 1:ncov).

nimp_neg

An index of important covariates for negative instances (default 1:ncov).

positive_prob

A numeric value between 0 and 1 indicating the probability of an instance being positive (default 0.2).

dist

A vector (length 3) of distributions for the positive, negative, and remaining instances, respectively. Each distribution can be either 'mvnormal' for multivariate normal or 'mvt' for multivariate Student's t.

mean

A list (length 3) of mean vectors for the positive, negative, and remaining distributions. mean[[1]] should match nimp_pos in length; mean[[2]] should match nimp_neg in length.

sd_of_mean

A vector (length 3) of standard deviations used when sampling the mean for the positive, negative, and remaining distributions, where the prior is given by mean. Use sd_of_mean = c(0, 0, 0) to keep the mean consistent across all instances.

cov

A list (length 3) of covariance matrices for the positive, negative, and remaining distributions. cov[[3]] should be an integer, since the dimension of the remaining features varies depending on whether the important distribution is positive or negative.

sample_cov

A logical value for whether to sample the covariance for each distribution. If FALSE (the default), each covariance is fixed at cov. If TRUE, cov acts as the prior and each covariance is sampled from a Wishart distribution with df_wishart_cov degrees of freedom, such that its expectation is cov.

df_wishart_cov

A vector (length 3) of degrees-of-freedom to use in the Wishart covariance matrix sampling.

degree

A vector (length 3) of degrees-of-freedom used when any element of dist is 'mvt'. degree[i] is ignored when dist[i] == 'mvnormal', in which case NA can be specified.

positive_bag_prob

A numeric value between 0 and 1 indicating the probability of a bag being positive. Must be specified jointly with n_noise_inst, in which case positive_prob is ignored. If NULL (the default), instance labels are sampled first according to positive_prob.

n_noise_inst

An integer indicating the number of negative instances in a positive bag. Must be specified jointly with positive_bag_prob. n_noise_inst should be less than ninst.

...

Arguments passed to or from other methods.

Details

The first step in using this function is to choose the number of bags, instances per bag, and samples per instance with the nbag, ninst, and nsample arguments. Next, choose the number of covariates ncov and decide how those covariates will differ between instances with positive and negative labels. Some covariates can be common to the positive and negative instances; we call their distribution the remainder distribution. Use nimp_pos and nimp_neg to specify the indices of the important (non-remainder) covariates in the distributions with positive and negative instance labels.

The number of positive and negative instances and bags is determined either by positive_prob or by the joint specification of positive_bag_prob and n_noise_inst. In the first case, instance labels are independent Bernoulli draws based on positive_prob, and bag labels are determined by the standard MI assumption (i.e. a bag is positive if any instance in the bag is positive). In the second case, bag labels are drawn independently as Bernoulli with positive_bag_prob chance of success. Each positive bag is given n_noise_inst instances with an instance label of 0, and the remaining instances have an instance label of 1.
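
The two labelling schemes described above can be sketched in base R as follows. This is an illustration only, not the package's internal code: the variable names inst_label_1, bag_label_1, inst_label_2, and bag_label_2 are hypothetical, and placing the negative instances first within a positive bag is an assumption made for the sketch.

```r
# Illustrative sketch of the two labelling schemes (not the package internals).
set.seed(1)
nbag <- 5
ninst <- 4

# Scheme 1: draw instance labels first, then apply the standard MI assumption.
positive_prob <- 0.2
inst_label_1 <- matrix(
  rbinom(nbag * ninst, size = 1, prob = positive_prob),
  nrow = nbag, ncol = ninst
)
# A bag is positive if any of its instances is positive.
bag_label_1 <- as.integer(rowSums(inst_label_1) > 0)

# Scheme 2: draw bag labels first, then fix the instance labels per bag.
positive_bag_prob <- 0.5
n_noise_inst <- 3  # must be less than ninst
bag_label_2 <- rbinom(nbag, size = 1, prob = positive_bag_prob)
inst_label_2 <- t(sapply(bag_label_2, function(y) {
  if (y == 1) {
    # Assumed ordering: n_noise_inst negatives, then the positives.
    c(rep(0, n_noise_inst), rep(1, ninst - n_noise_inst))
  } else {
    rep(0, ninst)
  }
}))
```

In the first scheme the share of positive bags is a by-product of positive_prob and ninst, whereas in the second scheme it is controlled directly through positive_bag_prob.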

The remaining arguments determine the distributions used for the positive, negative, and remaining features. Each argument is a vector or list of length 3 corresponding to these 3 groups. To create different distributions, the strategy is to first draw the mean parameter from Normal(mean, sd_of_mean * I) and the covariance parameter from Wishart(df_wishart_cov, cov), with expectation equal to cov. Then we can sample i.i.d. draws from the specified distribution (either multivariate normal or Student's t). To ensure that each instance distribution has the same mean, set sd_of_mean to 0. To ensure that each instance distribution has the same covariance, set sample_cov = FALSE.
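
The sampling strategy above can be sketched for a single instance in base R. This is a minimal sketch under stated assumptions, not the package's implementation: it covers only the multivariate normal case, treats sd_of_mean as a standard deviation, and obtains an expectation of cov by passing cov / df as the scale matrix to stats::rWishart. The names prior_mean, prior_cov, df_wishart, inst_mean, and inst_cov are hypothetical.

```r
# Illustrative sketch: hierarchical sampling for one instance distribution.
set.seed(2)
p <- 2                          # number of covariates in this group
nsample <- 50
prior_mean <- rep(0, p)
sd_of_mean <- 0.5
prior_cov <- diag(1, nrow = p)
df_wishart <- p                 # rWishart requires df >= p

# Step 1: draw the instance mean around the prior mean.
inst_mean <- rnorm(p, mean = prior_mean, sd = sd_of_mean)

# Step 2: draw the instance covariance. A Wishart(df, S) draw has
# expectation df * S, so S = prior_cov / df gives expectation prior_cov.
inst_cov <- rWishart(1, df = df_wishart, Sigma = prior_cov / df_wishart)[, , 1]

# Step 3: draw nsample i.i.d. rows from Normal(inst_mean, inst_cov)
# using the Cholesky factor (t(R) %*% R equals inst_cov).
z <- matrix(rnorm(nsample * p), nrow = nsample, ncol = p)
x <- sweep(z %*% chol(inst_cov), 2, inst_mean, "+")
```

Setting sd_of_mean to 0 in step 1, or skipping step 2 and using prior_cov directly, reproduces the fixed-mean and fixed-covariance cases mentioned above.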

The final data.frame will have nsample * nbag * ninst rows and ncov + 3 columns including the bag_label, bag_name, instance_name, and ncov sampled covariates.
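
As a quick check of the stated dimensions, the following sketch computes the expected row and column counts for the settings used in the Examples section and, if mildsvm is installed, prints the dimensions of a generated data set for comparison. The names expected_rows and expected_cols are hypothetical.

```r
# Expected dimensions for nbag = 7, ninst = 3, nsample = 20, ncov = 2.
nbag <- 7
ninst <- 3
nsample <- 20
ncov <- 2
expected_rows <- nsample * nbag * ninst   # 20 * 7 * 3 = 420
expected_cols <- ncov + 3                 # 2 covariates + 3 label/name columns = 5

# Optional comparison against the package, only run when it is available.
if (requireNamespace("mildsvm", quietly = TRUE)) {
  set.seed(8)
  mild_data <- mildsvm::generate_mild_df(
    nbag = nbag, ninst = ninst, nsample = nsample, ncov = ncov,
    nimp_pos = 1,
    dist = rep("mvnormal", 3),
    mean = list(rep(5, 1), rep(15, 2), 0)
  )
  print(dim(mild_data))
}
```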

Value

A mild_df object.

Author(s)

Yifei Liu, Sean Kent

Examples

set.seed(8)
mild_data <- generate_mild_df(nbag = 7, ninst = 3, nsample = 20,
                              ncov = 2,
                              nimp_pos = 1,
                              dist = rep("mvnormal", 3),
                              mean = list(
                                rep(5, 1),
                                rep(15, 2),
                                0
                              ))

library(dplyr)
distinct(mild_data, bag_label, bag_name, instance_name)
split(mild_data[, 4:5], mild_data$instance_name) %>%
  sapply(colMeans) %>%
  round(2) %>%
  t()

[Package mildsvm version 0.4.0 Index]