generate_mild_df {mildsvm} | R Documentation |
Generate mild_df using multivariate t and normal distributions.
Description
This function samples multiple instance distributional data (a mild_df
object) where each row corresponds to a sample from a given instance
distribution. Instance distributions can be multivariate t and normal, with
mean and variance parameters that can be fixed or sampled based on prior
parameters. These instances are grouped into bags and the bag labels
follow the standard MI assumption.
Usage
generate_mild_df(
  nbag = 50,
  ninst = 4,
  nsample = 50,
  ncov = 10,
  nimp_pos = 1:ncov,
  nimp_neg = 1:ncov,
  positive_prob = 0.2,
  dist = c("mvt", "mvnormal", "mvnormal"),
  mean = list(rep(0, length(nimp_pos)), rep(0, length(nimp_neg)), 0),
  sd_of_mean = c(0.5, 0.5, 0.5),
  cov = list(diag(1, nrow = length(nimp_pos)), diag(1, nrow = length(nimp_neg)), 1),
  sample_cov = FALSE,
  df_wishart_cov = c(length(nimp_pos), length(nimp_neg), ncov - length(nimp_pos)),
  degree = c(3, NA, NA),
  positive_bag_prob = NULL,
  n_noise_inst = NULL,
  ...
)
Arguments
nbag |
The number of bags (default 50). |
ninst |
The number of instances for each bag (default 4). |
nsample |
The number of samples for each instance (default 50). |
ncov |
The number of total covariates (default 10). |
nimp_pos |
An index of important covariates for positive instances
(default 1:ncov). |
nimp_neg |
An index of important covariates for negative instances
(default 1:ncov). |
positive_prob |
A numeric value between 0 and 1 indicating the probability of an instance being positive (default 0.2). |
dist |
A vector (length 3) of distributions for the positive, negative, and
remaining instances, respectively. Distributions can be one of
"mvnormal" (multivariate normal) or "mvt" (multivariate Student's t). |
mean |
A list (length 3) of mean vectors for the positive, negative, and
remaining distributions. |
sd_of_mean |
A vector (length 3) of standard deviations in sampling the
mean for positive, negative, and remaining distributions, where the prior
is given by Normal(mean, sd_of_mean * I). |
cov |
A list (length 3) of covariance matrices for the positive,
negative, and remaining distributions. |
sample_cov |
A logical value for whether to sample the covariance for
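each distribution. If FALSE (the default), each covariance is fixed at
cov; if TRUE, it is drawn from a Wishart distribution based on
df_wishart_cov and cov. |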
df_wishart_cov |
A vector (length 3) of degrees-of-freedom to use in the Wishart covariance matrix sampling. |
degree |
A vector (length 3) of degrees-of-freedom used when any of
dist is "mvt". |
positive_bag_prob |
A numeric value between 0 and 1 indicating the
probability of a bag being positive. Must be specified jointly with
n_noise_inst. |
n_noise_inst |
An integer indicating the number of negative instances in
a positive bag. Must be specified jointly with positive_bag_prob. |
... |
Arguments passed to or from other methods. |
Details
The first step in using this function is to choose the number of bags,
instances per bag, and samples per instance with the nbag, ninst, and
nsample arguments. Next, consider the number of covariates, ncov, and how
those covariates differ between instances with positive and negative
labels. Some covariates can be common to the positive and negative
instances; these follow what we call the remainder distribution. Use
nimp_pos and nimp_neg to specify the indices of the important
(non-remainder) covariates in the distributions with positive and negative
instance labels.
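As a rough illustration of the split described above, the following base R sketch separates important covariates from remainder covariates. It assumes that any covariate index not listed in nimp_pos (or nimp_neg) is treated as a remainder covariate; the variable names remainder_pos and remainder_neg are hypothetical and this is not the package's internal code.

```r
# Hypothetical illustration of the important/remainder split.
ncov <- 5
nimp_pos <- 1:2      # covariates that distinguish positive instances
nimp_neg <- c(2, 4)  # covariates that distinguish negative instances

# Assumption: covariates not listed as important follow the remainder
# distribution for that instance type.
remainder_pos <- setdiff(seq_len(ncov), nimp_pos)
remainder_neg <- setdiff(seq_len(ncov), nimp_neg)

print(remainder_pos)
print(remainder_neg)
```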
The number of positive and negative instances and bags is determined either
by positive_prob or by the joint specification of positive_bag_prob and
n_noise_inst. In the first case, instance labels are independent Bernoulli
draws with success probability positive_prob, and bag labels are determined
by the standard MI assumption (i.e. a bag is positive if any instance in
the bag is positive). In the second case, bag labels are drawn
independently as Bernoulli with success probability positive_bag_prob. Each
positive bag is given n_noise_inst instances with instance label 0, and the
remaining instances have instance label 1.
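The two labelling schemes can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not the package's implementation: the object names (inst_label, bag_label, inst_label2, bag_label2) are hypothetical, and the second case assumes that every instance in a negative bag has label 0, as the standard MI assumption requires.

```r
set.seed(1)
nbag <- 6
ninst <- 4

# Case 1: independent Bernoulli instance labels.
# A bag is positive if any of its instances is positive.
positive_prob <- 0.2
inst_label <- matrix(
  rbinom(nbag * ninst, size = 1, prob = positive_prob),
  nrow = nbag
)
bag_label <- as.integer(rowSums(inst_label) > 0)

# Case 2: Bernoulli bag labels. Each positive bag gets n_noise_inst
# instances labelled 0 and the rest labelled 1.
# Assumption: negative bags contain only instances labelled 0.
positive_bag_prob <- 0.5
n_noise_inst <- 1
bag_label2 <- rbinom(nbag, size = 1, prob = positive_bag_prob)
inst_label2 <- t(sapply(bag_label2, function(b) {
  if (b == 1) {
    c(rep(0, n_noise_inst), rep(1, ninst - n_noise_inst))
  } else {
    rep(0, ninst)
  }
}))
```

In both cases each row of the label matrix corresponds to one bag and each column to one instance within that bag.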
The remaining arguments determine the distributions used for the positive,
negative, and remaining features. Each argument is a vector or list of
length 3 corresponding to these three groups. To create different
instance distributions, the strategy is to first draw the mean parameter
from Normal(mean, sd_of_mean * I) and the covariance parameter from
Wishart(df_wishart_cov, cov), with expectation equal to cov. Then we can
sample i.i.d. draws from the specified distribution (either multivariate
normal or Student's t). To ensure that each instance distribution has the
same mean, set sd_of_mean to 0. To ensure that each instance distribution
has the same covariance, set sample_cov = FALSE.
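The sampling strategy above can be sketched for a single multivariate normal instance in base R. This is an illustrative sketch only, with several assumptions: sd_of_mean is treated as a per-coordinate standard deviation, the Wishart draw is divided by its degrees of freedom so that its expectation equals the supplied covariance, and the names p, mean0, cov0, and df are hypothetical. The package may implement these steps differently.

```r
set.seed(2)
p <- 2                     # number of covariates in this group
mean0 <- rep(0, p)         # prior mean
sd_of_mean <- 0.5          # prior standard deviation of the mean
cov0 <- diag(1, nrow = p)  # target expected covariance
df <- 4                    # Wishart degrees of freedom (must be >= p)
nsample <- 50

# Step 1: draw the instance mean around the prior mean.
mu <- rnorm(p, mean = mean0, sd = sd_of_mean)

# Step 2: draw the instance covariance from a Wishart distribution.
# E[Wishart(df, cov0)] = df * cov0, so dividing by df gives expectation cov0.
sigma <- stats::rWishart(1, df = df, Sigma = cov0)[, , 1] / df

# Step 3: draw i.i.d. samples from N(mu, sigma) via the Cholesky factor.
# chol() returns upper-triangular R with t(R) %*% R = sigma, so each row of
# z %*% R has covariance sigma.
z <- matrix(rnorm(nsample * p), nrow = nsample)
x <- sweep(z %*% chol(sigma), 2, mu, `+`)
```

Setting sd_of_mean to 0 in this sketch makes mu equal to mean0 for every instance, and skipping step 2 in favor of sigma <- cov0 mirrors sample_cov = FALSE.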
The final data.frame will have nsample * nbag * ninst rows and ncov + 3
columns, including the bag_label, bag_name, instance_name, and ncov
sampled covariates.
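For the default arguments, the expected dimensions work out as in the short sketch below. The variable names expected_rows and expected_cols are hypothetical, and the comparison against the package's output only runs if mildsvm happens to be installed.

```r
nbag <- 50
ninst <- 4
nsample <- 50
ncov <- 10

expected_rows <- nsample * nbag * ninst  # 50 * 50 * 4 = 10000
expected_cols <- ncov + 3                # 3 identifier columns + 10 covariates

# Optional comparison with the generated data, only if mildsvm is available.
if (requireNamespace("mildsvm", quietly = TRUE)) {
  mild_data <- mildsvm::generate_mild_df(
    nbag = nbag, ninst = ninst, nsample = nsample, ncov = ncov
  )
  print(dim(mild_data))
  print(c(expected_rows, expected_cols))
}
```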
Value
A mild_df
object.
Author(s)
Yifei Liu, Sean Kent
Examples
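set.seed(8)
# Generate 7 bags of 3 instances, each with 20 samples of 2 covariates
mild_data <- generate_mild_df(nbag = 7, ninst = 3, nsample = 20,
                              ncov = 2,
                              nimp_pos = 1,
                              dist = rep("mvnormal", 3),
                              mean = list(
                                rep(5, 1),
                                rep(15, 2),
                                0
                              ))

library(dplyr)
# One row per instance, showing its bag and the bag label
distinct(mild_data, bag_label, bag_name, instance_name)

# Per-instance means of the two covariates (columns 4 and 5)
split(mild_data[, 4:5], mild_data$instance_name) %>%
  sapply(colMeans) %>%
  round(2) %>%
  t()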