dataset {CVarE}    R Documentation

Generates test datasets.

Description

Provides the sample datasets M1-M7 used in the paper "Conditional Variance Estimation for Sufficient Dimension Reduction" by Lukas Fertl and Efstathia Bura. The general model is given by:

Y = g(B'X) + ε

Usage

dataset(name = "M1", n = NULL, p = 20, sd = 0.5, ...)

Arguments

name

One of "M1", "M2", "M3", "M4", "M5", "M6" or "M7". Alternatively, just the dataset number 1-7.

n

Number of samples. If NULL (the default), the default sample size of the selected model is used (see below).

p

Dimension of random variable X.

sd

Standard deviation of the error term ε.

...

Additional parameters used only by "M2" (namely pmix and lambda); see below.
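As an illustration of the call signature shown under Usage, the following sketch generates two datasets. It assumes the CVarE package is installed and is skipped otherwise; the names of the returned list elements are not assumed here, so the result is only inspected with str().

```r
# Sketch: calling dataset() with defaults and with the M2-only parameters.
ds <- NULL
if (requireNamespace("CVarE", quietly = TRUE)) {
  # Model M1 with the default sample size, dimension and noise level
  ds <- CVarE::dataset("M1")
  str(ds)

  # Model M2 with custom size and the extra mixture parameters
  ds2 <- CVarE::dataset(name = "M2", n = 200, p = 10, pmix = 0.5, lambda = 2)
  str(ds2)
}
```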

Value

List with elements

M1

The predictors are distributed as X ~ N_p(0, Σ) with Σ_ij = 0.5^|i - j| for i, j = 1, ..., p. The subspace dimension is k = 1, with defaults of n = 100 data points and p = 20. With b_1 = (1,1,1,1,1,1,0,...,0)' / sqrt(6), Y is given as

Y = cos(b_1'X) + ε

where ε follows a generalized normal distribution with location 0 and shape parameter 0.5, and the scale parameter is chosen such that Var(ε) = 0.5.
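The M1 description can be sketched in base R as follows. This is an illustrative reconstruction from the text above, not the package's internal code. The helper rgnorm_sketch is hypothetical; it relies on the fact that for a generalized normal variable with scale alpha and shape beta, |ε / alpha|^beta is Gamma(1 / beta, 1) distributed and Var(ε) = alpha^2 Γ(3 / beta) / Γ(1 / beta).

```r
# Sketch of model M1 (assumed reconstruction, not CVarE internals).
set.seed(1)
n <- 100; p <- 20

# Predictors: X ~ N_p(0, Sigma) with Sigma_ij = 0.5^|i - j|
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# Direction b_1 = (1,1,1,1,1,1,0,...,0)' / sqrt(6)
b1 <- c(rep(1, 6), rep(0, p - 6)) / sqrt(6)

# Hypothetical helper: generalized normal draws with location 0,
# shape `beta`, and scale chosen to match the requested variance.
rgnorm_sketch <- function(n, beta, variance) {
  alpha <- sqrt(variance * gamma(1 / beta) / gamma(3 / beta))
  sgn <- sample(c(-1, 1), n, replace = TRUE)
  sgn * alpha * rgamma(n, shape = 1 / beta)^(1 / beta)
}
eps <- rgnorm_sketch(n, beta = 0.5, variance = 0.5)

# Response: Y = cos(b_1'X) + eps
Y <- drop(cos(X %*% b1)) + eps
```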

M2

The predictors are distributed as X ~ Z 1_p λ + N_p(0, I_p) with Z ~ 2 Binom(pmix) - 1, where 1_p is the p-dimensional vector of ones. The subspace dimension is k = 1, with defaults of n = 100 data points and p = 20. With b_1 = (1,1,1,1,1,1,0,...,0)' / sqrt(6), Y is given as

Y = cos(b_1'X) + 0.5ε

where ε is standard normal. pmix defaults to 0.3 and lambda defaults to 1.
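A minimal base R sketch of the M2 mixture predictors, written from the description above rather than taken from the package. It reads Binom(pmix) as a single Bernoulli trial with success probability pmix, which is an assumption about the notation.

```r
# Sketch of model M2 (assumed reconstruction, not CVarE internals).
set.seed(2)
n <- 100; p <- 20
pmix <- 0.3; lambda <- 1

# Z takes the value +1 with probability pmix and -1 otherwise
Z <- 2 * rbinom(n, 1, pmix) - 1

# X = Z 1_p lambda + N_p(0, I_p): the shift lambda * Z[i] is added to
# every component of row i (a length-n vector recycles down the columns)
X <- matrix(rnorm(n * p), n, p) + lambda * Z

b1 <- c(rep(1, 6), rep(0, p - 6)) / sqrt(6)

# Response: Y = cos(b_1'X) + 0.5 * eps with standard normal eps
Y <- drop(cos(X %*% b1)) + 0.5 * rnorm(n)
```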

M3

The predictors are distributed as X ~ N_p(0, I_p). The subspace dimension is k = 1, with defaults of n = 100 data points and p = 20. With b_1 = (1,1,1,1,1,1,0,...,0)' / sqrt(6), Y is given as

Y = 2 log(|b_1'X| + 2) + 0.5ε

where ε is standard normal.
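For illustration, M3 can be simulated in a few lines of base R. This is a sketch based only on the formula above; the variable names are chosen here and are not part of the package.

```r
# Sketch of model M3 (assumed reconstruction, not CVarE internals).
set.seed(3)
n <- 100; p <- 20

# Predictors: independent standard normal components
X <- matrix(rnorm(n * p), n, p)
b1 <- c(rep(1, 6), rep(0, p - 6)) / sqrt(6)

# Noise-free part g(b_1'X) = 2 log(|b_1'X| + 2), then add 0.5 * eps
signal <- 2 * log(abs(drop(X %*% b1)) + 2)
Y <- signal + 0.5 * rnorm(n)
```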

M4

The predictors are distributed as X ~ N_p(0, Σ) with Σ_ij = 0.5^|i - j| for i, j = 1, ..., p. The subspace dimension is k = 2, with defaults of n = 100 data points and p = 20. With b_1 = (1,1,1,1,1,1,0,...,0)' / sqrt(6) and b_2 = (1,-1,1,-1,1,-1,0,...,0)' / sqrt(6), Y is given as

Y = (b_1'X) / (0.5 + (1.5 + b_2'X)^2) + 0.5ε

where ε is standard normal.
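The two-dimensional model M4 can be sketched in base R as below. The code is written from the description above and is not the package's implementation; B is a name introduced here for the p x 2 matrix with columns b_1 and b_2.

```r
# Sketch of model M4 (assumed reconstruction, not CVarE internals).
set.seed(4)
n <- 100; p <- 20

# Predictors with covariance Sigma_ij = 0.5^|i - j|
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# Two orthonormal directions spanning the reduction subspace
b1 <- c(1, 1, 1, 1, 1, 1, rep(0, p - 6)) / sqrt(6)
b2 <- c(1, -1, 1, -1, 1, -1, rep(0, p - 6)) / sqrt(6)
B <- cbind(b1, b2)

# Response: Y = (b_1'X) / (0.5 + (1.5 + b_2'X)^2) + 0.5 * eps
XB <- X %*% B
Y <- XB[, 1] / (0.5 + (1.5 + XB[, 2])^2) + 0.5 * rnorm(n)
```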

M5

The predictors are distributed as X ~ U([0, 1]^p), where U([0, 1]^p) is the uniform distribution with independent components on the p-dimensional hypercube. The subspace dimension is k = 2, with defaults of n = 200 data points and p = 20. With b_1 = (1,1,1,1,1,1,0,...,0)' / sqrt(6) and b_2 = (1,-1,1,-1,1,-1,0,...,0)' / sqrt(6), Y is given as

Y = cos(π b_1'X)(b_2'X + 1)^2 + 0.5ε

where ε is standard normal.
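A base R sketch of M5 with uniform predictors, reconstructed from the formula above and not taken from the package:

```r
# Sketch of model M5 (assumed reconstruction, not CVarE internals).
set.seed(5)
n <- 200; p <- 20

# Predictors: independent U(0, 1) components on the hypercube [0, 1]^p
X <- matrix(runif(n * p), n, p)

b1 <- c(1, 1, 1, 1, 1, 1, rep(0, p - 6)) / sqrt(6)
b2 <- c(1, -1, 1, -1, 1, -1, rep(0, p - 6)) / sqrt(6)
XB <- X %*% cbind(b1, b2)

# Response: Y = cos(pi * b_1'X) * (b_2'X + 1)^2 + 0.5 * eps
Y <- cos(pi * XB[, 1]) * (XB[, 2] + 1)^2 + 0.5 * rnorm(n)
```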

M6

The predictors are distributed as X ~ N_p(0, I_p). The subspace dimension is k = 3, with defaults of n = 200 data points and p = 20. The directions are b_1 = e_1, b_2 = e_2, and b_3 = e_p, where e_j is the j-th unit vector in the p-dimensional space. Y is given as

Y = (b_1'X)^2 + (b_2'X)^2 + (b_3'X)^2 + 0.5ε

where ε is standard normal.
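Because the M6 directions are unit vectors, projecting onto them simply selects the first, second and last predictor. The sketch below, written from the description above and not taken from the package, makes this explicit.

```r
# Sketch of model M6 (assumed reconstruction, not CVarE internals).
set.seed(6)
n <- 200; p <- 20

X <- matrix(rnorm(n * p), n, p)

# B = (e_1, e_2, e_p): columns 1, 2 and p of the identity matrix
B <- diag(p)[, c(1, 2, p)]

# X %*% B picks out columns 1, 2 and p of X
XB <- X %*% B

# Response: sum of the three squared projections plus 0.5 * eps
Y <- rowSums(XB^2) + 0.5 * rnorm(n)
```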

M7

The predictors are distributed as X ~ t_3(I_p), where t_3(I_p) is the standard multivariate t-distribution with 3 degrees of freedom. The subspace dimension is k = 4, with defaults of n = 200 data points and p = 20. The directions are b_1 = e_1, b_2 = e_2, b_3 = e_3, and b_4 = e_p, where e_j is the j-th unit vector in the p-dimensional space. Y is given as

Y = (b_1'X)(b_2'X)^2 + (b_3'X)(b_4'X) + 0.5ε

where ε follows a generalized normal distribution with location 0 and shape parameter 1, and the scale parameter is chosen such that Var(ε) = 0.25.
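A base R sketch of M7, following the formula and noise description exactly as written above; it is not the package's implementation. Two standard facts are assumed: a multivariate t vector with 3 degrees of freedom can be drawn as a standard normal vector divided by sqrt(W / 3) with W ~ χ²(3), and a generalized normal distribution with shape 1 is a Laplace distribution with variance 2 alpha^2.

```r
# Sketch of model M7 (assumed reconstruction, not CVarE internals).
set.seed(7)
n <- 200; p <- 20

# Multivariate t_3(I_p): each row is a normal vector scaled by one
# common chi-square based factor (the length-n vector recycles by row)
W <- rchisq(n, df = 3)
X <- matrix(rnorm(n * p), n, p) / sqrt(W / 3)

# B = (e_1, e_2, e_3, e_p)
XB <- X %*% diag(p)[, c(1, 2, 3, p)]

# Laplace noise (generalized normal, shape 1) with Var(eps) = 0.25,
# i.e. scale alpha = sqrt(0.25 / 2)
alpha <- sqrt(0.25 / 2)
eps <- sample(c(-1, 1), n, replace = TRUE) * alpha * rexp(n)

# Response: Y = (b_1'X)(b_2'X)^2 + (b_3'X)(b_4'X) + 0.5 * eps
Y <- XB[, 1] * XB[, 2]^2 + XB[, 3] * XB[, 4] + 0.5 * eps
```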

References

Fertl, L. and Bura, E. (2021) "Conditional Variance Estimation for Sufficient Dimension Reduction" <arXiv:2102.08782>


[Package CVarE version 1.1 Index]