Inddist {PSinference}    R Documentation

Independence Empirical Distribution

Description

This function calculates the empirical distribution of the pivotal random variable used to perform inferential procedures and to test the independence of two subsets of variables. It is based on a released single synthetic dataset generated under Plug-in Sampling and assumes that the original dataset is normally distributed.

Usage

Inddist(part, nsample, pvariates, iterations)

Arguments

part

Number of partitions.

nsample

Sample size.

pvariates

Number of variables.

iterations

Number of iterations for simulating values from the distribution and finding the quantiles. Default is 10000.

Details

We define

$$T_3^\star = \frac{|\boldsymbol{S}^{\star}|}{|\boldsymbol{S}^{\star}_{11}||\boldsymbol{S}^{\star}_{22}|}$$

where $\boldsymbol{S}^\star = \sum_{i=1}^n (v_i - \bar{v})(v_i - \bar{v})^{\top}$, $v_i$ is the $i$th observation of the synthetic dataset, and $\boldsymbol{S}^\star$ is partitioned as

$$\boldsymbol{S}^{\star}=\left[\begin{array}{ll} \boldsymbol{S}^{\star}_{11} & \boldsymbol{S}^{\star}_{12}\\ \boldsymbol{S}^{\star}_{21} & \boldsymbol{S}^{\star}_{22} \end{array}\right].$$
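As a side remark, the following identity may help in reading the statistic. It is a standard consequence of the Schur complement formula for block determinants, not something stated in the package documentation. The symbols $p_1$ and $p_2 = p - p_1$ (the sizes of the two subsets) and $r_i$ (the sample canonical correlations between the two subsets) are notation assumed here for illustration, and $\boldsymbol{S}^\star$ is taken to be positive definite.

```latex
\begin{aligned}
|\boldsymbol{S}^{\star}|
  &= |\boldsymbol{S}^{\star}_{11}|\,
     \left|\boldsymbol{S}^{\star}_{22}
       - \boldsymbol{S}^{\star}_{21}(\boldsymbol{S}^{\star}_{11})^{-1}\boldsymbol{S}^{\star}_{12}\right|, \\
T_3^\star
  &= \frac{\left|\boldsymbol{S}^{\star}_{22}
       - \boldsymbol{S}^{\star}_{21}(\boldsymbol{S}^{\star}_{11})^{-1}\boldsymbol{S}^{\star}_{12}\right|}
          {|\boldsymbol{S}^{\star}_{22}|}
   = \left|\mathbf{I}_{p_2}
       - (\boldsymbol{S}^{\star}_{22})^{-1}\boldsymbol{S}^{\star}_{21}
         (\boldsymbol{S}^{\star}_{11})^{-1}\boldsymbol{S}^{\star}_{12}\right|
   = \prod_{i=1}^{\min(p_1,\,p_2)} \left(1 - r_i^2\right).
\end{aligned}
```

Under these assumptions $0 < T_3^\star \le 1$, with $T_3^\star = 1$ exactly when $\boldsymbol{S}^{\star}_{12} = \boldsymbol{0}$, so values close to one are consistent with independence and small values point away from it.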

Under the assumption that $\boldsymbol{\Sigma}_{12} = \boldsymbol{0}$, $T_3^\star$ is stochastically equivalent to

$$\frac{|\boldsymbol{\Omega}|}{|\boldsymbol{\Omega}_{11}||\boldsymbol{\Omega}_{22}|}$$

where $\boldsymbol{\Omega} \sim \mathcal{W}_p(n-1, \frac{\boldsymbol{W}}{n-1})$, $\boldsymbol{W} \sim \mathcal{W}_p(n-1, \mathbf{I}_p)$ and $\boldsymbol{\Omega}$ is partitioned in the same way as $\boldsymbol{S}^{\star}$. To test $\mathcal{H}_0: \boldsymbol{\Sigma}_{12} = \boldsymbol{0}$, compute the observed value of $T_{3}^\star$, denoted $\widetilde{T_{3}^\star}$, and reject the null hypothesis at significance level $\alpha$ if $\widetilde{T_{3}^\star}<t^\star_{3,\alpha}$, where $t^\star_{3,\gamma}$ is the $\gamma$th quantile of $T_3^\star$.
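The stochastic representation above can be simulated directly with base R. The following is a minimal sketch, not the package's actual implementation: the helper `sim_t3_null()`, its argument `p1` (the assumed size of the first subset), and the observed value `t3_obs_demo` are hypothetical names and values introduced only for illustration. The only library call it relies on is `stats::rWishart()`.

```r
# Hypothetical helper (not part of PSinference): draw values of
# |Omega| / (|Omega_11| |Omega_22|) under H0: Sigma_12 = 0.
sim_t3_null <- function(p1, n, p, iterations = 1000) {
  stopifnot(p1 >= 1, p1 < p, n - 1 >= p)
  idx1 <- seq_len(p1)        # variables in the first subset (assumed split)
  idx2 <- (p1 + 1):p         # variables in the second subset
  vapply(seq_len(iterations), function(i) {
    # W ~ W_p(n - 1, I_p)
    W <- stats::rWishart(1, df = n - 1, Sigma = diag(p))[, , 1]
    # Omega | W ~ W_p(n - 1, W / (n - 1))
    Omega <- stats::rWishart(1, df = n - 1, Sigma = W / (n - 1))[, , 1]
    det(Omega) /
      (det(Omega[idx1, idx1, drop = FALSE]) *
       det(Omega[idx2, idx2, drop = FALSE]))
  }, numeric(1))
}

set.seed(1)
t3_sim <- sim_t3_null(p1 = 2, n = 100, p = 4, iterations = 500)

# Empirical alpha-quantile used as the critical value
alpha <- 0.05
q_alpha <- unname(quantile(t3_sim, alpha))

# Decision rule with a made-up observed value (illustration only)
t3_obs_demo <- 0.97
reject_demo <- t3_obs_demo < q_alpha
```

Using more iterations gives a more stable quantile estimate at the cost of run time; the 500 draws here are only meant to keep the sketch fast.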

Value

A numeric vector of length iterations containing the simulated values of the empirical distribution.

References

Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.

Examples

# generate original data with two independent subsets of variables
library(MASS)         # mvrnorm
library(PSinference)  # simSynthData, partition, Inddist
n_sample = 100
p = 4
mu <- c(1,2,3,4)
Sigma = matrix(c(1,   0.5,   0,     0,
                 0.5,   2,   0,     0,
                 0,     0,   3,   0.2,
                 0,     0,   0.2,   4), nrow = 4, ncol = 4, byrow = TRUE)
df = mvrnorm(n_sample, mu = mu, Sigma = Sigma)
# generate synthetic data
df_s = simSynthData(df)

# Decompose Sstar into 4 blocks
part = 2

# cov() equals Sstar divided by (n - 1); this factor cancels in the ratio T3_star
Sstar = cov(df_s)
blocks = partition(Sstar, nrows = part, ncol = part)
Sstar_11 = blocks[[1]]
Sstar_12 = blocks[[2]]
Sstar_21 = blocks[[3]]
Sstar_22 = blocks[[4]]

# Compute observed T3_star
T3_obs = det(Sstar)/(det(Sstar_11)*det(Sstar_22))

alpha = 0.05

# collect the quantile from the distribution assuming independence between the two subsets
T3 <- Inddist(part = part, nsample = n_sample, pvariates = p, iterations = 10000)
q5 <- quantile(T3, alpha)

T3_obs < q5 # FALSE means that we don't have statistical evidence to reject independence
print(T3_obs)
print(q5)
# Note that the observed value T3_obs is close to one, as expected under independence

[Package PSinference version 0.1.0 Index]