Inddist {PSinference}    R Documentation
Independence Empirical Distribution
Description
This function simulates the empirical distribution of the pivotal random variable used for inference on the independence of two subsets of variables. The inference is based on released singly imputed synthetic data generated under Plug-in Sampling, assuming that the original dataset is normally distributed.
Usage
Inddist(part, nsample, pvariates, iterations)
Arguments
part          Number of partitions.
nsample       Sample size.
pvariates     Number of variables.
iterations    Number of iterations for simulating values from the distribution and finding the quantiles. Default is
Details
We define
$$T_3^\star = \frac{|\boldsymbol{S}^{\star}|}{|\boldsymbol{S}^{\star}_{11}||\boldsymbol{S}^{\star}_{22}|},$$
where $\boldsymbol{S}^\star = \sum_{i=1}^n (v_i - \bar{v})(v_i - \bar{v})^{\top}$, $v_i$ is the $i$th observation of the synthetic dataset, and $\boldsymbol{S}^\star$ is partitioned as
$$\boldsymbol{S}^{\star}=\left[\begin{array}{ll}
\boldsymbol{S}^{\star}_{11} & \boldsymbol{S}^{\star}_{12}\\
\boldsymbol{S}^{\star}_{21} & \boldsymbol{S}^{\star}_{22}
\end{array}\right].$$
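The definition above can be sketched in base R, without calling any PSinference function. The helper name `t3_stat`, its argument `p1` (the number of variables in the first subset) and the random matrix standing in for a synthetic dataset are assumptions made for illustration only. The sketch also checks that the statistic does not change when the sum-of-squares matrix is replaced by the sample covariance, because the scale factor cancels between numerator and denominator.

```r
# Hypothetical helper (not part of PSinference): T3 for a p x p matrix S
# whose first p1 variables form the first subset.
t3_stat <- function(S, p1) {
  p <- ncol(S)
  stopifnot(p1 >= 1, p1 < p)
  idx1 <- seq_len(p1)       # indices of the first subset
  idx2 <- seq(p1 + 1, p)    # indices of the second subset
  det(S) / (det(S[idx1, idx1, drop = FALSE]) *
            det(S[idx2, idx2, drop = FALSE]))
}

set.seed(1)
v  <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)  # stand-in for a synthetic dataset
vc <- sweep(v, 2, colMeans(v))                    # centre each column
S_star <- crossprod(vc)                           # sum of (v_i - vbar)(v_i - vbar)^T

t3_ss  <- t3_stat(S_star, p1 = 2)   # from the sum-of-squares matrix
t3_cov <- t3_stat(cov(v), p1 = 2)   # from cov(v) = S_star / (n - 1)
```

Both calls should return the same value, which lies between 0 and 1 for a positive definite matrix.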
Under the assumption that $\boldsymbol{\Sigma}_{12} = \boldsymbol{0}$, $T_3^\star$ is stochastically equivalent to
$$\frac{|\boldsymbol{\Omega}|}{|\boldsymbol{\Omega}_{11}||\boldsymbol{\Omega}_{22}|},$$
where $\boldsymbol{\Omega} \sim \mathcal{W}_p(n-1, \frac{\boldsymbol{W}}{n-1})$, $\boldsymbol{W} \sim \mathcal{W}_p(n-1, \mathbf{I}_p)$, and $\boldsymbol{\Omega}$ is partitioned in the same way as $\boldsymbol{S}^{\star}$.
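A minimal sketch of how values could be drawn from this distribution, using only `stats::rWishart` from base R. This is an illustration of the stochastic representation stated above and not necessarily how `Inddist` is implemented; the helper name `sim_t3_null` and its argument `p1` (size of the first subset) are hypothetical.

```r
# Hypothetical helper (not the package's Inddist()): simulate the ratio
# |Omega| / (|Omega_11| |Omega_22|) under Sigma_12 = 0.
sim_t3_null <- function(p1, n, p, iterations = 1000) {
  stopifnot(p1 >= 1, p1 < p, n > p)
  idx1 <- seq_len(p1)
  idx2 <- seq(p1 + 1, p)
  vapply(seq_len(iterations), function(i) {
    # W ~ Wishart_p(n - 1, I_p)
    W <- rWishart(1, df = n - 1, Sigma = diag(p))[, , 1]
    # Omega ~ Wishart_p(n - 1, W / (n - 1)), given W
    Omega <- rWishart(1, df = n - 1, Sigma = W / (n - 1))[, , 1]
    det(Omega) / (det(Omega[idx1, idx1, drop = FALSE]) *
                  det(Omega[idx2, idx2, drop = FALSE]))
  }, numeric(1))
}

set.seed(123)
t3_null <- sim_t3_null(p1 = 2, n = 100, p = 4, iterations = 500)
```

The result is a numeric vector with one simulated value per iteration, which can be passed to `quantile()` to obtain empirical percentiles.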
To test $\mathcal{H}_0: \boldsymbol{\Sigma}_{12} = \boldsymbol{0}$, compute the value of $T_{3}^\star$ from the observed values, denoted $\widetilde{T_{3}^\star}$, and reject the null hypothesis at significance level $\alpha$ if $\widetilde{T_{3}^\star}<t^\star_{3,\alpha}$, where $t^\star_{3,\gamma}$ is the $\gamma$th percentile of $T_3^\star$.
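The rejection rule can be written as a short base-R sketch. The helper name `reject_independence` is hypothetical, and the vector of null values below is made up purely to show the mechanics; in practice it would be the vector of simulated values of the pivot. The empirical lower-tail p-value is an extra convenience that this page does not describe.

```r
# Hypothetical helper: apply the lower-tail rejection rule to an observed
# statistic, given simulated values of the pivot under independence.
reject_independence <- function(t3_obs, t3_null, alpha = 0.05) {
  crit <- unname(quantile(t3_null, probs = alpha))  # empirical alpha-th percentile
  list(critical_value = crit,
       p_value = mean(t3_null <= t3_obs),           # share of null values at or below t3_obs
       reject = t3_obs < crit)
}

# Toy illustration with made-up null values (not simulated from the pivot)
t3_null_toy <- seq(0.01, 1, by = 0.01)
res_small <- reject_independence(t3_obs = 0.025, t3_null = t3_null_toy)
res_large <- reject_independence(t3_obs = 0.9,   t3_null = t3_null_toy)
```

Small observed values lead to rejection because strong dependence between the two subsets pushes the determinant ratio towards zero, while values near one are consistent with independence.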
Value
A vector of length iterations containing the simulated values of the empirical distribution.
References
Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.
Examples
# generate original data with two independent subsets of variables
library(MASS)         # provides mvrnorm()
library(PSinference)  # provides simSynthData(), partition() and Inddist()
n_sample = 100
p = 4
mu <- c(1,2,3,4)
Sigma = matrix(c(1,   0.5, 0,   0,
                 0.5, 2,   0,   0,
                 0,   0,   3,   0.2,
                 0,   0,   0.2, 4), nrow = 4, ncol = 4, byrow = TRUE)
df = mvrnorm(n_sample, mu = mu, Sigma = Sigma)
# generate synthetic data
df_s = simSynthData(df)
# Decompose Sstar into 4 blocks (partition() is called once and its result reused)
part = 2
Sstar = cov(df_s)  # equals S*/(n-1); the scale factor cancels in the T3 ratio
Sstar_parts = partition(Sstar, nrows = part, ncol = part)
Sstar_11 = Sstar_parts[[1]]
Sstar_12 = Sstar_parts[[2]]
Sstar_21 = Sstar_parts[[3]]
Sstar_22 = Sstar_parts[[4]]
#Compute observed T3_star
T3_obs = det(Sstar)/(det(Sstar_11)*det(Sstar_22))
alpha = 0.05
# collect the quantile from the distribution assuming independence between the two subsets
T3 <- Inddist(part = part, nsample = n_sample, pvariates = p, iterations = 10000)
q5 <- quantile(T3, alpha)
T3_obs < q5 # FALSE means that we do not have statistical evidence to reject independence
print(T3_obs)
print(q5)
# Note that the value of the observed T3_obs is close to one as expected