GVdist {PSinference}    R Documentation

Generalized Variance Empirical Distribution

Description

This function calculates the empirical distribution of the pivotal random variable used to perform inference on the Generalized Variance from a released Single Synthetic dataset generated under Plug-in Sampling, assuming that the original data are multivariate normally distributed.

Usage

GVdist(nsample, pvariates, iterations = 10000)

Arguments

nsample

Sample size.

pvariates

Number of variables.

iterations

Number of iterations for simulating values from the distribution and finding the quantiles. Default is 10000.

Details

We define

T_1^\star = (n-1)^p\frac{|\boldsymbol{S}^\star|}{|\boldsymbol{\Sigma}|},

where \boldsymbol{S}^\star = \sum_{i=1}^n (v_i - \bar{v})(v_i - \bar{v})^{\top}, \boldsymbol{\Sigma} is the population covariance matrix, v_i is the i-th observation of the synthetic dataset and \bar{v} is the synthetic sample mean. The distribution of T_1^\star is stochastically equivalent to

\prod_{i=1}^p \chi_{n-i}^2 \prod_{i=1}^p \chi_{n-i}^2

where \chi_{n-i}^2 are all independent chi-square random variables. The (1-\alpha) level confidence interval for |\boldsymbol{\Sigma}| is given by

\left(\frac{(n-1)^p|\tilde{\boldsymbol{S}}^\star|}{t^\star_{1,1-\alpha/2}}, \frac{(n-1)^p|\tilde{\boldsymbol{S}}^\star|}{t^\star_{1,\alpha/2}} \right)

where \tilde{\boldsymbol{S}}^\star is the observed value of \boldsymbol{S}^\star and t^\star_{1,\gamma} is the \gamma-th quantile of T_1^\star.
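The reference distribution of the pivot can be approximated directly from chi-square draws. The following is a minimal base-R sketch, assuming that each of the two products runs over i = 1, ..., p and that all 2p chi-square variables are independent. The helper `simulate_gv_pivot` is hypothetical and is not claimed to be how GVdist is implemented in the package.

```r
# Hypothetical helper (NOT part of PSinference): simulate the product of
# 2p independent chi-square variables with degrees of freedom
# n - 1, ..., n - p, each degree of freedom appearing twice.
simulate_gv_pivot <- function(n, p, iterations = 10000) {
  stopifnot(n > p, p >= 1, iterations >= 1)
  dfs <- n - seq_len(p)  # n - 1, ..., n - p

  draw_product <- function() {
    # iterations x p matrix; column i holds chi-square draws with dfs[i]
    draws <- matrix(rchisq(iterations * p, df = rep(dfs, each = iterations)),
                    nrow = iterations, ncol = p)
    exp(rowSums(log(draws)))  # row-wise product, computed on the log scale
  }

  # Two independent products of p chi-square variables each
  draw_product() * draw_product()
}

set.seed(1)
pivot <- simulate_gv_pivot(n = 100, p = 4, iterations = 10000)
quantile(pivot, c(0.025, 0.975))  # empirical quantiles used in the interval
```

Under these assumptions the mean of the draws should be close to (\prod_{i=1}^p (n-i))^2, which gives a quick sanity check on the simulation.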

Value

A numeric vector of length iterations containing the simulated values of the empirical distribution.

References

Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.

Examples


# Original data creation
library(MASS)
library(PSinference)
mu <- c(1, 2, 3, 4)
Sigma <- matrix(c(1, 0.5, 0.5, 0.5,
                  0.5, 1, 0.5, 0.5,
                  0.5, 0.5, 1, 0.5,
                  0.5, 0.5, 0.5, 1), nrow = 4, ncol = 4, byrow = TRUE)
seed <- 1
set.seed(seed)  # make the simulated data reproducible
n_sample <- 100
# Create original simulated dataset
df <- mvrnorm(n_sample, mu = mu, Sigma = Sigma)

# Create the synthetic dataset

df_s <- simSynthData(df)


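# Gather the 0.025 and 0.975 quantiles and construct a confidence interval
# for the generalized variance |Sigma|
# Check that |Sigma| lies inside the interval
p <- dim(df_s)[2]

# Avoid the name T, which masks the shorthand for TRUE
gv_dist <- GVdist(n_sample, p, 10000)
q975 <- quantile(gv_dist, 0.975)
q025 <- quantile(gv_dist, 0.025)

# Observed value of S*: sum of squares and cross-products of the synthetic data
S_star <- cov(df_s) * (n_sample - 1)

left <- (n_sample - 1)^p * det(S_star) / q975
right <- (n_sample - 1)^p * det(S_star) / q025

cat(left, right, '\n')
print(det(Sigma))
# The population generalized variance det(Sigma) is expected to lie inside the interval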
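The interval construction used in the example can be wrapped in a small function. This is a sketch only: `gv_confint` is a hypothetical helper, not a function exported by PSinference, and it assumes the pivot draws are supplied as a numeric vector and that the interval has the (n-1)^p |S*| / quantile form given in Details.

```r
# Hypothetical helper (NOT part of PSinference): confidence interval for the
# generalized variance from a synthetic dataset and a vector of pivot draws.
gv_confint <- function(synth, pivot_draws, alpha = 0.05) {
  synth <- as.matrix(synth)
  n <- nrow(synth)
  p <- ncol(synth)

  # Observed S*: sum of squares and cross-products about the sample mean
  S_star <- cov(synth) * (n - 1)

  # Lower and upper empirical quantiles of the pivot
  q <- quantile(pivot_draws, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)

  scaled <- (n - 1)^p * det(S_star)
  # Dividing by the larger quantile gives the lower limit, and vice versa
  c(lower = scaled / q[2], upper = scaled / q[1])
}
```

Passing the synthetic dataset and the vector returned by GVdist to this helper would be expected to reproduce the two limits computed in the example.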


[Package PSinference version 0.1.0 Index]