R: GEOmetric Density Estimation.

rgeode {RGeode}

R Documentation

GEOmetric Density Estimation.

Description

Selects the principal directions of the data and performs inference on them. GEODE can also handle missing data.

Usage

rgeode(Y, d = 6, burn = 1000, its = 2000, tol = 0.01, atau = 1/20,
  asigma = 1/2, bsigma = 1/2, starttime = NULL, stoptime = NULL,
  fast = TRUE, c0 = -1, c1 = -0.005)

Arguments

`Y`	array_like A real input matrix (or data frame) of data, with dimensions `(n, D)`.
`d`	int, optional Conservative upper bound for the true (intrinsic) dimension of the data, which is assumed to be smaller than `d`.
`burn`	int, optional Number of burn-in iterations of the Gibbs sampler. It is also the stopping time after which the principal axes are no longer selected.
`its`	int, optional Number of iterations performed after the burn-in.
`tol`	double, optional Threshold for adaptively removing redundant dimensions. It is compared with the ratio `\frac{\alpha_j^2(t)}{\max_i \alpha_i^2(t)}`.
`atau`	double, optional The parameter `a_\tau` of the truncated Exponential (the prior for `\tau_j`).
`asigma`	double, optional The shape parameter `a_\sigma` of the truncated Gamma (the prior for `\sigma^2`).
`bsigma`	double, optional The rate parameter `b_\sigma` of the truncated Gamma (the prior for `\sigma^2`).
`starttime`	int, optional Starting time for adaptive pruning. It must be less than the number of burn-in iterations.
`stoptime`	int, optional Stopping time for adaptive pruning. It must be less than the number of burn-in iterations.
`fast`	bool, optional If `TRUE`, the fast rank-d SVD is used. Otherwise the classical SVD is used.
`c0`	double, optional Additive constant for the exponent of the pruning step.
`c1`	double, optional Multiplicative constant for the exponent of the pruning step.
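
As an illustration of how `tol`, `c0` and `c1` might interact, the sketch below applies the ratio test from the `tol` entry to a made-up vector of `alpha_j^2` values. The vector `alpha2` and the function `prune_prob` are hypothetical names, and the form `exp(c0 + c1 * t)` for the pruning probability is an assumption inferred from the argument descriptions, not taken from the package source.

```r
# Hypothetical values standing in for alpha_j^2(t) at some iteration t.
alpha2 <- c(40, 12, 0.3, 0.05)
tol <- 0.01

# Ratio of each alpha_j^2 to the largest one, as in the `tol` entry.
ratio <- alpha2 / max(alpha2)

# Directions whose ratio falls below tol are candidates for removal.
redundant <- which(ratio < tol)

# ASSUMPTION: pruning at iteration t is attempted with probability
# exp(c0 + c1 * t), so that c0 is additive and c1 multiplicative
# in the exponent. With c1 < 0 the probability decays over time.
c0 <- -1
c1 <- -0.005
prune_prob <- function(t) exp(c0 + c1 * t)

print(redundant)
print(prune_prob(c(0, 100, 500)))
```

With the default negative `c1`, pruning under this assumed form becomes rarer as the sampler proceeds.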

Details

GEOmetric Density Estimation (rgeode) is a fast algorithm for performing inference on normally distributed data. It is divided into two main steps:

Selection of the principal axes of the data.
An adaptive Gibbs sampler that draws a set of samples from the full conditional posteriors of the parameters of interest, on which inference is then based.

The function takes several inputs: a rectangular (n, D) matrix Y, on which a fast rank-d SVD is run; a conservative upper bound d on the true dimension of the data; and a set of tuning parameters. The upper bound d must be chosen so that d > p, where p is the true dimension, and d << D.
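
The first step can be illustrated with base R. The sketch below extracts d candidate principal axes from a simulated matrix with a truncated SVD. It uses `svd()` from base R rather than the package's fast rank-d SVD, and centring the columns first is an assumption made here for illustration; the variable names are hypothetical.

```r
set.seed(1)
n <- 100   # number of observations
D <- 50    # ambient dimension
d <- 5     # conservative upper bound, chosen so that d << D

Y <- matrix(rnorm(n * D), n, D)

# ASSUMPTION: centre the columns before extracting directions.
Yc <- scale(Y, center = TRUE, scale = FALSE)

# Ask svd() for the first d right singular vectors only.
sv <- svd(Yc, nu = 0, nv = d)
W <- sv$v            # D x d matrix with orthonormal columns

# Coordinates of the data in the d-dimensional subspace.
scores <- Yc %*% W   # n x d matrix

print(dim(W))
print(dim(scores))
```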

Value

rgeode returns a list containing the following components:

`InD`	array_like The chosen principal axes.
`u`	matrix Containing the samples from the full conditional posterior of the `u_j`s. Each column stores one iteration.
`tau`	matrix Containing the sample from the full conditional posterior of `tau_j`s.
`sigmaS`	array_like Containing the sample from the full conditional posterior of `sigma^2`.
`W`	matrix Containing the principal singular vectors.
`Miss`	list Containing all the information about missing data. If there are no missing data, this output is not provided.
    `id_m`	array The set of rows with missing data.
    `pos_m`	list The set of missing-data positions for each row with missing values.
    `yms`	list The pseudo-observations that replace the missing data. Each element of the list holds the simulated data for one iteration.
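
A typical way to use the returned samples is to discard the burn-in and summarise what remains. The sketch below does this for `sigmaS` on a mock object: `fit` is a hypothetical stand-in filled with random draws, not real `rgeode` output, and it assumes that `sigmaS` holds `burn + its` values in iteration order.

```r
# Mock object standing in for the list returned by rgeode();
# the draws are random numbers used only to make the sketch runnable.
set.seed(1)
burn <- 1000
its <- 2000
fit <- list(sigmaS = rgamma(burn + its, shape = 5, rate = 10))

# ASSUMPTION: sigmaS stores burn + its draws in iteration order,
# so the post-burn-in draws are the last `its` entries.
keep <- (burn + 1):(burn + its)
sigma2_draws <- fit$sigmaS[keep]

# Posterior mean and a 95% equal-tailed credible interval.
post_mean <- mean(sigma2_draws)
post_ci <- quantile(sigma2_draws, probs = c(0.025, 0.975))

print(post_mean)
print(post_ci)
```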

Note

The `Miss` component is returned only when the input data contain missing values.
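
To make the meaning of `id_m` and `pos_m` concrete, the sketch below computes analogous quantities for a small matrix with a few `NaN` entries. It only illustrates what the two components describe and is not the package's own code; the matrix and the missing positions are made up.

```r
set.seed(1)
Y <- matrix(rnorm(20), nrow = 5, ncol = 4)

# Introduce missing entries in rows 2 and 4.
Y[2, 3] <- NaN
Y[4, c(1, 4)] <- NaN

# is.na() is TRUE for both NA and NaN.
miss <- is.na(Y)

# Rows that contain at least one missing value (analogue of id_m).
id_m <- which(rowSums(miss) > 0)

# Missing column positions for each of those rows (analogue of pos_m).
pos_m <- lapply(id_m, function(i) which(miss[i, ]))

print(id_m)
print(pos_m)
```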

Author(s)

L. Rimella, lorenzo.rimella@hotmail.it

References

[1] Y. Wang, A. Canale, D. Dunson. "Scalable Geometric Density Estimation" (2016).

Examples


library(MASS)
library(RGeode)

####################################################################
# WITHOUT MISSING DATA
####################################################################
# Define the dataset
D= 200
n= 500
d= 10
d_true= 3

set.seed(321)

mu_true= runif(d_true, -3, 10)

Sigma_true= matrix(0,d_true,d_true)
diag(Sigma_true)= c(runif(d_true, 10, 100))

W_true = svd(matrix(rnorm(D*d_true, 0, 1), d_true, D))$v

sigma_true = abs(runif(1,0,1))

mu= W_true%*%mu_true
C= W_true %*% Sigma_true %*% t(W_true)+ sigma_true* diag(D)

y= mvrnorm(n, mu, C)

################################
# GEODE: Without missing data
################################

start.time <- Sys.time() 
GEODE= rgeode(Y= y, d)
Sys.time()- start.time

# SIGMAS
#plot(seq(110,3000,by=1),GEODE$sigmaS[110:3000],ty='l',col=2,
#     xlab= 'Iteration', ylab= 'sigma^2', main= 'Simulation of sigma^2')
#abline(v=800,lwd= 2, col= 'blue')
#legend('bottomright',c('Posterior of sigma^2', 'Stopping time'),
#       lwd=c(1,2),col=c(2,4),cex=0.55, border='black', box.lwd=3)
       
       
####################################################################
# WITH MISSING DATA
####################################################################

###########################
#Insert NaN
n_m = 5 #number of data vectors containing missing features
d_m = 1  #number of missing features

data_miss= sample(seq(1,n),n_m)

features= sample(seq(1,D), d_m)
for(i in 2:n_m)
{
  features= rbind(features, sample(seq(1,D), d_m))
}

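# Set the sampled features to NaN in each selected row
for(i in 1:length(data_miss))
{
  if(i==length(data_miss))
  {
    # Last row: the first sampled feature is dropped, so one fewer entry
    # is set to NaN (with d_m = 1 this row keeps all of its values)
    y[data_miss[i],features[i,][-1]]= NaN
  }
  else
  {
    y[data_miss[i],features[i,]]= NaN
  }
}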

################################
# GEODE: With missing data
################################
set.seed(321)
start.time <- Sys.time() 
GEODE= rgeode(Y= y, d)
Sys.time()- start.time

# SIGMAS
#plot(seq(110,3000,by=1),GEODE$sigmaS[110:3000],ty='l',col=2,
#     xlab= 'Iteration', ylab= 'sigma^2', main= 'Simulation of sigma^2')
#abline(v=800,lwd= 2, col= 'blue')
#legend('bottomright',c('Posterior of sigma^2', 'Stopping time'),
#       lwd=c(1,2),col=c(2,4),cex=0.55, border='black', box.lwd=3)



####################################################################
####################################################################

[Package RGeode version 0.1.0 Index]