latentcor {latentcor} | R Documentation |
Estimate latent correlation for mixed types.
Description
Estimation of latent correlation matrix from observed data of (possibly) mixed types (continuous/binary/truncated/ternary) based on the latent Gaussian copula model. Missing values (NA) are allowed. The estimation is based on pairwise complete observations.
Usage
latentcor(
X,
types = NULL,
method = c("approx", "original"),
use.nearPD = TRUE,
nu = 0.001,
tol = 1e-08,
ratio = 0.9,
showplot = FALSE
)
Arguments
X |
A numeric matrix or numeric data frame (n by p), where n is number of samples, and p is number of variables. Missing values (NA) are allowed, in which case the estimation is based on pairwise complete observations. |
types |
A vector of length p indicating the type of each of the p variables in |
method |
The calculation method for latent correlations. Either |
use.nearPD |
Logical indicator. |
nu |
Shrinkage parameter for the correlation matrix, must be between 0 and 1. Guarantees that the minimal eigenvalue of returned correlation matrix is greater or equal to |
tol |
When |
ratio |
When |
showplot |
Logical indicator. |
Details
The function estimates latent correlation by calculating the inverse bridge function (Fan et al., 2017) evaluated at the value of sample Kendall's tau (\hat \tau). The bridge function F connects Kendall's tau to latent correlation r so that F(r) = E(\hat \tau). The form of F depends on the variable types (continuous/binary/truncated/ternary), but is known exactly in each case. The exact form of the inverse is not available, so it has to be evaluated numerically for each pair of variables, leading to Rpointwise.
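As a point of reference, the continuous/continuous pair is a well-known special case in which the bridge function is F(r) = (2/\pi) asin(r), so the inverse is explicit: r = sin(\pi \hat \tau / 2). The sketch below uses base R only and simulated data (the sample size, seed, and true correlation are arbitrary choices for illustration). It also shows that Kendall's tau is unchanged by monotone transformations of each variable, which is why the copula model can recover the latent correlation.

```r
set.seed(1)
n <- 2000
r_true <- 0.6

# Latent bivariate normal pair with correlation r_true.
z1 <- rnorm(n)
z2 <- r_true * z1 + sqrt(1 - r_true^2) * rnorm(n)

# Observed data: strictly increasing transformations of the latent variables.
x1 <- exp(z1)
x2 <- z2^3

# Kendall's tau depends only on ranks, so it is the same for (x1, x2) and (z1, z2).
tau_latent <- cor(z1, z2, method = "kendall")
tau_hat <- cor(x1, x2, method = "kendall")

# Inverse bridge function for the continuous/continuous case.
r_hat <- sin(pi / 2 * tau_hat)
r_hat
```

Pearson correlation computed on x1 and x2 would generally differ from r_true, whereas r_hat targets the latent correlation directly.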
When method = "original", the inversion is done numerically by solving
minimize_r (F(r) - \hat \tau)^2
using optimize. The parameter tol is used to control the accuracy of the solution.
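The numerical inversion can be sketched in base R as follows. This is an illustration, not the package's internal code. The binary/continuous bridge function used here, F(r; \Delta) = 4 \Phi_2(\Delta, 0; r / \sqrt{2}) - 2 \Phi(\Delta), is transcribed from Fan et al. (2017) and should be checked against the paper. The helper names pbvnorm, bridge_bc and invert_bridge are hypothetical, as is the search interval.

```r
# Bivariate standard normal CDF P(Z1 <= a, Z2 <= b) with correlation rho,
# computed by one-dimensional numerical integration (base R only).
pbvnorm <- function(a, b, rho) {
  integrand <- function(x) dnorm(x) * pnorm((b - rho * x) / sqrt(1 - rho^2))
  integrate(integrand, lower = -Inf, upper = a)$value
}

# ASSUMPTION: binary/continuous bridge function as transcribed from
# Fan et al. (2017); delta = qnorm(proportion of zeros).
bridge_bc <- function(r, delta) {
  4 * pbvnorm(delta, 0, r / sqrt(2)) - 2 * pnorm(delta)
}

# Numerical inversion: find r whose bridge value matches the observed tau.
invert_bridge <- function(tau_hat, delta, tol = 1e-8) {
  objective <- function(r) (bridge_bc(r, delta) - tau_hat)^2
  optimize(objective, interval = c(-0.99, 0.99), tol = tol)$minimum
}

delta <- qnorm(0.6)               # 60% zeros in the binary variable
tau_hat <- bridge_bc(0.5, delta)  # tau implied by a latent correlation of 0.5
r_hat <- invert_bridge(tau_hat, delta)
r_hat                             # close to 0.5
```

Because F is monotone in r, the squared objective has a single minimum on the interval, which is what makes a one-dimensional search such as optimize suitable here.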
When method = "approx", the inversion is done approximately by interpolating previously calculated and stored values of F^{-1}(\hat \tau). This is significantly faster than the original method (Yoon et al., 2021) for binary/ternary/truncated cases; however, the approximation errors may be non-negligible in some regions of the space. The parameter ratio controls the region where the interpolation is performed, with the default recommended value of 0.9 giving a good balance of accuracy and computational speed. Increasing the value of ratio may improve speed (but possibly sacrifice accuracy), whereas decreasing the value of ratio may improve accuracy (but possibly sacrifice speed). See Yoon et al. (2021) and the vignette for more details.
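The idea of replacing a numerical solve with a lookup can be sketched in one dimension with base R's approx. This is only an illustration under simplifying assumptions: the continuous/continuous bridge function is used as a stand-in because its exact inverse is available for comparison, the grid size is arbitrary, and the role of ratio is not modelled. The package's stored grids also depend on the zero proportions, which this sketch ignores.

```r
# Stand-in bridge function (continuous/continuous case), for illustration only.
bridge <- function(r) (2 / pi) * asin(r)

# Precompute once: latent correlations on a grid and the matching tau values.
r_grid <- seq(-0.99, 0.99, length.out = 201)
tau_grid <- bridge(r_grid)

# Approximate F^{-1}(tau) by linear interpolation over the stored grid.
inverse_bridge_approx <- function(tau_hat) {
  approx(x = tau_grid, y = r_grid, xout = tau_hat)$y
}

tau_hat <- c(-0.4, 0.1, 0.55)
r_approx <- inverse_bridge_approx(tau_hat)
r_exact <- sin(pi / 2 * tau_hat)

# Interpolation error on this grid.
max(abs(r_approx - r_exact))
```

Each lookup costs a single interpolation instead of an optimization run, which is where the speed gain comes from; the price is the interpolation error, which shrinks as the grid is refined.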
In case the pointwise estimator Rpointwise has negative eigenvalues, it is projected onto the space of positive semi-definite matrices using nearPD. The parameter nu further allows additional shrinkage towards the identity matrix (desirable in cases where the number of variables p is very large) as
R = (1 - \nu) \tilde R + \nu I,
where \tilde R is Rpointwise after projection by nearPD.
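The two adjustment steps can be sketched in base R as follows. The projection here is a simple eigenvalue clipping followed by rescaling to unit diagonal; it is a stand-in chosen to keep the example free of package dependencies and is not the algorithm used by nearPD. The function names project_psd and shrink and the example matrix are hypothetical.

```r
# Point-wise estimate that is not positive semi-definite (min eigenvalue < 0).
R_pointwise <- matrix(c( 1.0, 0.9, -0.9,
                         0.9, 1.0,  0.9,
                        -0.9, 0.9,  1.0), nrow = 3, byrow = TRUE)

# Stand-in for nearPD: clip negative eigenvalues at zero, then rescale
# back to unit diagonal. (nearPD itself uses an iterative algorithm.)
project_psd <- function(R) {
  e <- eigen(R, symmetric = TRUE)
  clipped <- e$vectors %*% diag(pmax(e$values, 0)) %*% t(e$vectors)
  cov2cor(clipped)
}

# Shrinkage towards the identity: R = (1 - nu) * R_tilde + nu * I.
shrink <- function(R_tilde, nu = 0.001) {
  (1 - nu) * R_tilde + nu * diag(nrow(R_tilde))
}

R_tilde <- project_psd(R_pointwise)
R <- shrink(R_tilde, nu = 0.001)

min(eigen(R_pointwise, symmetric = TRUE)$values)  # negative
min(eigen(R, symmetric = TRUE)$values)            # at least nu, up to rounding
```

Since \tilde R is positive semi-definite, every eigenvalue of R equals (1 - \nu) times an eigenvalue of \tilde R plus \nu, so the smallest eigenvalue of R cannot fall below \nu.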
Value
latentcor returns
zratios: A list of length p corresponding to each variable. Returns NA for a continuous variable; the proportion of zeros for binary/truncated variables; the cumulative proportions of zeros and ones (e.g. the first value is the proportion of zeros, the second value is the proportion of zeros and ones) for a ternary variable.
K: (p x p) Kendall Tau (Tau-a) matrix for X
R: (p x p) Estimated latent correlation matrix for X
Rpointwise: (p x p) Point-wise estimates of latent correlations for X. This matrix is not guaranteed to be positive semi-definite. This is the original estimated latent correlation matrix without adjustment for positive-definiteness.
plotR: Heatmap plot of latent correlation matrix R, NULL if showplot = FALSE
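For readers unfamiliar with the Tau-a variant reported in K, the following base R sketch computes it for a single pair of variables using pairwise complete observations. Tau-a divides the number of concordant minus discordant pairs by the total number of pairs, with no correction for ties. The function name kendall_tau_a is hypothetical and the implementation is a direct O(n^2) illustration, not the package's code.

```r
# Kendall's Tau-a for one pair of variables, using only rows where
# both values are observed (pairwise complete observations).
kendall_tau_a <- function(x, y) {
  ok <- complete.cases(x, y)
  x <- x[ok]
  y <- y[ok]
  n <- length(x)
  # sign products over all ordered pairs (i, j); tied pairs contribute 0
  sx <- sign(outer(x, x, "-"))
  sy <- sign(outer(y, y, "-"))
  sum(sx * sy) / (n * (n - 1))
}

# Five of six pairs are concordant and one is discordant: (5 - 1) / 6.
tau_full <- kendall_tau_a(1:4, c(1, 3, 2, 4))

# The row with the missing value is dropped before counting pairs.
tau_na <- kendall_tau_a(c(1, 2, NA, 3, 4), c(1, 3, 5, 2, 4))

# Ties in a binary variable count as zero in the numerator but still
# appear in the denominator.
tau_binary <- kendall_tau_a(c(1, 2, 3, 4), c(0, 0, 1, 1))
```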
References
Fan J., Liu H., Ning Y. and Zou H. (2017) "High dimensional semiparametric latent graphical model for mixed data" doi:10.1111/rssb.12168.
Yoon G., Carroll R.J. and Gaynanova I. (2020) "Sparse semiparametric canonical correlation analysis for data of mixed types" doi:10.1093/biomet/asaa007.
Yoon G., Müller C.L., Gaynanova I. (2021) "Fast computation of latent correlations" doi:10.1080/10618600.2021.1882468.
Examples
# Example 1 - truncated data type, same type for all variables
# Generate data
X = gen_data(n = 300, types = rep("tru", 5))$X
# Estimate latent correlation matrix with original method and check the timing
start_time = proc.time()
R_org = latentcor(X = X, types = "tru", method = "original")$R
proc.time() - start_time
# Estimate latent correlation matrix with approximation method and check the timing
start_time = proc.time()
R_approx = latentcor(X = X, types = "tru", method = "approx")$R
proc.time() - start_time
# Heatmap for latent correlation matrix.
Heatmap_R_approx = latentcor(X = X, types = "tru", method = "approx",
showplot = TRUE)$plotR
# Example 2 - ternary/continuous case
X = gen_data()$X
# Estimate latent correlation matrix with original method
R_nc_org = latentcor(X = X, types = c("ter", "con"), method = "original")$R
# Estimate latent correlation matrix with approximation method
R_nc_approx = latentcor(X = X, types = c("ter", "con"), method = "approx")$R