R: Estimation of the Spectral Measure

spectralmeas {copBasic}

R Documentation

Estimation of the Spectral Measure

Description

Kiriliouk et al. (2016, pp. 360–364) describe estimation of the spectral measure of bivariate data. Standardize the bivariate data as X^\star and Y^\star as in psepolar and select a “large” value for the pseudo-polar radius S_f for nonexceedance probability f. Estimate the spectral measure H(w), which is the limiting distribution of the pseudo-polar angle component W given that the corresponding radial component S is large:

\mathrm{Pr}[W \in \cdot | S > S_f] \rightarrow H(w) \mbox{\ as\ } S_f \rightarrow \infty\mbox{.}

So, H(w) is the cumulative distribution function of the spectral measure for angle w \in (0,1). The S_f can be specified by a nonexceedance probability F for S_f(F).

The estimation proceeds as follows:

Step 1: Convert the bivariate data (X_i, Y_i) into (\widehat{S}_i, \widehat{W}_i) by psepolar and set the threshold S_f according to “n/k” (this part involving k does not make sense in Kiriliouk et al. (2016)) where for present implementation in copBasic the S_f given the f by the user is based on the empirical distribution of the \widehat{S}_i. The empirical distribution is estimated by the Bernstein empirical distribution function from the lmomco package.

Step 2: Let I_n denote the set of indices that correspond to the observations when \widehat{S}_i \ge S_f and compute N_n as the cardinality of N_n = |I_n|, which simply means the length of the vector I_n.

Step 3: Use the maximum Euclidean likelihood estimator, which is the third of three methods mentioned by Kiriliouk et al. (2016):

\widehat{H}_3(w) = \sum_{i \in I_n} \hat{p}_{3,i} \times \mathbf{1}[\widehat{W}_i \le w ]\mbox{,}

where \mathbf{1}[\cdot] is an indicator function that is only triggered if smooth=FALSE, and following the notation of Kiriliouk et al. (2016), the “3” represents maximum Euclidean likelihood estimation. The \hat{p}_{3,i} are are the weights

\hat{p}_{3,i} = \frac{1}{N_n}\bigl[ 1 - (\overline{W} - 1/2)S^{-2}_W(\widehat{W}_i - \overline{W}) \bigr]\mbox{,}

where \overline{W} is the sample mean and S^2_W is the sample variance of \widehat{W}_i

\overline{W} = \frac{1}{N_n} \sum_{i \in I_n} \widehat{W}_i\mbox{\quad and\quad } S^2_W = \frac{1}{N_n - 1} \sum_{i \in I_n} (\widehat{W}_i - \overline{W})^2\mbox{,}

where Kiriliouk et al. (2016, p. 363) do not show the N_n - 1 in the denominator for the variance but copBasic uses it because those authors state that the sample variance is used.

Step 4: A smoothed version of \hat{H}_3(w) is optionally available by

\tilde{H}_3(w) = \sum_{i \in I_n} \hat{p}_{3,i} \times \mathcal{B}(w;\, \widehat{W}_i\nu,\, (1-\widehat{W}_i)\nu)\mbox{,}

where \mathcal{B}(x; p, q) is the cumulative distribution function of the Beta distribution for p, q > 0 and where \nu > 0 is a smoothing parameter that can be optimized by cross validation.

Step 5: The spectral density lastly can be computed optionally as

\tilde{h}_3(w) = \sum_{i \in I_n} \hat{p}_{3,i} \times \beta(w;\, \widehat{W}_i\nu,\, (1-\widehat{W}_i)\nu)

where \beta(x; p, q) is the probability density function (pdf) of the Beta distribution. Readers are alerted to the absence of the \mathbf{1}[\cdot] indicator function in the definitions of \tilde{H}_3(w) and \tilde{h}_3(w). This is correct and matches Kiriliouk et al. (2016, eqs. 17.21 and 17.22) though this author was confused for a day or so by the indicator function in what is purported to be the core definition of \hat{H}_l(w) where l = 3 in Kiriliouk et al. (2016, eq. 17.21 and 17.17).

Usage

spectralmeas(u, v=NULL, w=NULL, f=0.90, snv=FALSE,
                                smooth=FALSE, nu=100, pdf=FALSE, ...)

Arguments

`u`	Nonexceedance probability `u` in the `X` direction (actually the ranks are used so this can be a real-value argument as well);
`v`	Nonexceedance probability `v` in the `Y` direction (actually the ranks are used so this can be a real-value argument as well) and if `NULL` then `u` is treated as a two column R `data.frame`;
`w`	A vector of polar angle values `W \in [0,1]` on which to compute the `H(w)`;
`f`	The nonexceedance probability `F(S_f)` of the pseudo-polar radius in `psepolar`;
`snv`	Return the standard normal variate of the `H` by the well-known transform through the quantile function of the standard normal, `qnorm()`;
`smooth`	A logical to return `\tilde{H}_3(w)` instead of `H_3(w)`;
`nu`	The `\nu > 0` smoothing parameter;
`pdf`	A logical to return the smoothed probability density `\tilde{h}_3(w)`. If `pdf=TRUE`, then internally `smooth=TRUE` will be set; and
`...`	Additional arguments to pass to the `psepolar` function.

Value

An R vector of H_3(w), \tilde{H}_3(w), or \tilde{h}_3(w) is returned.

Note

The purpose of this section is to describe a CPU-intensive study of goodness-of-fit between a Gumbel–Hougaard copula (GHcop, \mathbf{GH}(u,v; \Theta_1)) parent and a fitted Hüsler–Reiss copula (HRcop, \mathbf{HR}(u,v; \Theta_2)). Both of these copulas are extreme values and are somewhat similar to each other, so sample sizes necessary for detection of differences should be large. A two-sided Kolmogorov–Smirnov tests (KS test, ks.test()) is used to measure significance in the differences between the estimated spectral measure distributions at f=0.90 (the 90th percentile, F(S_f)) into the right tail.

The true copula will be the \mathbf{GH}(\Theta_1) having parameter \Theta_1 = 3.3. The number of simulations per sample size n \in seq(50,1000, by=25) is nsim = 500. For each sample size, a sample from the true parent is drawn, and a \mathbf{HR}(\Theta_2) fit by maximum likelihood (mleCOP). The two spectral measure distributions (\widehat{H}_{\mathbf{GH}}(w), Htru and \widehat{H}_{\mathbf{HR}}(w), Hfit) are estimated for a uniform variate of the angle W having length equal to the applicable sample size. The Kolmogorov–Smirnov (KS) test is made between Htru and Hfit, and number of p-values less than the \beta = 0.05 (Type II error because alternative hypothesis is rigged as true) and simulation count are returned and written in file Results.txt. The sample sizes initially are small and traps of NaN (abandonment of a simulation run) are made. These traps are needed when the empirical distribution of Htru or Hfit degenerates.

  Results <- NULL
  true.par <- 3.3; true.cop <- GHcop; fit.cop <- HRcop; search <- c(0,100)
  nsim <- 20000; first_time <- TRUE; f <- 0.90; beta <- 0.05
  ns <- c(seq(100,1000, by=50), 1250, 1500, 1750, 2000)
  for(n in ns) {
    W <- sort(runif(n)); PV <- vector(mode="numeric")
    for(i in 1:(nsim/(n/2))) {
      UV      <- simCOP(n=n, cop=true.cop, para=true.par, graphics=FALSE)
      fit.par <- mleCOP(UV,  cop= fit.cop, interval=search)$para
      UVfit   <- simCOP(n=n, cop= fit.cop, para=fit.par,  graphics=FALSE)
      Htru    <- spectralmeas(UV,    w=W, bound.type="Carv", f=f)
      Hfit    <- spectralmeas(UVfit, w=W, bound.type="Carv", f=f)
      if(length(Htru[! is.nan(Htru)]) != length(Hfit[! is.nan(Hfit)]) |
         length(Htru[! is.nan(Htru)]) == 0 |
         length(Hfit[! is.nan(Hfit)]) == 0) {
         PV[i] <- NA; next
      }   # suppressWarnings() to silence ties warnings from ks.test()
      KS <- suppressWarnings( stats::ks.test(Htru, Hfit)$p.value )
      #plot(FF, H, type="l"); lines(FF, Hfit, col=2); mtext(KS)
      message("-",i, appendLF=FALSE)
      PV[i] <- KS
    }
    message(":",n)
    zz <- data.frame(SampleSize=n, NumPVle0p05=sum(PV[! is.na(PV)] <= beta),
                            SimulationCount=length(PV[! is.na(PV)]))
    if(first_time) { Results <- zz; first_time <- FALSE; next }
    Results <- rbind(Results, zz)
  }

  plot(Results$SampleSize, 100*Results$NumPVle0p05/Results$SimulationCount,
       type="b", cex=1.1, xlab="Sample size",
       ylab="Percent simulations with p-value < 0.05")

The Results show a general increase in the counts of p-value \le 0.05 as sample size increases. There is variation of course and increasing nsim would smooth that considerably. The results show for n \approx 1{,}000 that the detection of statistically significant differences for extremal F(S_f) = 0.90 dependency between the \mathbf{GH}(\Theta_1{=}3.3) and \mathbf{HR}(\Theta_2) are detected at the error rate implied by the specified \beta = 0.05.

This range in sample size can be compared to the Kullback–Leibler sample size (n_{fg}):

  UV      <- simCOP(n=10000, cop=true.cop, para=true.par, graphics=FALSE)
  fit.par <- mleCOP(UV,      cop= fit.cop, interval=search)$para
  kullCOP(cop1=true.cop, para1=true.par,
          cop2=fit.cop,  para2=fit.par)$KL.sample.size
  # The Kullback-Leibler (integer) sample size for detection of differences at
  # alpha=0.05 are n_fg = (742, 809, 815, 826, 915, 973, 1203) for seven runs
  # Do more to see variation.

where the Kullback–Leilber approach is to measure density departures across the whole \mathcal{I}^2 domain as opposed to extremal dependency in the right tail as does the spectral measure.

Different runs of the above code will result in different n_{fg} in part because of simulation differences internal to kullCOP but also because the \Theta_2 has its own slight variation in its fit to the large sample simulation (n=10{,}000) of the parent. However, it seems that n_{fg} \approx 900 will be on the order of the n for which the KS test on the spectral measure determines statistical significance with similar error rate.

Now if the aforementioned simulation run is repeated for F(S_f) = 0.95 or f=0.95, the n_{fg} obviously remains unchanged at about 900 but the n for which the error rate is about \beta = 0.05 is n \approx 600. This sample size is clearly smaller than before and smaller than n_{fg}, therefore, the analysis of the empirical spectral measure deeper into the tail F(S_f) = 0.95 requires a smaller sample size to distinguish between the two copula. Though the analysis does not address the question as to whether one or both copula are adequate for the problem at hand. For a final comparison, if the aforementioned simulation run is repeated for F(S_f) = 0.80 or f=0.80, then the n for which the error rate is about \beta = 0.05 is n \approx 1{,}700. Thus as analysis is made further away from the tail into the center of the distribution, the sample size to distinguish between these two similar copula increases substantially.

Author(s)

William Asquith william.asquith@ttu.edu

References

Kiriliouk, Anna, Segers, Johan, Warchoł, Michał, 2016, Nonparameteric estimation of extremal dependence: in Extreme Value Modeling and Risk Analysis, D.K. Dey and Jun Yan eds., Boca Raton, FL, CRC Press, ISBN 978–1–4987–0129–7.

Examples

## Not run: 
UV <- simCOP(n=500, cop=HRcop, para=1.3, graphics=FALSE)
W <- seq(0,1,by=0.005)
Hu <- spectralmeas(UV, w=W)
Hs <- spectralmeas(UV, w=W, smooth=TRUE, nu=100)
plot(W,Hu, type="l", ylab="Spectral Measure H", xlab="Angle")
lines(W, Hs, col=2) #
## End(Not run)

## Not run: 
"GAUScop" <- function(u,v, para=NULL, ...) {
  if(length(u)==1) u<-rep(u,length(v)); if(length(v)==1) v<-rep(v,length(u))
  return(copula::pCopula(matrix(c(u,v), ncol=2), para))
}
GAUSparfn <- function(rho) return(copula::normalCopula(rho, dim = 2))
n <- 2000 # The PSP parent has no upper tail dependency
uv    <- simCOP(n=n, cop=PSP,      para=NULL, graphics=FALSE)
PLpar <- mleCOP(uv,  cop=PLcop,    interval=c(0,100))$para
PLuv  <- simCOP(n=n, cop=PLcop,    para=PLpar, graphics=FALSE)
GApar <- mleCOP(uv,  cop=GAUScop,  parafn=GAUSparfn, interval=c(-1,1))$para
GAuv  <- simCOP(n=n, cop=GAUScop,  para=GApar, graphics=FALSE)
GLpar <- mleCOP(uv,  cop=GLcop,    interval=c(0,100))$para
GLuv  <- simCOP(n=n, cop=GLcop,    para=GLpar, graphics=FALSE)
FF <- c(0.001,seq(0.005,0.995, by=0.005),0.999); qFF <- qnorm(FF)
f <- 0.90 # Seeking beyond the 90th percentile pseudo-polar radius
PSPh <- spectralmeas(  uv, w=FF, f=f, smooth=TRUE, snv=TRUE)
PLh  <- spectralmeas(PLuv, w=FF, f=f, smooth=TRUE, snv=TRUE)
GAh  <- spectralmeas(GAuv, w=FF, f=f, smooth=TRUE, snv=TRUE)
GLh  <- spectralmeas(GLuv, w=FF, f=f, smooth=TRUE, snv=TRUE)
plot(qFF, PSPh, type="l", lwd=2, xlim=c(-3,3), ylim=c(-2,2),
     xlab="STANDARD NORMAL VARIATE OF PSEUDO-POLAR ANGLE",
     ylab="STANDARD NORMAL VARIATE OF SPECTRAL MEASURE PROBABILITY")
lines(qFF, PLh, col=2) #  red  line is the Plackett copula
lines(qFF, GAh, col=3) # green line is the Gaussian copula
lines(qFF, GLh, col=4) #  blue line is the Galambos copula
# Notice the flat spot and less steep nature of the PSP (black line), which is
# indicative of no to even spreading tail dependency. The Plackett and Gaussian
# copulas show no specific steepening near the middle, which remains indicative
# of no tail dependency with the Plackett being less steep because it has a more
# dispersed copula density at the right tail is approached than the Gaussian.
# The Galambos copula has upper tail dependency, which is seen by
# the mass concentration and steepening of the curve on the plot.
## End(Not run)

[Package copBasic version 2.2.4 Index]