clustEff {clustEff}R Documentation

Cluster Effects Algorithm

Description

This function implements the algorithm to cluster curves of effects obtained from a quantile regression (qrcm; Frumento and Bottai, 2015) in which the coefficients are described by flexible parametric functions of the order of the quantile. This algorithm can be also used for clustering of curves observed in time, as in functional data analysis.

Usage

clustEff(Beta, Beta.lower = NULL, Beta.upper = NULL,
         k = c(2, min(5, (ncol(Beta)-1))), ask = FALSE, diss.mat, alpha = .5,
         step = c("both", "shape", "distance"),
         cut.method = c("mindist", "length", "conf.int"),
         method = "ward.D2", approx.spline = FALSE, nbasis = 50,
         conf.level = 0.9, stand = FALSE, plot = TRUE, trace = TRUE)

Arguments

Beta

A matrix n x q. q represents the number of curves to cluster and n is either the length of percentiles used in the quantile regression or the length of the time vector.

Beta.lower

A matrix n x q. q represents the number of lower interval of the curves to cluster and n the length of percentiles used in quantile regression. Used only if cluster.effects=TRUE.

Beta.upper

A matrix n x q. q represents the number of upper interval of the curves to cluster and n the length of percentiles used in quantile regression. Used only if cluster.effects=TRUE.

k

It represents the number of clusters to look for. If it is two-length vector (k.min - k.max) an optimization is performed, if it is a unique value it is fixed.

ask

If TRUE, after plotting the dendrogram, the user make is own choice about how many cluster to use.

diss.mat

a dissimilarity matrix, obtained by using distshape function.

alpha

It is the alpha-percentile used for computing the dissimilarity matrix. The default value is alpha=.5.

step

The steps used in computing the dissimilarity matrix. Default is "both"=("shape" and "distance")

cut.method

The method used in optimization step to look for the optimal number of clusters. Default is "mindist", however if Beta.lower and Beta.upper are available the suggested method is "conf.int".

method

The agglomeration method to be used.

approx.spline

If TRUE, Beta is approximated by a smooth spline.

nbasis

An integer variable specifying the number of basis functions. Only when approx.spline=TRUE

conf.level

the confidence level required.

stand

If TRUE, the argument Beta is standardized.

plot

If TRUE, dendrogram, boxplot and clusters are plotted.

trace

If TRUE, some informations are printed.

Details

Quantile regression models conditional quantiles of a response variabile, given a set of covariates. Assume that each coefficient can be expressed as a parametric function of p in the form:

\beta(p | \theta) = \theta_{0} + \theta_1 b_1(p) + \theta_2 b_2(p) + \ldots

where b_1(p), b_2(p, \ldots) are known functions of p.

Value

An object of class “clustEff”, a list containing the following items:

call

the matched call.

p

The percentiles used in quantile regression coefficient modeling or the time otherwise.

X

The curves matrix.

clusters

The vector of clusters.

X.mean

The mean curves matrix of dimension n x k.

X.mean.dist

The within cluster distance from the mean curve.

X.lower

The lower bound matrix.

X.mean.lower

The mean lower bound of dimension n x k.

X.upper

The upper bound matrix.

X.mean.upper

The mean upper bound of dimension n x k.

Signif.interval

The matrix of dimension n x k containing the intervals in which each mean lower and upper bounds don't include the zero.

k

The number of selected clusters.

diss.matrix

The dissimilarity matrix.

X.mean.diss

The within cluster dissimilarity.

oggSilhouette

An object of class “silhouette”.

oggHclust

An object of class “hclust”.

distance

A vector of goodness measures used to select the best number of clusters.

step

The selected step.

method

The used agglomeration method.

cut.method

The used method to select the best number of clusters.

alpha

The selected alpha-percentile.

Author(s)

Gianluca Sottile gianluca.sottile@unipa.it

References

Sottile, G., Adelfio, G. Clusters of effects curves in quantile regression models. Comput Stat 34, 551–569 (2019). https://doi.org/10.1007/s00180-018-0817-8

Sottile, G and Adelfio, G (2017). Clustering of effects through quantile regression. Proceedings 32nd International Workshop of Statistical Modeling, Groningen (NL), vol.2 127-130, https://iwsm2017.webhosting.rug.nl/IWSM_2017_V2.pdf.

Frumento, P., and Bottai, M. (2015). Parametric modeling of quantile regression coefficient functions. Biometrics, doi: 10.1111/biom.12410.

See Also

summary.clustEff, plot.clustEff, for summary and plotting. extract.object to extract useful objects for the clustering algorithm through a quantile regression coefficient modeling in a multivariate case.

Examples


# CURVES EFFECTS CLUSTERING

set.seed(1234)
n <- 300
q <- 2
k <- 5
x1 <- runif(n, 0, 5)
x2 <- runif(n, 0, 5)

X <- cbind(x1, x2)
rownames(X) <- 1:n
colnames(X) <- paste0("X", 1:q)

theta1 <- matrix(c(1, 1, 0, 0, 0, .5, 0, .5, 1, 2, .5, 0, 2, 1, .5),
                 ncol=k, byrow=TRUE)

theta2 <- matrix(c(1, 1, 0, 0, 0, -.3, 0, .5, 1, .5, -1.5, 0, -1, -.5, 1),
                 ncol=k, byrow=TRUE)

theta3 <- matrix(c(1, 1, 0, 0, 0, .3, 0, -.5, -1, 2, -.5, 0, 1, -.5, -1),
                 ncol=k, byrow=TRUE)

rownames(theta3) <- rownames(theta2) <- rownames(theta1) <-
    c("(intercept)", paste("X", 1:q, sep=""))
colnames(theta3) <- colnames(theta2) <- colnames(theta1) <-
    c("(intercept)", "qnorm(p)", "p", "p^2", "p^3")

Theta <- list(theta1, theta2, theta3)

B <- function(p, k){matrix(cbind(1, qnorm(p), p, p^2, p^3), nrow=k, byrow=TRUE)}
Q <- function(p, theta, B, k, X){rowSums(X * t(theta %*% B(p, k)))}

Y <- matrix(NA, nrow(X), 15)
for(i in 1:15){
  if(i <= 5) Y[, i] <- Q(runif(n), Theta[[1]], B, k, cbind(1, X))
  if(i <= 10 & i > 5) Y[, i] <- Q(runif(n), Theta[[2]], B, k, cbind(1, X))
  if(i <= 15 & i > 10) Y[, i] <- Q(runif(n), Theta[[3]], B, k, cbind(1, X))
}

XX <- extract.object(Y, X, intercept=TRUE, formula.p= ~ I(p) + I(p^2) + I(p^3))

obj <- clustEff(XX$X$X1, Beta.lower=XX$Xl$X1, Beta.upper=XX$Xr$X1, cut.method = "conf.int")
summary(obj)
plot(obj, xvar="clusters", col = 1:3)
plot(obj, xvar="dendrogram")
plot(obj, xvar="boxplot")

obj2 <- clustEff(XX$X$X2, Beta.lower=XX$Xl$X2, Beta.upper=XX$Xr$X2, cut.method = "conf.int")
summary(obj2)
plot(obj2, xvar="clusters", col=1:3)
plot(obj2, xvar="dendrogram")
plot(obj2, xvar="boxplot")


## Not run: 
set.seed(1234)
n <- 300
q <- 15
k <- 5
X <- matrix(rnorm(n*q), n, q); X <- scale(X)
rownames(X) <- 1:n
colnames(X) <- paste0("X", 1:q)

Theta <- matrix(c(1, 1, 0, 0, 0,
                  .5, 0, .5, 1, 1,
                  .5, 0, 1, 2, .5,
                   .5, 0, 1, 1, .5,
                  .5, 0, .5, 1, 1,
                   .5, 0, .5, 1, .5,
                 -1.5, 0, -.5, 1, 1,
                  -1, 0, .5, -1, -1,
                 -.5, 0, -.5, -1, .5,
                  -1, 0, .5, -1, -.5,
                -1.5, 0, -.5, -1, -.5,
                  2, 0, 1, 1.5, 2,
                  2, 0, .5, 1.5, 2,
                  2.5, 0, 1, 1, 2,
                  1.5, 0, 1.5, 1, 2,
                  3, 0, 2, 1, .5),
                 ncol=k, byrow=TRUE)
rownames(Theta) <- c("(intercept)", paste("X", 1:q, sep=""))
colnames(Theta) <- c("(intercept)", "qnorm(p)", "p", "p^2", "p^3")

B <- function(p, k){matrix(cbind(1, qnorm(p), p, p^2, p^3), nrow=k, byrow=TRUE)}
Q <- function(p, theta, B, k, X){rowSums(X * t(theta %*% B(p, k)))}

s <- matrix(1, q+1, k)
s[2:(q+1), 2] <- 0
s[1, 3:k] <- 0

Y <- Q(runif(n), Theta, B, k, cbind(1, X))
XX <- extract.object(Y, X, intercept = TRUE, formula.p= ~ I(p) + I(p^2) + I(p^3))

obj3 <- clustEff(XX$X, Beta.lower=XX$Xl, Beta.upper=XX$Xr, cut.method = "conf.int")
summary(obj3)

# changing the alpha-percentile clusters are correctly identified

obj4 <- clustEff(XX$X, Beta.lower=XX$Xl, Beta.upper=XX$Xr, cut.method = "conf.int",
                 alpha = 0.25)
summary(obj4)

# CURVES CLUSTERING IN FUNCTIONAL DATA ANALYSIS

set.seed(1234)
n <- 300
x <- 1:n/n

Y <- matrix(0, n, 30)

sigma2 <- 4*pmax(x-.2, 0) - 8*pmax(x-.5, 0) + 4*pmax(x-.8, 0)

mu <- sin(3*pi*x)
for(i in 1:10) Y[, i] <- mu + rnorm(length(x), 0, pmax(sigma2, 0))

mu <- cos(3*pi*x)
for(i in 11:23) Y[,i] <- mu + rnorm(length(x), 0, pmax(sigma2,0))

mu <- sin(3*pi*x)*cos(pi*x)
for(i in 24:28) Y[, i] <- mu + rnorm(length(x), 0, pmax(sigma2, 0))

mu <- 0 #sin(1/3*pi*x)*cos(2*pi*x)
for(i in 29:30) Y[, i] <- mu + rnorm(length(x), 0, pmax(sigma2, 0))

obj5 <- clustEff(Y)
summary(obj5)
plot(obj5, xvar="clusters", col=1:4)
plot(obj5, xvar="dendrogram")
plot(obj5, xvar="boxplot")

## End(Not run)


[Package clustEff version 0.2.0 Index]