fastcpd {fastcpd}    R Documentation
Find change points efficiently
Description
fastcpd()
takes in formulas, data, families and extra
parameters and returns a fastcpd object.
Usage
fastcpd(
formula = y ~ . - 1,
data,
beta = "MBIC",
cost_adjustment = "MBIC",
family = NULL,
cost = NULL,
cost_gradient = NULL,
cost_hessian = NULL,
line_search = c(1),
lower = rep(-Inf, p),
upper = rep(Inf, p),
pruning_coef = 0,
segment_count = 10,
trim = 0.02,
momentum_coef = 0,
multiple_epochs = function(x) 0,
epsilon = 1e-10,
order = c(0, 0, 0),
p = ncol(data) - 1,
cp_only = FALSE,
vanilla_percentage = 0,
warm_start = FALSE,
...
)
Arguments
formula
A formula object specifying the model to be fitted. The (optional) response variable should be on the LHS of the formula, while the covariates should be on the RHS. The naming of variables used in the formula should be consistent with the column names in the data frame provided in data.
data
A data frame of dimension n x (p + 1) containing the data to be segmented, where n is the number of data points and p is the number of covariates.
beta
Penalty criterion for the number of change points. This parameter takes a string value of "BIC", "MBIC" or "MDL", or a numeric value. If a numeric value is provided, it is used directly as the penalty; otherwise the penalty is derived from the chosen criterion. The default is "MBIC".
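For instance, the Huber regression example below supplies the penalty as a number:

beta = (p + 1) * log(n) / 2

where p is the number of covariates and n is the number of data points.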
cost_adjustment
Cost adjustment criterion. It can be "BIC", "MBIC", "MDL" or NULL, in which case no cost adjustment is applied. The default is "MBIC".
family
Family class of the change point model. It can be "lm", "binomial", "poisson", "lasso", "mean", "variance", "meanvariance", "ar", "var", "arma", "arima" or "garch". If NULL (the default), a custom cost function must be supplied via the cost argument.
cost
Cost function to be used. If NULL (the default), the built-in cost for the chosen family is used. A custom cost function can be provided in one of two formats:

cost = function(data) {...}

cost = function(data, theta) {...}

Users can specify a loss function using the second format, which will then be used to calculate the cost value. In both formats, the input data is a subset of the original data frame in the form of a matrix (a matrix with a single column in the case of a univariate data set). In the first format, the specified cost function directly calculates the cost value.
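As an illustration, a least-squares loss in the second format could look as follows (a sketch in the spirit of the Huber example below, not the package's built-in cost):

cost = function(data, theta) {
  residual <- data[, 1] - data[, -1, drop = FALSE] %*% theta
  sum(residual^2) / 2
}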
cost_gradient
Gradient of the custom cost function. Example usage:

cost_gradient = function(data, theta) {
  ...
  return(gradient)
}

The gradient function takes two inputs, the first being a matrix representing a segment of the data, similar to the format used in the cost function, and the second being the parameter to be optimized. It returns the gradient of the cost function evaluated at the given segment and parameter.
cost_hessian
Hessian of the custom loss function. The Hessian function takes two inputs, the first being a matrix representing a segment of the data, similar to the format used in the cost function, and the second being the parameter to be optimized. It returns the Hessian of the cost function evaluated at the given segment and parameter.
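Example usage, mirroring cost_gradient:

cost_hessian = function(data, theta) {
  ...
  return(hessian)
}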
line_search
If a vector of numeric values is provided, a line search will be performed over these step sizes to find the optimal one for each update. Detailed usage of line_search can be found in the references.
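For example, a decreasing grid of candidate step sizes (an illustrative choice; the default c(1) performs no search) can be supplied as:

line_search = c(1, 0.5, 0.1)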
lower
Lower bound for the parameters. Used to specify the domain of the parameters after each gradient descent step. If not specified, the lower bound is set to be -Inf for all parameters.
upper
Upper bound for the parameters. Used to specify the domain of the parameters after each gradient descent step. If not specified, the upper bound is set to be Inf for all parameters.
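For instance, to restrict all parameters to be non-negative (a hypothetical constraint, purely for illustration):

lower = rep(0, p)
upper = rep(Inf, p)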
pruning_coef
Pruning coefficient c_0 used in the pruning step of the PELT algorithm, with the default value 0. If cost_adjustment is specified as "MBIC" or "MDL", an additional adjustment term depending on the number of parameters is added to the pruning coefficient; see the references for details.
segment_count
An initial guess of the number of segments. If not specified, it defaults to 10. The initial guess affects the initial estimates of the parameters in SeGD.
trim
Trimming proportion for boundary change points, so that a change point too close to the boundary will not be counted. This parameter also specifies the minimum distance between two change points: if several change points have mutual distances smaller than trim * nrow(data), they are merged into one single change point.
momentum_coef
Momentum coefficient to be applied to each update. This parameter is useful when the loss function is poorly shaped, so that retaining momentum from the previous update is desirable. The default value is 0, meaning no momentum is maintained.
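For example, an illustrative setting such as

momentum_coef = 0.5

carries half of the previous update direction into each new step.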
multiple_epochs
A function can be specified such that an adaptive number of additional epochs is used to improve the algorithm's performance. Example usage:

multiple_epochs = function(segment_length) {
  if (segment_length < 100) 1 else 0
}

This function lets SeGD perform parameter updates with one additional epoch for each segment of length less than 100, and no additional epochs for segments of length greater than or equal to 100.
epsilon
A small constant to avoid numerical issues. Only used in the Hessian computation for logistic regression and Poisson regression.
order
Order of the AR(p), VAR(p) or ARIMA(p, d, q) model.
p
Number of covariates in the model. If not specified, the number of covariates will be inferred from the data, i.e., p = ncol(data) - 1.
cp_only
If TRUE, only the change points are returned; otherwise, the cost values and the estimated parameters for each segment are also returned. The default is FALSE.
vanilla_percentage
The parameter v is between zero and one. For each segment, when its length is no more than v * n, the cost value is computed by exactly minimizing the loss function over the parameter; when its length is greater than v * n, the cost value is approximated through SeGD. This parameter therefore interpolates between pure SeGD (v = 0) and the vanilla PELT algorithm (v = 1).
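For example, the lasso example at the end of this page sets

vanilla_percentage = 0.2

so that short candidate segments are handled by exact minimization and longer ones by SeGD.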
warm_start
If TRUE, the algorithm will use the estimated parameters from the previous segment as the initial value for the current segment.
...
Other parameters for specific models.
Value
A fastcpd object.
Gallery
https://github.com/doccstat/fastcpd/tree/main/tests/testthat/examples
References
Xingchi Li, Xianyang Zhang (2024). “fastcpd: Fast Change Point Detection in R.” arXiv:2404.05933, https://arxiv.org/abs/2404.05933.
Xianyang Zhang, Trisha Dawn (2023). “Sequential Gradient Descent and Quasi-Newton's Method for Change-Point Analysis.” In Ruiz, Francisco, Dy, Jennifer, van de Meent, Jan-Willem (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, 1129-1143.
See Also
fastcpd.family for the family-specific functions;
plot.fastcpd() for plotting the results;
summary.fastcpd() for summarizing the results.
Examples
if (requireNamespace("mvtnorm", quietly = TRUE)) {
set.seed(1)
n <- 200
p <- 4
d <- 2
x <- mvtnorm::rmvnorm(n, rep(0, p), diag(p))
theta_1 <- matrix(runif(8, -3, -1), nrow = p)
theta_2 <- matrix(runif(8, -1, 3), nrow = p)
y <- rbind(
x[1:125, ] %*% theta_1 + mvtnorm::rmvnorm(125, rep(0, d), 3 * diag(d)),
x[126:n, ] %*% theta_2 + mvtnorm::rmvnorm(75, rep(0, d), 3 * diag(d))
)
# Multivariate linear model with a single change point at t = 125.
result_mlm <- fastcpd(
cbind(y.1, y.2) ~ . - 1, cbind.data.frame(y = y, x = x), family = "lm"
)
summary(result_mlm)
}
if (
requireNamespace("mvtnorm", quietly = TRUE) &&
requireNamespace("stats", quietly = TRUE)
) {
set.seed(1)
n <- 400 + 300 + 500
p <- 5
x <- mvtnorm::rmvnorm(n, mean = rep(0, p), sigma = diag(p))
theta <- rbind(
mvtnorm::rmvnorm(1, mean = rep(0, p - 3), sigma = diag(p - 3)),
mvtnorm::rmvnorm(1, mean = rep(5, p - 3), sigma = diag(p - 3)),
mvtnorm::rmvnorm(1, mean = rep(9, p - 3), sigma = diag(p - 3))
)
theta <- cbind(theta, matrix(0, 3, 3))
theta <- theta[rep(seq_len(3), c(400, 300, 500)), ]
y_true <- rowSums(x * theta)
# Flip the sign of roughly 5% of the responses to create outliers.
factor <- c(
2 * stats::rbinom(400, size = 1, prob = 0.95) - 1,
2 * stats::rbinom(300, size = 1, prob = 0.95) - 1,
2 * stats::rbinom(500, size = 1, prob = 0.95) - 1
)
y <- factor * y_true + stats::rnorm(n)
data <- cbind.data.frame(y, x)
huber_threshold <- 1
huber_loss <- function(data, theta) {
residual <- data[, 1] - data[, -1, drop = FALSE] %*% theta
indicator <- abs(residual) <= huber_threshold
sum(
residual^2 / 2 * indicator +
huber_threshold * (
abs(residual) - huber_threshold / 2
) * (1 - indicator)
)
}
huber_loss_gradient <- function(data, theta) {
# The gradient is evaluated at the most recent observation only.
residual <- c(data[nrow(data), 1] - data[nrow(data), -1] %*% theta)
if (abs(residual) <= huber_threshold) {
-residual * data[nrow(data), -1]
} else {
-huber_threshold * sign(residual) * data[nrow(data), -1]
}
}
huber_loss_hessian <- function(data, theta) {
residual <- c(data[nrow(data), 1] - data[nrow(data), -1] %*% theta)
if (abs(residual) <= huber_threshold) {
outer(data[nrow(data), -1], data[nrow(data), -1])
} else {
# A small ridge matrix keeps the Hessian positive definite in the linear tail.
0.01 * diag(length(theta))
}
}
huber_regression_result <- fastcpd(
formula = y ~ . - 1,
data = data,
beta = (p + 1) * log(n) / 2,
cost = huber_loss,
cost_gradient = huber_loss_gradient,
cost_hessian = huber_loss_hessian
)
summary(huber_regression_result)
}
set.seed(1)
p <- 5
x <- matrix(rnorm(375 * p, 0, 1), ncol = p)
theta <- rbind(rnorm(p, 0, 1), rnorm(p, 2, 1))
y <- c(
rbinom(200, 1, 1 / (1 + exp(-x[1:200, ] %*% theta[1, ]))),
rbinom(175, 1, 1 / (1 + exp(-x[201:375, ] %*% theta[2, ])))
)
data <- data.frame(y = y, x = x)
result_builtin <- suppressWarnings(fastcpd.binomial(data))
logistic_loss <- function(data, theta) {
x <- data[, -1]
y <- data[, 1]
u <- x %*% theta
nll <- -y * u + log(1 + exp(u))
# For large u, log(1 + exp(u)) is numerically u; this branch avoids overflow.
nll[u > 10] <- -y[u > 10] * u[u > 10] + u[u > 10]
sum(nll)
}
logistic_loss_gradient <- function(data, theta) {
x <- data[nrow(data), -1]
y <- data[nrow(data), 1]
c(-(y - 1 / (1 + exp(-x %*% theta)))) * x
}
logistic_loss_hessian <- function(data, theta) {
x <- data[nrow(data), -1]
prob <- 1 / (1 + exp(-x %*% theta))
(x %o% x) * c((1 - prob) * prob)
}
result_custom <- fastcpd(
formula = y ~ . - 1,
data = data,
epsilon = 1e-5,
cost = logistic_loss,
cost_gradient = logistic_loss_gradient,
cost_hessian = logistic_loss_hessian
)
cat(
"Change points detected by built-in logistic regression model: ",
result_builtin@cp_set, "\n",
"Change points detected by custom logistic regression model: ",
result_custom@cp_set, "\n",
sep = ""
)
result_custom_two_epochs <- fastcpd(
formula = y ~ . - 1,
data = data,
multiple_epochs = function(segment_length) 1, # one extra epoch for every segment
epsilon = 1e-5,
cost = logistic_loss,
cost_gradient = logistic_loss_gradient,
cost_hessian = logistic_loss_hessian
)
summary(result_custom_two_epochs)
if (requireNamespace("mvtnorm", quietly = TRUE)) {
set.seed(1)
n <- 480
p_true <- 6
p <- 50
x <- mvtnorm::rmvnorm(n, rep(0, p), diag(p))
theta_0 <- rbind(
runif(p_true, -5, -2),
runif(p_true, -3, 3),
runif(p_true, 2, 5),
runif(p_true, -5, 5)
)
theta_0 <- cbind(theta_0, matrix(0, ncol = p - p_true, nrow = 4))
y <- c(
x[1:80, ] %*% theta_0[1, ] + rnorm(80, 0, 1),
x[81:200, ] %*% theta_0[2, ] + rnorm(120, 0, 1),
x[201:320, ] %*% theta_0[3, ] + rnorm(120, 0, 1),
x[321:n, ] %*% theta_0[4, ] + rnorm(160, 0, 1)
)
small_lasso_data <- cbind.data.frame(y, x)
result_no_vp <- fastcpd.lasso(
small_lasso_data,
beta = "BIC",
cost_adjustment = NULL,
pruning_coef = 0
)
summary(result_no_vp)
result_20_vp <- fastcpd.lasso(
small_lasso_data,
beta = "BIC",
cost_adjustment = NULL,
vanilla_percentage = 0.2,
pruning_coef = 0
)
summary(result_20_vp)
}