fastcpd {fastcpd}    R Documentation
Find change points efficiently
Description
fastcpd()
takes in formulas, data, families and extra
parameters and returns a fastcpd object.
Usage
fastcpd(
formula = y ~ . - 1,
data,
beta = "MBIC",
cost_adjustment = "MBIC",
family = NULL,
cost = NULL,
cost_gradient = NULL,
cost_hessian = NULL,
line_search = c(1),
lower = rep(-Inf, p),
upper = rep(Inf, p),
pruning_coef = 0,
segment_count = 10,
trim = 0.02,
momentum_coef = 0,
multiple_epochs = function(x) 0,
epsilon = 1e-10,
order = c(0, 0, 0),
p = ncol(data) - 1,
cp_only = FALSE,
vanilla_percentage = 0,
warm_start = FALSE,
...
)
Arguments
formula
A formula object specifying the model to be fitted. The (optional) response variable should be on the LHS of the formula, while the covariates should be on the RHS. The naming of variables used in the formula should be consistent with the column names in the data frame provided in data.
data
A data frame of dimension n x (p + 1) containing the data to be segmented, where n is the number of data points and p is the number of covariates.
beta
Penalty criterion for the number of change points. This parameter takes a string value of "BIC", "MBIC" or "MDL", or a numeric value. If a numeric value is provided, it is used directly as the penalty; otherwise the penalty is derived from the chosen criterion. The default is "MBIC".
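For instance, the Huber regression example below supplies the penalty as a number:

beta = (p + 1) * log(n) / 2

where p is the number of covariates and n is the number of data points.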
cost_adjustment
Cost adjustment criterion. It can be "BIC", "MBIC", "MDL" or NULL, in which case no cost adjustment is applied. The default is "MBIC".
family
Family class of the change point model. It can be "lm", "binomial", "poisson", "lasso", "mean", "variance", "meanvariance", "ar", "var", "arma", "arima" or "garch". If NULL (the default), a custom cost function must be supplied via the cost argument.
cost
Cost function to be used. If NULL (the default), the built-in cost for the chosen family is used. A custom cost function can be provided in one of two formats:

cost = function(data) {...}

cost = function(data, theta) {...}

Users can specify a loss function using the second format, which will then be used to calculate the cost value. In both formats, the input data is a subset of the original data frame in the form of a matrix (a matrix with a single column in the case of a univariate data set). In the first format, the specified cost function directly calculates the cost value.
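As an illustration, a least-squares loss in the second format could look as follows (a sketch in the spirit of the Huber example below, not the package's built-in cost):

cost = function(data, theta) {
  residual <- data[, 1] - data[, -1, drop = FALSE] %*% theta
  sum(residual^2) / 2
}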
cost_gradient
Gradient of the custom cost function. Example usage:

cost_gradient = function(data, theta) {
  ...
  return(gradient)
}

The gradient function takes two inputs, the first being a matrix representing a segment of the data, similar to the format used in the cost function, and the second being the parameter to be optimized. It returns the gradient of the cost function evaluated at the given segment and parameter.
cost_hessian
Hessian of the custom loss function. The Hessian function takes two inputs, the first being a matrix representing a segment of the data, similar to the format used in the cost function, and the second being the parameter to be optimized. It returns the Hessian of the cost function evaluated at the given segment and parameter.
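Example usage, mirroring cost_gradient:

cost_hessian = function(data, theta) {
  ...
  return(hessian)
}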
line_search
If a vector of numeric values is provided, a line search will be performed over these step sizes to find the optimal one for each update. Detailed usage of line_search can be found in the references.
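For example, a decreasing grid of candidate step sizes (an illustrative choice; the default c(1) performs no search) can be supplied as:

line_search = c(1, 0.5, 0.1)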
lower
Lower bound for the parameters. Used to specify the domain of the parameters after each gradient descent step. If not specified, the lower bound is set to be -Inf for all parameters.
upper
Upper bound for the parameters. Used to specify the domain of the parameters after each gradient descent step. If not specified, the upper bound is set to be Inf for all parameters.
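For instance, to restrict all parameters to be non-negative (a hypothetical constraint, purely for illustration):

lower = rep(0, p)
upper = rep(Inf, p)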
pruning_coef
Pruning coefficient c_0 used in the pruning step of the PELT algorithm, with the default value 0. If cost_adjustment is specified as "MBIC" or "MDL", an additional adjustment term depending on the number of parameters is added to the pruning coefficient; see the references for details.
segment_count
An initial guess of the number of segments. If not specified, it defaults to 10. The initial guess affects the initial estimates of the parameters in SeGD.
trim
Trimming proportion for boundary change points, so that a change point too close to the boundary will not be counted. This parameter also specifies the minimum distance between two change points: if several change points have mutual distances smaller than trim * nrow(data), they are merged into one single change point.
momentum_coef
Momentum coefficient to be applied to each update. This parameter is useful when the loss function is poorly shaped, so that retaining momentum from the previous update is desirable. The default value is 0, meaning no momentum is maintained.
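For example, an illustrative setting such as

momentum_coef = 0.5

carries half of the previous update direction into each new step.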
multiple_epochs
A function can be specified such that an adaptive number of additional epochs is used to improve the algorithm's performance. Example usage:

multiple_epochs = function(segment_length) {
  if (segment_length < 100) 1 else 0
}

This function lets SeGD perform parameter updates with one additional epoch for each segment of length less than 100, and no additional epochs for segments of length greater than or equal to 100.
epsilon
A small constant to avoid numerical issues. Only used in the Hessian computation for logistic regression and Poisson regression.
order
Order of the AR(p), VAR(p) or ARIMA(p, d, q) model.
p
Number of covariates in the model. If not specified, the number of covariates will be inferred from the data, i.e., p = ncol(data) - 1.
cp_only
If TRUE, only the change points are returned; otherwise, the cost values and the estimated parameters for each segment are also returned. The default is FALSE.
vanilla_percentage
The parameter v is between zero and one. For each segment, when its length is no more than v * n, the cost value is computed by exactly minimizing the loss function over the parameter; when its length is greater than v * n, the cost value is approximated through SeGD. This parameter therefore interpolates between pure SeGD (v = 0) and the vanilla PELT algorithm (v = 1).
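For example, the lasso example at the end of this page sets

vanilla_percentage = 0.2

so that short candidate segments are handled by exact minimization and longer ones by SeGD.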
warm_start
If TRUE, the algorithm will use the estimated parameters from the previous segment as the initial value for the current segment.
...
Other parameters for specific models.
Value
A fastcpd object.
Gallery
https://github.com/doccstat/fastcpd/tree/main/tests/testthat/examples
References
Xingchi Li, Xianyang Zhang (2024). “fastcpd: Fast Change Point Detection in R.” arXiv:2404.05933, https://arxiv.org/abs/2404.05933.
Xianyang Zhang, Trisha Dawn (2023). “Sequential Gradient Descent and Quasi-Newton's Method for Change-Point Analysis.” In Ruiz, Francisco, Dy, Jennifer, van de Meent, Jan-Willem (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, 1129-1143.
See Also
fastcpd.family for the family-specific functions;
plot.fastcpd() for plotting the results;
summary.fastcpd() for summarizing the results.
Examples
if (requireNamespace("mvtnorm", quietly = TRUE)) {
set.seed(1)
n <- 200
p <- 4
d <- 2
x <- mvtnorm::rmvnorm(n, rep(0, p), diag(p))
theta_1 <- matrix(runif(8, -3, -1), nrow = p)
theta_2 <- matrix(runif(8, -1, 3), nrow = p)
y <- rbind(
x[1:125, ] %*% theta_1 + mvtnorm::rmvnorm(125, rep(0, d), 3 * diag(d)),
x[126:n, ] %*% theta_2 + mvtnorm::rmvnorm(75, rep(0, d), 3 * diag(d))
)
# Multivariate linear model with a single change point at t = 125.
result_mlm <- fastcpd(
cbind(y.1, y.2) ~ . - 1, cbind.data.frame(y = y, x = x), family = "lm"
)
summary(result_mlm)
}
if (
requireNamespace("mvtnorm", quietly = TRUE) &&
requireNamespace("stats", quietly = TRUE)
) {
set.seed(1)
n <- 400 + 300 + 500
p <- 5
x <- mvtnorm::rmvnorm(n, mean = rep(0, p), sigma = diag(p))
theta <- rbind(
mvtnorm::rmvnorm(1, mean = rep(0, p - 3), sigma = diag(p - 3)),
mvtnorm::rmvnorm(1, mean = rep(5, p - 3), sigma = diag(p - 3)),
mvtnorm::rmvnorm(1, mean = rep(9, p - 3), sigma = diag(p - 3))
)
theta <- cbind(theta, matrix(0, 3, 3))
theta <- theta[rep(seq_len(3), c(400, 300, 500)), ]
y_true <- rowSums(x * theta)
# Flip the sign of roughly 5% of the responses to create outliers.
factor <- c(
2 * stats::rbinom(400, size = 1, prob = 0.95) - 1,
2 * stats::rbinom(300, size = 1, prob = 0.95) - 1,
2 * stats::rbinom(500, size = 1, prob = 0.95) - 1
)
y <- factor * y_true + stats::rnorm(n)
data <- cbind.data.frame(y, x)
huber_threshold <- 1
huber_loss <- function(data, theta) {
residual <- data[, 1] - data[, -1, drop = FALSE] %*% theta
indicator <- abs(residual) <= huber_threshold
sum(
residual^2 / 2 * indicator +
huber_threshold * (
abs(residual) - huber_threshold / 2
) * (1 - indicator)
)
}
huber_loss_gradient <- function(data, theta) {
# The gradient is evaluated at the most recent observation only.
residual <- c(data[nrow(data), 1] - data[nrow(data), -1] %*% theta)
if (abs(residual) <= huber_threshold) {
-residual * data[nrow(data), -1]
} else {
-huber_threshold * sign(residual) * data[nrow(data), -1]
}
}
huber_loss_hessian <- function(data, theta) {
residual <- c(data[nrow(data), 1] - data[nrow(data), -1] %*% theta)
if (abs(residual) <= huber_threshold) {
outer(data[nrow(data), -1], data[nrow(data), -1])
} else {
# A small ridge matrix keeps the Hessian positive definite in the linear tail.
0.01 * diag(length(theta))
}
}
huber_regression_result <- fastcpd(
formula = y ~ . - 1,
data = data,
beta = (p + 1) * log(n) / 2,
cost = huber_loss,
cost_gradient = huber_loss_gradient,
cost_hessian = huber_loss_hessian
)
summary(huber_regression_result)
}
set.seed(1)
p <- 5
x <- matrix(rnorm(375 * p, 0, 1), ncol = p)
theta <- rbind(rnorm(p, 0, 1), rnorm(p, 2, 1))
y <- c(
rbinom(200, 1, 1 / (1 + exp(-x[1:200, ] %*% theta[1, ]))),
rbinom(175, 1, 1 / (1 + exp(-x[201:375, ] %*% theta[2, ])))
)
data <- data.frame(y = y, x = x)
result_builtin <- suppressWarnings(fastcpd.binomial(data))
logistic_loss <- function(data, theta) {
x <- data[, -1]
y <- data[, 1]
u <- x %*% theta
nll <- -y * u + log(1 + exp(u))
# For large u, log(1 + exp(u)) is numerically u; this branch avoids overflow.
nll[u > 10] <- -y[u > 10] * u[u > 10] + u[u > 10]
sum(nll)
}
logistic_loss_gradient <- function(data, theta) {
x <- data[nrow(data), -1]
y <- data[nrow(data), 1]
c(-(y - 1 / (1 + exp(-x %*% theta)))) * x
}
logistic_loss_hessian <- function(data, theta) {
x <- data[nrow(data), -1]
prob <- 1 / (1 + exp(-x %*% theta))
(x %o% x) * c((1 - prob) * prob)
}
result_custom <- fastcpd(
formula = y ~ . - 1,
data = data,
epsilon = 1e-5,
cost = logistic_loss,
cost_gradient = logistic_loss_gradient,
cost_hessian = logistic_loss_hessian
)
cat(
"Change points detected by built-in logistic regression model: ",
result_builtin@cp_set, "\n",
"Change points detected by custom logistic regression model: ",
result_custom@cp_set, "\n",
sep = ""
)
result_custom_two_epochs <- fastcpd(
formula = y ~ . - 1,
data = data,
multiple_epochs = function(segment_length) 1, # one extra epoch for every segment
epsilon = 1e-5,
cost = logistic_loss,
cost_gradient = logistic_loss_gradient,
cost_hessian = logistic_loss_hessian
)
summary(result_custom_two_epochs)
if (requireNamespace("mvtnorm", quietly = TRUE)) {
set.seed(1)
n <- 480
p_true <- 6
p <- 50
x <- mvtnorm::rmvnorm(n, rep(0, p), diag(p))
theta_0 <- rbind(
runif(p_true, -5, -2),
runif(p_true, -3, 3),
runif(p_true, 2, 5),
runif(p_true, -5, 5)
)
theta_0 <- cbind(theta_0, matrix(0, ncol = p - p_true, nrow = 4))
y <- c(
x[1:80, ] %*% theta_0[1, ] + rnorm(80, 0, 1),
x[81:200, ] %*% theta_0[2, ] + rnorm(120, 0, 1),
x[201:320, ] %*% theta_0[3, ] + rnorm(120, 0, 1),
x[321:n, ] %*% theta_0[4, ] + rnorm(160, 0, 1)
)
small_lasso_data <- cbind.data.frame(y, x)
result_no_vp <- fastcpd.lasso(
small_lasso_data,
beta = "BIC",
cost_adjustment = NULL,
pruning_coef = 0
)
summary(result_no_vp)
result_20_vp <- fastcpd.lasso(
small_lasso_data,
beta = "BIC",
cost_adjustment = NULL,
vanilla_percentage = 0.2,
pruning_coef = 0
)
summary(result_20_vp)
}