create_lagged_df {forecastML}

R Documentation

Create model training and forecasting datasets with lagged, grouped, dynamic, and static features

Description

Create a list of datasets with lagged, grouped, dynamic, and static features to (a) train forecasting models for specified forecast horizons and (b) forecast into the future with a trained ML model.

Usage

create_lagged_df(
  data,
  type = c("train", "forecast"),
  method = c("direct", "multi_output"),
  outcome_col = 1,
  horizons,
  lookback = NULL,
  lookback_control = NULL,
  dates = NULL,
  frequency = NULL,
  dynamic_features = NULL,
  groups = NULL,
  static_features = NULL,
  predict_future = NULL,
  use_future = FALSE,
  keep_rows = FALSE
)

Arguments

`data`	A data.frame with the (a) target to be forecasted and (b) features/predictors. An optional date column can be given in the `dates` argument (required for grouped time series). Note that `forecastML` only works with regularly spaced date/time intervals and that missing rows, usually due to periods when no data was collected, will result in incorrect feature lags. Use `fill_gaps` to fill in any missing rows/data prior to running this function.
`type`	The type of dataset to return–(a) model training or (b) forecast prediction. The default is `train`.
`method`	The type of modeling dataset to create. `direct` returns 1 data.frame for each forecast horizon and `multi_output` returns 1 data.frame for simultaneously modeling all forecast horizons. The default is `direct`.
`outcome_col`	The column index–an integer–of the target to be forecasted. If `outcome_col != 1`, the outcome column will be moved to position 1 and `outcome_col` will be set to 1 internally.
`horizons`	A numeric vector of one or more forecast horizons, h, measured in dataset rows. If `dates` are given, a horizon of 1, for example, would equal 1 * `frequency` in calendar time.
`lookback`	A numeric vector giving the lags, in dataset rows, for creating the lagged features. All non-grouping, non-static, and non-dynamic features in the input dataset, `data`, are lagged by the same values. The outcome is also lagged by default. Exactly one of `lookback` or `lookback_control` must be specified.
`lookback_control`	A list of numeric vectors, specifying potentially unique lags for each feature. The length of the list should equal `ncol(data)` and be ordered the same as the columns in `data`. Lag values for any grouping, static, or dynamic feature columns are automatically coerced to 0 and not lagged. A `lookback_control` element of `list(NULL)` drops the corresponding column from the input dataset. Exactly one of `lookback` or `lookback_control` must be specified.
`dates`	A vector or 1-column data.frame of dates/times with class 'Date' or 'POSIXt'. The length of `dates` should equal `nrow(data)`. Required if `groups` are given.
`frequency`	Date/time frequency. Required if `dates` are given. A string taking the same input as `base::seq.Date(..., by = "frequency")` or `base::seq.POSIXt(..., by = "frequency")` e.g., '1 hour', '1 month', '7 days', '10 years' etc. The highest frequency supported at present is '1 sec'.
`dynamic_features`	A character vector of column names that identify features that change through time but which are not lagged (e.g., weekday or year). If `type = "forecast"` and `method = "direct"`, these features will receive `NA` values, though they can be filled in by the user after running this function.
`groups`	A character vector of column names that identify the groups/hierarchies when multiple time series are present. These columns are used as model features but are not lagged. Note that combining feature lags with grouped time series will result in `NA` values throughout the data.
`static_features`	For grouped time series only. A character vector of column names that identify features that do not change through time. These columns are not lagged. If `type = "forecast"`, these features will be filled forward using the most recent value for the group.
`predict_future`	When `type = "forecast"`, a function for predicting the future values of any dynamic features. This function takes `data` and `dates` as positional arguments and returns a data.frame with (a) one or more rows, (b) an "index" column of future dates, (c) group columns if needed, and (d) 1 or more columns with name(s) in `dynamic_features`.
`use_future`	Boolean. If `TRUE`, the `future.apply` package is used for creating lagged data.frames. `multisession` or `multicore` futures are especially useful for (a) grouped time series with many groups and (b) high-dimensional datasets with many lags per feature. Run `future::plan(future::multiprocess)` prior to this function to set up multisession or multicore parallel dataset creation.
`keep_rows`	Boolean. For non-grouped time series, keep the `1:max(lookback)` rows at the beginning of the time series. These rows will contain missing values for lagged features that "look back" before the start of the dataset.
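
The warning about regularly spaced data in the `data` argument follows from lags being counted in dataset rows rather than in calendar time. The base R sketch below is an illustration only, not the package's implementation; `lag_by_rows()` and the toy monthly series are hypothetical names introduced here. It shows how a dropped month makes a 1-row lag silently span 2 months.

```r
# Hypothetical helper: shift a vector down by k rows, padding the start with NA.
lag_by_rows <- function(x, k) {
  stopifnot(k >= 1, k < length(x))
  c(rep(NA, k), head(x, -k))
}

# A complete monthly series: January through May.
full <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "1 month", length.out = 5),
  y    = c(10, 20, 30, 40, 50)
)
full$y_lag_1 <- lag_by_rows(full$y, 1)

# The same series with the March row missing.
gappy <- full[full$date != as.Date("2020-03-01"), c("date", "y")]
gappy$y_lag_1 <- lag_by_rows(gappy$y, 1)

# In April, the complete series lags to March (30);
# the gappy series lags to February (20).
print(full[full$date == as.Date("2020-04-01"), "y_lag_1"])
print(gappy[gappy$date == as.Date("2020-04-01"), "y_lag_1"])
```

The leading `NA` in each lagged column is the kind of row that `keep_rows = TRUE` retains at the start of a non-grouped series. Filling the missing row first, for instance with `fill_gaps`, restores the intended alignment.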

Value

An S3 object of class 'lagged_df' or 'grouped_lagged_df': A list of data.frames with new columns for the lagged/non-lagged features. For method = "direct", the length of the returned list is equal to the number of forecast horizons and is in the order of horizons supplied to the horizons argument. Horizon-specific datasets can be accessed with my_lagged_df$horizon_h where 'h' gives the forecast horizon. For method = "multi_output", the length of the returned list is 1. Horizon-specific datasets can be accessed with my_lagged_df$horizon_1_3_5 where "1_3_5" represents the forecast horizons passed in horizons.
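
As a short illustration of the access pattern described above, the sketch below builds a `direct` training object from the packaged `data_seatbelts` dataset and pulls out one horizon by name and by position. It assumes `forecastML` is installed and skips the work otherwise; the variable names are arbitrary.

```r
if (requireNamespace("forecastML", quietly = TRUE)) {
  data("data_seatbelts", package = "forecastML")

  horizons <- c(1, 12)
  data_train <- forecastML::create_lagged_df(
    data_seatbelts, type = "train", method = "direct", outcome_col = 1,
    horizons = horizons, lookback = 1:15
  )

  # One list element per horizon, in the order given to `horizons`.
  print(names(data_train))
  print(length(data_train) == length(horizons))

  # Access a horizon-specific dataset by name or by position.
  by_name <- data_train$horizon_12
  by_position <- data_train[[2]]
  print(identical(by_name, by_position))
}
```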

The contents of the returned data.frames are as follows:

type = 'train', non-grouped: A data.frame with the outcome and lagged/dynamic features.
type = 'train', grouped: A data.frame with the outcome and unlagged grouping columns followed by lagged, dynamic, and static features.
type = 'forecast', non-grouped: (1) An 'index' column giving the row index or date of the forecast periods (e.g., a 100 row non-date-based training dataset would start with an index of 101). (2) A 'horizon' column that indicates the forecast period from 1:max(horizons). (3) Lagged features identical to the 'train', non-grouped dataset.
type = 'forecast', grouped: (1) An 'index' column giving the date of the forecast periods. The first forecast date for each group is the maximum date from the dates argument + 1 * frequency, where frequency is the user-supplied date/time frequency. (2) A 'horizon' column that indicates the forecast period from 1:max(horizons). (3) Lagged, static, and dynamic features identical to the 'train', grouped dataset.
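
To make the 'index' and 'horizon' columns concrete, the base R sketch below computes the forecast dates for a monthly series and then wraps the same date logic in a function shaped the way the `predict_future` argument describes: `data` and `dates` as positional arguments, returning a data.frame with an "index" column and a dynamic feature. The function name `predict_month()`, the `month` feature, the fixed 3-period lookahead, and the hard-coded "1 month" step are all assumptions made for illustration; none of this is code from the package.

```r
# Training dates: 24 monthly observations ending in December 2020.
dates <- seq(as.Date("2019-01-01"), by = "1 month", length.out = 24)
freq <- "1 month"
horizons <- c(1, 3)

# Forecast periods start at max(dates) + 1 * frequency and run to max(horizons).
# seq() includes the start date itself, so drop it with [-1].
index <- seq(max(dates), by = freq, length.out = max(horizons) + 1)[-1]
forecast_frame <- data.frame(index = index, horizon = seq_len(max(horizons)))
print(forecast_frame)

# Hypothetical `predict_future`-style function: derives a calendar feature
# for the future dates. `data` is unused here but kept as the first
# positional argument to match the documented calling convention.
predict_month <- function(data, dates) {
  future <- seq(max(dates), by = "1 month", length.out = 4)[-1]
  data.frame(index = future, month = as.integer(format(future, "%m")))
}

print(predict_month(NULL, dates))
```

For grouped time series the returned data.frame would also need the group columns, as the argument description notes; they are omitted here to keep the sketch short.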

Attributes

names: The horizon-specific datasets that can be accessed with my_lagged_df$horizon_h.
type: Either train, for training dataset(s), or forecast, for forecasting dataset(s).
method: direct or multi_output.
horizons: Forecast horizons measured in dataset rows.
outcome_col: The column index of the target being forecasted.
outcome_cols: If method = multi_output, the column indices of the multiple outputs in the transformed dataset.
outcome_name: The name of the target being forecasted.
outcome_names: If method = multi_output, the column names of the multiple outputs in the transformed dataset. The names take the form "outcome_name_h" where 'h' is a horizon passed in horizons.
predictor_names: The predictor or feature names from the input dataset.
row_indices: The row.names() of the output dataset. For non-grouped datasets, the first lookback + 1 rows are removed from the beginning of the dataset to remove NA values in the lagged features.
date_indices: If dates are given, the vector of dates.
frequency: If dates are given, the date/time frequency.
data_start: min(row_indices) or min(date_indices).
data_stop: max(row_indices) or max(date_indices).
groups: If groups are given, a vector of group names.
class: grouped_lagged_df, lagged_df, list
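
The attributes listed above can be read with base R's `attr()` or `attributes()`. The sketch below assumes `forecastML` is installed and that the attribute names on the returned object match this list; it is a quick inspection pattern rather than an official accessor API.

```r
if (requireNamespace("forecastML", quietly = TRUE)) {
  data("data_seatbelts", package = "forecastML")

  data_train <- forecastML::create_lagged_df(
    data_seatbelts, type = "train", outcome_col = 1,
    horizons = c(1, 12), lookback = 1:15
  )

  # Read individual attributes by name. exact = TRUE avoids partial matching
  # between similarly named attributes such as outcome_col and outcome_cols.
  print(attr(data_train, "type", exact = TRUE))
  print(attr(data_train, "horizons", exact = TRUE))
  print(attr(data_train, "outcome_name", exact = TRUE))

  # Or list every attribute name attached to the object.
  print(names(attributes(data_train)))
}
```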

Methods and related functions

The output of create_lagged_df() is passed into

create_windows

and has the following generic S3 methods

summary
plot

Examples

# Sampled Seatbelts data from the R package datasets.
data("data_seatbelts", package = "forecastML")
#------------------------------------------------------------------------------
# Example 1 - Training data for 2 horizon-specific models w/ common lags per predictor.
horizons <- c(1, 12)
lookback <- 1:15

data <- data_seatbelts

data_train <- create_lagged_df(data_seatbelts, type = "train", outcome_col = 1,
                               horizons = horizons, lookback = lookback)
head(data_train[[length(horizons)]])

# Example 1 - Forecasting dataset
# The last 'nrow(data_seatbelts) - horizon' rows are automatically used from data_seatbelts.
data_forecast <- create_lagged_df(data_seatbelts, type = "forecast", outcome_col = 1,
                                  horizons = horizons, lookback = lookback)
head(data_forecast[[length(horizons)]])

#------------------------------------------------------------------------------
# Example 2 - Training data for one 3-month horizon model w/ unique lags per predictor.
horizons <- 3
# One vector of lags per column in data_seatbelts, in column order.
lookback_control <- list(c(3, 6, 9, 12), c(4:12), c(6:15), c(8))

data_train <- create_lagged_df(data_seatbelts, type = "train", outcome_col = 1,
                               horizons = horizons, lookback_control = lookback_control)
head(data_train[[length(horizons)]])

[Package forecastML version 0.9.0 Index]