fill_gaps {forecastML}    R Documentation

Prepare a dataset for modeling by filling in temporal gaps in data collection

Description

To create a modeling dataset with temporally correct feature lags, the entry function in forecastML, create_lagged_df(), needs an evenly spaced time series with no gaps in data collection. fill_gaps() prepares such a dataset. It takes a data.frame with (a) dates, (b) the outcome being forecasted, and, optionally, (c) dynamic features that change through time, (d) group columns for multiple time series modeling, and (e) static or non-dynamic features for multiple time series modeling, and it returns a data.frame with rows evenly spaced in time. Specifically, the function adds rows to the input dataset and, in those rows, fills in (a) dates, (b) grouping information, and (c) static features. The (a) outcome and (b) dynamic features will be NA for any missing time periods; these NA values can be left as-is, imputed by the user, or removed from modeling in the user-supplied modeling wrapper function for train_model().
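The idea of filling gaps can be illustrated with base R alone. The following is a minimal sketch for a single, ungrouped daily series; it is not the package's implementation, and the toy data and the object names (data_gaps, all_dates, data_no_gaps) are hypothetical.

```r
# Hypothetical toy data: a daily series with two missing days (Jan 3 and Jan 4).
data_gaps <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-05")),
  outcome = c(10, 12, 15)
)

# Build the complete, evenly spaced date index from the first to the last date.
all_dates <- data.frame(
  date = seq(min(data_gaps$date), max(data_gaps$date), by = "1 day")
)

# A left join onto the full index adds one row per missing date;
# the outcome is NA in those added rows.
data_no_gaps <- merge(all_dates, data_gaps, by = "date", all.x = TRUE)

print(data_no_gaps)
```

The added rows carry a date but no outcome, which mirrors the behavior described above for the outcome and dynamic features.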

Usage

fill_gaps(data, date_col = 1, frequency, groups = NULL, static_features = NULL)

Arguments

data

A data.frame or object coercible to a data.frame with, minimally, dates and the outcome being forecasted.

date_col

The integer column index of the date column. This column should have class 'Date' or 'POSIXt'.

frequency

Date/time frequency. A string taking the same input as base::seq.Date(..., by = "frequency") or base::seq.POSIXt(..., by = "frequency"), e.g., '1 hour', '1 month', '7 days', '10 years', etc. The highest frequency supported at present is '1 sec'.
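To see how such frequency strings behave, the base R seq() methods for dates and date-times can be called directly. This short sketch only shows how base R interprets the strings; the start dates and object names are arbitrary choices for illustration.

```r
# seq() dispatches to seq.Date() for objects of class 'Date'.
start_date <- as.Date("2020-01-01")
weekly <- seq(start_date, by = "7 days", length.out = 3)
monthly <- seq(start_date, by = "1 month", length.out = 3)

# seq() dispatches to seq.POSIXt() for date-times.
start_time <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
hourly <- seq(start_time, by = "1 hour", length.out = 3)

print(weekly)
print(monthly)
print(hourly)
```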

groups

Optional. A character vector of column names that identify the unique time series (i.e., groups/hierarchies) when multiple time series are present.

static_features

Optional. For grouped time series only. A character vector of column names that identify features that do not change through time. These columns are expected to be used as model features but are not lagged (e.g., a ZIP code column). The most recent values for each static feature for each group are used to fill in the resulting missing data in static features when new rows are added to the dataset.
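The static feature fill can be pictured with a small base R sketch. This is an illustration of the rule under stated assumptions, not the package's code: the toy data, the helper most_recent_value(), and the reading of "most recent" as the last non-missing value within each group (after sorting by date) are all assumptions.

```r
# Hypothetical toy data after rows were added for missing dates: rows 2 and 5
# are new rows whose static feature 'lat' has not been filled in yet.
data_grouped <- data.frame(
  buoy_id = c("a", "a", "a", "b", "b", "b"),
  date = as.Date(rep(c("2020-01-01", "2020-01-02", "2020-01-03"), times = 2)),
  lat = c(40.1, NA, 40.1, 42.5, NA, 42.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the last non-missing value in a date-sorted vector.
most_recent_value <- function(x) {
  observed <- x[!is.na(x)]
  observed[length(observed)]
}

# Sort by group and date so that "last" means "most recent".
data_grouped <- data_grouped[order(data_grouped$buoy_id, data_grouped$date), ]

# ave() recycles each group's most recent value across that group's rows.
fill_values <- ave(data_grouped$lat, data_grouped$buoy_id, FUN = most_recent_value)

# Only the missing entries are overwritten; observed values are kept.
is_missing <- is.na(data_grouped$lat)
data_grouped$lat[is_missing] <- fill_values[is_missing]

print(data_grouped)
```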

Value

An object of class 'data.frame': The returned data.frame has the same number of columns and the same column order as the input, but with additional rows to account for gaps in data collection. For grouped data, any new rows added to the returned data.frame will appear between the minimum (oldest) date for that group and the maximum (most recent) date across all groups. If the user-supplied forecasting algorithm(s) cannot handle missing outcome values or missing dynamic features, these should either be imputed prior to create_lagged_df() or filtered out in the user-supplied modeling function for train_model().
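The two ways of handling the resulting NA values can be sketched in base R. Both parts are illustrative assumptions rather than package code: linear interpolation is just one possible imputation, and model_function() with its outcome_col argument is a hypothetical wrapper whose exact signature should be checked against the train_model() documentation.

```r
# Hypothetical gap-filled data: the outcome is NA for the two added days.
data_no_gaps <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "1 day", length.out = 5),
  outcome = c(10, 12, NA, NA, 15)
)

# Option 1: impute before creating lags, here with linear interpolation.
# stats::approx() ignores the NA pairs and interpolates at every date.
data_imputed <- data_no_gaps
data_imputed$outcome <- stats::approx(
  x = as.numeric(data_no_gaps$date),
  y = data_no_gaps$outcome,
  xout = as.numeric(data_no_gaps$date)
)$y

# Option 2: keep the NA rows in the dataset and drop them only at training
# time, inside a (hypothetical) user-supplied modeling function.
model_function <- function(data, outcome_col = 1) {
  # Remove rows with a missing outcome before fitting.
  data <- data[!is.na(data[[outcome_col]]), , drop = FALSE]
  model_formula <- stats::as.formula(paste(names(data)[outcome_col], "~ ."))
  # lm() also drops rows with missing predictors (na.action = na.omit).
  stats::lm(model_formula, data = data)
}

# Hypothetical lagged training data with gaps in outcome and lagged feature.
toy_train <- data.frame(
  outcome = c(10, 12, NA, NA, 15, 16),
  outcome_lag_1 = c(9, 10, 12, NA, NA, 15)
)
model <- model_function(toy_train)
```

Filtering inside the modeling function, rather than before create_lagged_df(), keeps the time series evenly spaced while the lags are being computed.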

Methods and related functions

The output of fill_gaps() is passed into create_lagged_df().

Examples

# NOAA buoy dataset with gaps in data collection
data("data_buoy_gaps", package = "forecastML")

data_buoy_no_gaps <- fill_gaps(data_buoy_gaps, date_col = 1, frequency = '1 day',
                               groups = 'buoy_id', static_features = c('lat', 'lon'))

# The returned data.frame has the same number of columns but the time-series
# are now evenly spaced at 1 day apart. Additionally, the unchanging grouping
# columns and static features columns have been filled in for the newly created dataset rows.
dim(data_buoy_gaps)
dim(data_buoy_no_gaps)

# Running create_lagged_df() is the next step in the forecastML forecasting
# process. If there are long gaps in data collection, like in this buoy dataset,
# and the user-supplied modeling algorithm cannot handle missing outcome data,
# the best option is to filter these rows out in the user-supplied modeling function
# for train_model().

[Package forecastML version 0.9.0 Index]