R: Split, Merge, and Filter Given Datasets for the Subsequent...

arrange.data {gainML}

R Documentation

Split, Merge, and Filter Given Datasets for the Subsequent Analysis

Description

Generates datasets that consist of measurements from the REF, CTR-b, and CTR-n turbines only. Filters the datasets by eliminating data points with a missing measurement and, optionally, those with negative power output. Generates training and test datasets for k-fold cross validation (CV) and splits the entire data into period 1 data and period 2 data.
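The filtering and period split described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the `toy` data frame and its values are hypothetical, and the sketch assumes that a period covers the half-open interval from its beginning date up to, but not including, its end date, as the `p1.end` argument description suggests.

```r
# Hypothetical toy data: 10-minute records with a gap and a negative reading.
toy <- data.frame(
  time  = c("2014-10-24 00:00:00", "2014-10-24 00:10:00",
            "2014-10-26 23:50:00", "2014-10-27 00:00:00",
            "2014-10-27 00:10:00"),
  power = c(120, NA, -5, 300, 310),
  stringsAsFactors = FALSE
)

# Parse the time stamps and the period boundaries in a fixed time zone.
ts     <- as.POSIXct(toy$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
p1.beg <- as.POSIXct("2014-10-24", format = "%Y-%m-%d", tz = "UTC")
p1.end <- as.POSIXct("2014-10-27", format = "%Y-%m-%d", tz = "UTC")

# Keep complete rows with non-negative power (mirrors neg.power = FALSE).
keep <- complete.cases(toy) & toy$power >= 0

# Assumption: period 1 is the half-open interval [p1.beg, p1.end).
in.p1 <- ts >= p1.beg & ts < p1.end
per1  <- toy[keep & in.p1, ]
```

Here only the first record survives in `per1`: the second has a missing measurement, the third has negative power, and the last two fall on or after the period 1 end date.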

Usage

arrange.data(df1, df2, df3, p1.beg, p1.end, p2.beg, p2.end,
  time.format = "%Y-%m-%d %H:%M:%S", k.fold = 5, col.time = 1,
  col.turb = 2, bootstrap = NULL, free.sec = NULL,
  neg.power = FALSE)

Arguments

`df1`	A dataframe for reference turbine data. This dataframe must include five columns: timestamp, turbine id, wind direction, power output, and air density.
`df2`	A dataframe for baseline control turbine data. This dataframe must include four columns: timestamp, turbine id, wind speed, and power output.
`df3`	A dataframe for neutral control turbine data. This dataframe must include four columns and have the same structure as `df2`.
`p1.beg`	A string specifying the beginning date of period 1. By default, the value needs to be specified in `'%Y-%m-%d'` format, for example, `'2014-10-24'`. A user can use a different format as long as it is consistent with the format defined in `time.format` below.
`p1.end`	A string specifying the end date of period 1. For example, if the value is `'2015-10-24'`, data observed until `'2015-10-23 23:50:00'` would be considered for period 1.
`p2.beg`	A string specifying the beginning date of period 2.
`p2.end`	A string specifying the end date of period 2. Defined in the same way as `p1.end`.
`time.format`	A string describing the format of time stamps used in the data to be analyzed. The default value is `'%Y-%m-%d %H:%M:%S'`.
`k.fold`	An integer defining the number of data folds for the period 1 analysis and prediction. In the period 1 analysis, `k`-fold cross validation (CV) will be applied to choose the optimal set of covariates that results in the least prediction error. The value of `k.fold` corresponds to the `k` of the `k`-fold CV. The default value is 5.
`col.time`	An integer specifying the column number of time stamps in wind turbine datasets. The default value is 1.
`col.turb`	An integer specifying the column number of turbine IDs in wind turbine datasets. The default value is 2.
`bootstrap`	An integer indicating the current replication (run) number of bootstrap. If set to `NULL`, bootstrap is not applied. The default is `NULL`. Users are advised not to set this value directly to run bootstrap; instead, use `bootstrap.gain`.
`free.sec`	A list of vectors defining free sectors. Each vector in the list has two scalars: one for the starting direction and another for the ending direction, ordered clockwise. For example, a vector of `c(310, 50)` is a valid component of the list. By default, this is set to `NULL`.
`neg.power`	Either `TRUE` or `FALSE`, indicating whether to keep data points with a negative power output in the analysis. The default value is `FALSE`, i.e., data points with negative power output are eliminated.
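Because free sectors are ordered clockwise, a sector such as `c(310, 50)` crosses north and wraps around 360 degrees. The sketch below shows one way such a test could be written in base R. The helper `in.sector` is hypothetical and is not part of gainML; treating both sector boundaries as inclusive is also an assumption.

```r
# Hypothetical helper: TRUE where a direction (in degrees) lies in a sector
# that runs clockwise from sec[1] to sec[2]. Inclusive bounds are an assumption.
in.sector <- function(dir, sec) {
  dir <- dir %% 360
  if (sec[1] <= sec[2]) {
    dir >= sec[1] & dir <= sec[2]
  } else {
    # The sector crosses north, e.g. c(310, 50) covers 310-360 and 0-50.
    dir >= sec[1] | dir <= sec[2]
  }
}

free.sec <- list(c(310, 50), c(150, 260))
wind.dir <- c(0, 45, 100, 200, 300, 350)

# A direction counts as free if it falls in at least one sector of the list.
hits    <- sapply(free.sec, function(s) in.sector(wind.dir, s))
is.free <- rowSums(hits) > 0
```

With these inputs, the directions 100 and 300 fall outside both sectors, while the others fall inside one of them.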

Value

The function returns a list of datasets, including the following:

train: A list containing k datasets that will be used to train the machine learning model.
test: A list containing k datasets that will be used to test the machine learning model.
per1: A dataframe containing the period 1 data.
per2: A dataframe containing the period 2 data.
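To make the shape of `train` and `test` concrete, the following base R sketch builds two lists of `k` data frames from a stand-in dataset. It is a generic random k-fold split written for illustration only; how `arrange.data` actually assigns rows to folds is not stated here and may differ. The `per1` data frame and its columns are hypothetical.

```r
set.seed(1)        # for a reproducible fold assignment
k.fold <- 5

# Hypothetical stand-in for the period 1 data.
per1 <- data.frame(id = 1:23, y = rnorm(23))

# Assign each row to one of k folds of nearly equal size, in random order.
fold <- sample(rep(seq_len(k.fold), length.out = nrow(per1)))

# Fold i is held out for testing; the remaining rows form the training set.
test  <- lapply(seq_len(k.fold), function(i) per1[fold == i, ])
train <- lapply(seq_len(k.fold), function(i) per1[fold != i, ])
```

Each pair `train[[i]]` and `test[[i]]` partitions the rows of `per1`, and every row appears in exactly one test set.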

Examples

df.ref <- with(wtg, data.frame(time = time, turb.id = 1, wind.dir = D, power = y,
 air.dens = rho))
df.ctrb <- with(wtg, data.frame(time = time, turb.id = 2, wind.spd = V, power = y))
df.ctrn <- df.ctrb
df.ctrn$turb.id <- 3

# For Full Sector Analysis
data <- arrange.data(df.ref, df.ctrb, df.ctrn, p1.beg = '2014-10-24', p1.end = '2014-10-27',
 p2.beg = '2014-10-27', p2.end = '2014-10-30')

# For Free Sector Analysis
free.sec <- list(c(310, 50), c(150, 260))
data <- arrange.data(df.ref, df.ctrb, df.ctrn, p1.beg = '2014-10-24', p1.end = '2014-10-27',
 p2.beg = '2014-10-27', p2.end = '2014-10-30', free.sec = free.sec)

length(data$train) # This equals k.fold.
length(data$test)  # This equals k.fold.
head(data$per1)    # This shows the beginning of the period 1 dataset.
head(data$per2)    # This shows the beginning of the period 2 dataset.

[Package gainML version 0.1.0 Index]