ssr {ssr}    R Documentation
Fits a semi-supervised regression model
Description
This function implements the co-training by committee and self-learning semi-supervised regression algorithms with a set of n base regressors specified by the user. When only one model is present in the list of regressors, self-learning is performed.
Usage
ssr(theFormula, L, U, regressors = list(lm = lm, knn = caret::knnreg),
regressors.params = NULL, pool.size = 20, gr = 1, maxits = 20,
testdata = NULL, shuffle = TRUE, verbose = TRUE,
plotmetrics = FALSE, U.y = NULL)
Arguments
theFormula
a formula that specifies the response variable and the covariates.
L
a data frame that contains the initial labeled training set.
U
a data frame that contains the unlabeled data. If the provided data frame has the response variable as one of its columns, it will be discarded.
regressors
a list of custom functions and/or strings naming the regression models to be used. The strings must contain a valid name of a regression model from the 'caret' package; the list of available 'caret' regression models can be found at https://topepo.github.io/caret/available-models.html. Functions must be named, e.g., list(linearModel = lm). Defaults to list(lm = lm, knn = caret::knnreg).
regressors.params
a list of lists that specifies the parameters for each custom function; entries are matched to regressors positionally, with NULL meaning default parameters (see the sketch after this list). For 'caret' models specified as strings in regressors, parameters cannot be passed this way. Defaults to NULL.
pool.size
specifies the number of candidate elements to be sampled from the unlabeled set U. Defaults to 20.
gr
an integer specifying the growth rate, i.e., how many of the best elements from the pool are added to the training set for each base model at each iteration. Defaults to 1.
maxits
an integer that specifies the maximum number of iterations. The training phase will terminate either when maxits is reached or when U runs out of data points. Defaults to 20.
testdata
a data frame containing the test set to be evaluated within each iteration. If provided, performance metrics on this set are computed at each iteration and can be printed (see verbose) or plotted (see plotmetrics). Defaults to NULL.
shuffle
a boolean specifying whether or not to shuffle the data frame rows before training the models. Defaults to TRUE.
verbose
a boolean specifying whether or not to print diagnostic information to the console within each iteration. If testdata is provided, performance on the test set is printed as well. Defaults to TRUE.
plotmetrics
a boolean that specifies if performance metrics should be plotted for each iteration when testdata is provided. Defaults to FALSE.
U.y
an optional numeric vector with the true values of the response variable for the unlabeled set U. Defaults to NULL.
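As an illustration of how regressors and regressors.params fit together, consider the following sketch (the k = 7 value and the linearModel name are only examples; the positional matching with NULL entries follows the pattern shown in the package vignettes):

# Two base models: a custom function (lm) and a caret-based function.
regressors <- list(linearModel = lm, knn = caret::knnreg)
# One entry per regressor; NULL keeps the defaults for lm, k = 7 for knnreg.
regressors.params <- list(NULL, list(k = 7))
# model <- ssr("Ytrue ~ .", L, U, regressors = regressors,
#              regressors.params = regressors.params)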
Details
The co-training by committee implementation is based on Hady et al. (2009). It consists of a set of n base models (the committee), each initially trained with independent bootstrap samples from the labeled training set L. The Out-of-Bag (OOB) elements are used for validation. The training set for each base model b is augmented by selecting the most relevant elements from the unlabeled data set U. To determine the most relevant elements for each base model b, the other models (excluding b) label a set of pool.size points sampled from U by taking the average of their predictions. For each newly labeled data point, the base model b is trained with its current labeled training data plus the new data point, and the error on its OOB validation data is computed. The top gr points that reduce the error the most are kept, used to augment the labeled training set of b, and removed from U.
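The following self-contained R sketch illustrates this pool-selection step on toy data. It is only an illustration of the idea, not the package's internal code: a simple holdout set val stands in for the OOB validation data, and a single extra linear model labels the pool instead of an averaged committee.

set.seed(1)
n <- 100
df <- data.frame(x = runif(n))
df$y <- 2 * df$x + rnorm(n, sd = 0.1)

L <- df[1:20, ]                     # labeled training set of model b
val <- df[81:100, ]                 # stands in for b's OOB validation data
U <- df[21:80, "x", drop = FALSE]   # unlabeled set (no response column)
pool.size <- 10
gr <- 1

# Sample a pool of candidates from U; another committee member
# (here trained on a bootstrap sample of L) labels them.
pool.idx <- sample(nrow(U), pool.size)
pool <- U[pool.idx, , drop = FALSE]
other <- lm(y ~ x, data = L[sample(nrow(L), replace = TRUE), ])
pool$y <- predict(other, newdata = pool)

# Score each candidate: retrain b with the candidate added and
# measure the error on the validation data.
errs <- sapply(seq_len(nrow(pool)), function(i) {
  m <- lm(y ~ x, data = rbind(L, pool[i, ]))
  sqrt(mean((predict(m, newdata = val) - val$y)^2))
})

# Keep the gr candidates that reduce the error the most.
best <- order(errs)[seq_len(gr)]
L <- rbind(L, pool[best, ])                # augment b's training set
U <- U[-pool.idx[best], , drop = FALSE]    # remove them from U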
When the regressors list contains a single model, self-learning is performed. In this case, the base model labels its own data points, as opposed to co-training by committee, in which the data points for a given model are labeled by the other models.
In the original paper, Hady et al. (2009) use the same type of regressor for all the base models but with different parameters to introduce diversity. The ssr function allows the user to specify any type of regressor as a base model: models from the 'caret' package, models from other packages, or custom functions. Models from other packages or custom functions need to comply with a certain structure. First, the model's training function must take a formula as its first parameter and have a parameter named data that accepts a data frame as the training set. Second, the predict() function must take the trained model as its first parameter. Most models from other libraries follow this pattern; if a model does not, it can still be used by writing a wrapper function. A minimal wrapper is sketched below; for examples of all these cases, see the vignettes.
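For instance, the following wrapper fixes a hyperparameter while complying with the required structure (the name knn7 and the value k = 7 are only illustrative):

# A custom regressor: formula as the first parameter, a 'data' parameter,
# and a model whose predict() method takes the fitted object first.
knn7 <- function(theFormula, data) {
  caret::knnreg(theFormula, data = data, k = 7)
}
# Pass it as a named list element:
regressors <- list(knn7 = knn7)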
Value
A list object of class "ssr" containing:
models A list of the final trained models in the last iteration.
formula The formula provided by the user in theFormula.
regressors The list of initial regressors set by the user, with formatted names.
regressors.names The names of the regressors: names(regressors).
regressors.params The initial list of parameters provided by the user.
pool.size The initial pool.size specified by the user.
gr The initial gr specified by the user.
testdata A boolean indicating if test data was provided by the user: !is.null(testdata).
U.y A boolean indicating if U.y was provided by the user: !is.null(U.y).
numits The total number of iterations performed by the algorithm.
shuffle The initial shuffle value specified by the user.
valuesRMSE A numeric vector with the Root Mean Squared Error (RMSE) on the testdata for each iteration. Its length is the number of iterations + 1; the first position, valuesRMSE[1], stores the initial RMSE before using any data from U.
valuesRMSE.all A numeric matrix storing the RMSE of the individual regression models. The number of rows is the number of iterations + 1 and the number of columns is the number of regressors. Each column represents a regressor, in the same order as provided in regressors; each row stores the RMSE for one iteration. The first row stores the initial RMSE before using any data from U.
valuesMAE Stores Mean Absolute Error (MAE) information; structured like valuesRMSE.
valuesMAE.all Stores the MAE of the individual regression models; structured like valuesRMSE.all.
valuesCOR Stores Pearson correlation information; structured like valuesRMSE.
valuesCOR.all Stores the Pearson correlation of the individual regression models; structured like valuesRMSE.all.
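For example, given a fitted object model like the one in the Examples below, these fields can be inspected directly:

model$numits               # number of iterations performed
model$valuesRMSE           # RMSE on testdata per iteration (first entry: before using U)
model$valuesRMSE.all[, 1]  # RMSE of the first regressor across iterations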
References
Hady, M. F. A., Schwenker, F., & Palm, G. (2009). Semi-supervised Learning for Regression with Co-training by Committee. In International Conference on Artificial Neural Networks (pp. 121-130). Springer, Berlin, Heidelberg.
Examples
library(ssr)
dataset <- friedman1 # Load the friedman1 dataset included in the package.
set.seed(1234)
# Split the dataset into 70% for training and 30% for testing.
split1 <- split_train_test(dataset, pctTrain = 70)
# Use 5% of the training set as the labeled set L; the rest becomes the unlabeled set U.
split2 <- split_train_test(split1$trainset, pctTrain = 5)
L <- split2$trainset
U <- split2$testset[, -11] # Remove the labels.
testset <- split1$testset
# Define list of regressors. Here, only one regressor (KNN). This trains a self-learning model.
# For co-training by committee, add more regressors to the list. See the vignettes for examples.
regressors <- list(knn = caret::knnreg)
# Fit the model.
model <- ssr("Ytrue ~ .", L, U, regressors = regressors, testdata = testset, maxits = 10)
# Plot RMSE.
plot(model)
# Get the predictions on the testset.
predictions <- predict(model, testset)
# Calculate RMSE on the test set.
sqrt(mean((predictions - testset$Ytrue)^2))
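# Mean Absolute Error and Pearson correlation on the test set,
# mirroring the metrics stored in the returned object:
mean(abs(predictions - testset$Ytrue))
cor(predictions, testset$Ytrue)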