pred.forestRK {forestRK}    R Documentation

Make predictions on the test data based on the forestRK model constructed from the training data

Description

Makes predictions on the test dataset based on the forestRK model constructed from the training dataset.

Note that the test data points in test.prediction.df.list, pred.for.obs.forest.rk, and num.pred.for.obs.forest.rk are re-ordered by increasing original index number (the original rownames) of those test observations. So if you shuffled the data before separating them into a training and a test set, the order in which the data points appear under test.prediction.df.list, pred.for.obs.forest.rk, and num.pred.for.obs.forest.rk may not be the same as the shuffled order of your original test set.
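
To compare predictions against the true labels after a shuffled split, the true labels can be put into the same increasing-rowname order. The following is a minimal sketch in base R, assuming the rownames of the test set are still the original integer row indices; the variable names are illustrative and not part of forestRK.

```r
## Illustrative only: shuffle iris, split, then sort the true test labels
## into increasing original-row-index order.
set.seed(1)
shuffled <- iris[sample(nrow(iris)), ]
test.set <- shuffled[101:150, ]            # last 50 shuffled rows as a test set

## Assumption: rownames are the original row indices stored as character strings
orig.index <- as.numeric(rownames(test.set))
ord        <- order(orig.index)

## True labels re-ordered the way the description says the predictions are ordered
y.test.sorted <- test.set$Species[ord]
sorted.index  <- orig.index[ord]
```

With the labels in this order, an expression such as mean(as.character(pred.forest.rk$pred.for.obs.forest.rk) == as.character(y.test.sorted)) would give the test accuracy, provided the ordering described above holds for your data.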

Calling this function internally loads the package rapportools; this allows the is.boolean method to be used to check one of the stopping criteria at the beginning.

The basic mechanism behind pred.forestRK function is the following:

When the function is called, it first calls the forestRK function with the user-specified training data to generate the forestRK object. It then uses the pred.treeRK function to make predictions on the test observations from each individual tree in the forestRK object. Once the individual predictions from every tree have been obtained for all of the test observations, the function stores them in one large data frame and collapses the results by majority vote. For example, if class 'A' is the class type most frequently predicted for the m-th test observation by the rkTrees in the forest, then pred.forestRK assigns class 'A' as the predicted class type for that m-th test observation.
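
The majority-vote step can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the votes data frame and the majority.vote helper are hypothetical, and the tie-breaking rule shown here may differ from the one forestRK uses.

```r
## Hypothetical per-tree predictions: rows = test observations, columns = trees.
## Values are numericized class types (1, 2, 3).
votes <- data.frame(tree1 = c(1, 2, 3, 1),
                    tree2 = c(1, 2, 2, 3),
                    tree3 = c(2, 2, 3, 3))

## Majority vote for one observation: the most frequent class among its votes.
## Ties go to the smallest class value here (table() sorts its names and
## which.max() returns the first maximum); forestRK's tie rule may differ.
majority.vote <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

forest.pred <- apply(votes, 1, majority.vote)
forest.pred   # 1 2 3 3
```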

Usage

  pred.forestRK(x.test = data.frame(), x.training = data.frame(),
                y.training = c(), y.factor.levels,
                min.num.obs.end.node.tree = 5,
                nbags, samp.size, entropy = TRUE)

Arguments

x.test

a numericized data frame of covariates of the data points on which we want to make our predictions (typically the test observations); x.test can be obtained by applying the x.organizer() function. x.test should contain no NA or NaN's.

x.training

a numericized data frame of covariates of the data points from which we build our forestRK model (typically the training observations); x.training can be obtained by applying the x.organizer() function. x.training should contain no NA or NaN's.

y.training

a vector that stores numericized class types of the training data points; y.training should contain no NA or NaN's.

min.num.obs.end.node.tree

the minimum number of observations that we want each end node of our rkTree to contain. Default is set to 5.

nbags

the number of bootstrap samples that we want to generate to form a forest-RK.

samp.size

the number of data points that we want each of our bootstrap samples to contain.

y.factor.levels

a vector of the original names of all class types that the user considers in their study (can be obtained via y.organizer()$y.factor.levels).

entropy

TRUE if we use entropy as the splitting criterion; FALSE if we use the Gini index as the splitting criterion. Default is set to TRUE.
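
For readers unfamiliar with the two splitting criteria, the sketch below computes the textbook entropy and Gini impurity of a node in base R. The helper names are hypothetical, and the base-2 logarithm is an assumption; this page does not state how forestRK computes these quantities internally.

```r
## Textbook impurity measures for a vector of class labels.
## Assumption: base-2 logarithm for entropy; forestRK's internal choice is not stated here.
class.proportions <- function(y) as.numeric(table(y)) / length(y)

entropy.impurity <- function(y) {
  p <- class.proportions(y)
  p <- p[p > 0]                  # drop empty classes so that 0 * log(0) is treated as 0
  -sum(p * log2(p))
}

gini.impurity <- function(y) {
  p <- class.proportions(y)
  1 - sum(p^2)
}

y.node <- c(1, 1, 1, 2)          # hypothetical node: three points of class 1, one of class 2
entropy.impurity(y.node)
gini.impurity(y.node)            # 1 - (0.75^2 + 0.25^2) = 0.375
```

Both measures are 0 for a pure node and grow as the classes in the node become more mixed.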

Value

A list containing the following items:

x.test

the original test dataset that we used to make predictions.

df.of.predictions.for.all.observations

a data frame storing the predicted (numericized) class type of every test observation from each tree in the forestRK model; each row of this data frame pertains to an individual test observation, and each column pertains to a specific tree in the forest.

forest.rk

a forestRK object that was generated at the beginning of the function call.

test.prediction.df.list

a list of data frames, one per tree in the forestRK model, each storing the prediction.df of the test observations for that tree (the data frame that can be obtained via pred.treeRK()$prediction.df). Note that the test data points in test.prediction.df.list are re-ordered by the increasing original observation index number.

pred.for.obs.forest.rk

a vector that stores the actual predicted class labels of the test observations instead of their numericized (integer) class types. Note that the test data points in pred.for.obs.forest.rk are re-ordered by the increasing original observation index number.

num.pred.for.obs.forest.rk

the numericized version of pred.for.obs.forest.rk. Note that the test data points in num.pred.for.obs.forest.rk are re-ordered by the increasing original observation index number.
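
The relationship between the numericized predictions and the class labels can be pictured with the base R sketch below. It assumes that the numericized class types are integer codes 1..K that index into y.factor.levels, in the same way as.integer() codes index the levels of a base R factor; that encoding is an assumption and should be checked against the output of y.organizer() for your data. The num.pred vector is hypothetical.

```r
## Assumption: numericized class types are integer codes 1..K that index into
## y.factor.levels, analogous to as.integer() on a base R factor.
y.factor.levels <- levels(iris$Species)    # "setosa" "versicolor" "virginica"
num.pred        <- c(1, 3, 2, 2)           # hypothetical numericized predictions

## Look up the label for each integer code
label.pred <- y.factor.levels[num.pred]
label.pred   # "setosa" "virginica" "versicolor" "versicolor"
```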

Author(s)

Hyunjin Cho, h56cho@uwaterloo.ca; Rebecca Su, y57su@uwaterloo.ca

See Also

pred.treeRK, forestRK

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  x.test <- x.organizer(iris[,1:4], encoding = "num")[c(26:50,76:100,126:150),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  y.factor.levels <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.factor.levels

  ## make prediction from a random forest RK model
  ## typically nbags and samp.size should be much larger than 30 and 50
  pred.forest.rk <- pred.forestRK(x.test = x.test, x.training = x.train,
                                  y.training = y.train,
                                  y.factor.levels = y.factor.levels,
                                  min.num.obs.end.node.tree = 6,
                                  nbags = 30, samp.size = 50, entropy = FALSE)
  pred.forest.rk$test.prediction.df.list[[10]]
  pred.forest.rk$pred.for.obs.forest.rk # etc....

[Package forestRK version 0.0-5 Index]