
construct.treeRK {forestRK}

R Documentation

Constructs a classification tree on the (training) dataset, by implementing the RK (Random 'K') algorithm

Description

Constructs a classification tree based on the dataset of interest by implementing the RK (Random 'K') algorithm.

The package rapportools is loaded internally when this function is called, so that its is.boolean method can be used to check one of the stopping criteria at the beginning of the function. The forestRK functions called inside construct.treeRK are criteria.calculator and cutoff.node.and.covariate.index.finder.

The output of construct.treeRK is one of the arguments passed to the pred.treeRK function.

DESCRIPTIONS OF THE RETURNED VALUES:

The hierarchical flag of an rktree (construct.treeRK()$flag) is constructed in the following way:

(1) the first entry of the flag, "r", denotes the root; (2) each subsequent flag is formed by appending one character to the flag of its parent node: a trailing "x" denotes the left child node of the node represented by the characters before that "x", and a trailing "y" denotes the right child node of the node represented by the characters before that "y".

For example, the flag "rxyx" is the left child node of the node represented by "rxy".
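The naming rule can be checked mechanically. The sketch below assumes a flag is available as a plain character string; `describe.flag` is a hypothetical helper written for illustration and is not part of forestRK.

```r
# Hypothetical helper (not part of forestRK): decode a single flag string
# such as "rxyx" into its parent flag, its side, and its depth in the tree.
describe.flag <- function(flag) {
  n <- nchar(flag)
  if (n == 1) {
    # "r" alone is the root: no parent, depth 0
    return(list(parent = NA_character_, side = "root", depth = 0))
  }
  last <- substr(flag, n, n)             # trailing "x" or "y"
  list(parent = substr(flag, 1, n - 1),  # everything before the last character
       side   = if (last == "x") "left" else "right",
       depth  = n - 1)
}

info <- describe.flag("rxyx")
info$parent  # flag of the parent node
info$side    # which child of that parent this node is
```

Dropping the last character of a flag always gives the flag of the parent node, so the full path from the root can be read off one character at a time.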

x.node.list and y.new.node.list are the lists of child nodes (for x and y, respectively) of the rktree, listed in an order consistent with the order of the nodes in the rktree's hierarchical flag.

covariate.split is a matrix that lists the numericized covariate names used for the splits that construct the rktree. The first entry of covariate.split is NA, which stands for the condition at the root. The number immediately below NA is the numericized covariate name used for the first split in the rktree, the number below that is the one used for the second split, and so on. If the number listed in covariate.split is n, it refers to the n-th covariate, that is, the n-th column of the data frame x.train.

value.at.split is a vector that lists the covariate values at which each split occurred while the rktree was constructed. The first entry of value.at.split is NA, which denotes the root prior to any split. For example, if the second entry of covariate.split is 4 and the second entry of value.at.split is 0.5, then the first split of the rktree occurred on the covariate in the 4th column of the data frame x.train, and the exact criterion for that first split was (4th covariate value) <= 0.5 vs. (4th covariate value) > 0.5.
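To make the example concrete, the sketch below applies such a rule by hand to the four numeric columns of iris. The index 4 and the value 0.5 are taken from the example in the text and are not the output of an actual construct.treeRK call; the variable names are illustrative.

```r
# Illustrative only: apply the rule "(4th covariate) <= 0.5" by hand.
x <- iris[, 1:4]

covariate.index <- 4    # assumed entry of covariate.split
split.value     <- 0.5  # assumed entry of value.at.split

# Name of the covariate that the numericized index refers to
split.name <- names(x)[covariate.index]

# Observations sent to the left ("x") and right ("y") child nodes
left.node  <- x[x[, covariate.index] <= split.value, ]
right.node <- x[x[, covariate.index] >  split.value, ]

split.name
c(left = nrow(left.node), right = nrow(right.node))
```

Every observation falls into exactly one of the two child nodes, so the two row counts add up to the size of the parent node.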

amt.decrease.criteria is a matrix that lists the decrease in the splitting criterion (Entropy or Gini Index) after each split. The first entry of amt.decrease.criteria is NA, which denotes the condition at the root (no split). For example, if the second entry of amt.decrease.criteria is 0.91 and entropy was set to TRUE, then the first split decreased the Entropy of the original node by 0.91.
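As a rough illustration of what such a decrease measures, the sketch below computes it by hand for a node with three equally sized classes. It assumes the textbook definitions (base-2 Entropy, Gini Index as one minus the sum of squared class proportions, and a decrease equal to the parent's criterion minus the size-weighted average of the children's); whether criteria.calculator uses exactly these conventions is not confirmed here, and all function names below are hypothetical.

```r
# Textbook impurity measures (assumed conventions, not forestRK internals)
entropy <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}
gini <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  1 - sum(p^2)
}

# Decrease = parent impurity minus size-weighted impurity of the children
decrease <- function(parent, left, right, criterion) {
  n <- length(parent)
  criterion(parent) -
    (length(left) / n * criterion(left) + length(right) / n * criterion(right))
}

# Toy node: 25 observations in each of three classes; the split
# isolates class 1 in the left child.
y     <- rep(1:3, each = 25)
left  <- y[y == 1]
right <- y[y != 1]

dec.entropy <- decrease(y, left, right, entropy)  # log2(3) - 2/3
dec.gini    <- decrease(y, left, right, gini)     # 2/3 - 1/3
```

Under these assumptions the Entropy decrease for this toy split is log2(3) - 2/3, which is about 0.92, the same order of magnitude as the 0.91 used in the example above.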

num.obs is a matrix that stores the number of observations contained in each parent node prior to its split; the matrix starts with the entry NA to reflect the condition at the root. The 2nd entry of num.obs gives the number of observations in the parent node on which the 1st split took place while the rktree was built; the 3rd entry gives the number of observations in the parent node on which the 2nd split took place, and so on.

Usage

 construct.treeRK(x.train = data.frame(), y.new.train = c(),
                  min.num.obs.end.node.tree = 5, entropy = TRUE)

Arguments

`x.train`	a numericized data frame of covariates of the data on which we want to build our rktree model (typically the training data); this data frame can be obtained by applying the `x.organizer` function. `x.train` should contain no `NA` or `NaN` values.
`y.new.train`	the numericized class types of the observations in the dataset on which we want to build our rktree model (typically the training data). `y.new.train` should contain no `NA` or `NaN` values.
`min.num.obs.end.node.tree`	the minimum number of observations that we want each end node of our rktree to contain. Default is 5.
`entropy`	`TRUE` if Entropy is used as the splitting criterion; `FALSE` if Gini Index is used as the splitting criterion. Default is `TRUE`.
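The input requirements above can be verified before a tree is built. The following is a minimal sketch in base R; `check.inputs` is a hypothetical helper that is not part of forestRK, and the requirement that `x.train` has one row per element of `y.new.train` is an assumption made for this illustration.

```r
# Hypothetical pre-flight check (not part of forestRK)
check.inputs <- function(x.train, y.new.train) {
  stopifnot(is.data.frame(x.train))
  # every covariate must already be numericized
  stopifnot(all(vapply(x.train, is.numeric, logical(1))))
  # is.na() is TRUE for both NA and NaN
  stopifnot(!any(is.na(x.train)))
  stopifnot(is.numeric(y.new.train), !any(is.na(y.new.train)))
  # assumed: one class label per row of x.train
  stopifnot(nrow(x.train) == length(y.new.train))
  invisible(TRUE)
}

# Passes: four numeric covariates and integer-coded class labels
ok <- check.inputs(iris[, 1:4], as.integer(iris$Species))
```

Passing the full iris data frame instead would fail the check, because its fifth column is a factor rather than a numeric covariate.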

Value

A list containing the following items:

`covariate.names`	a vector of the names of all covariates that we consider in our model.
`l`	length of the hierarchical flag.
`x.node.list`	a list containing the series of child nodes produced from the numericized data frame `x.train` as the `rktree` model was built.
`y.new.node.list`	a list containing the series of child nodes produced from the numericized vector of class types `y.new.train` as the `rktree` model was built.
`flag`	hierarchical flag that characterizes each split in the `rktree`.
`covariate.split`	a matrix that lists numericized covariates used for each split as the `rktree` was built.
`value.at.split`	a vector that lists the values at which each node of the `rktree` was split.
`amt.decrease.criteria`	a matrix that lists the decrease in the splitting criterion after each split as the `rktree` was built.
`num.obs`	a matrix that stores the number of observations contained in each parent node right before each split.

Author(s)

Hyunjin Cho, h56cho@uwaterloo.ca Rebecca Su, y57su@uwaterloo.ca

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  # Construct a tree
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  tree.entropy <- construct.treeRK(x.train, y.train)
  tree.gini <- construct.treeRK(x.train, y.train,
                                min.num.obs.end.node.tree = 6, entropy = FALSE)
  tree.entropy$covariate.names
  tree.gini$flag # ...etc...

[Package forestRK version 0.0-5 Index]