cutoff.node.and.covariate.index.finder {forestRK}

R Documentation

Identifies optimal cutoff point of an impure node for splitting after applying the `rk` (Random K) algorithm.

Description

Identifies the optimal cutoff point of an impure dataset for splitting after applying the rk (Random K) algorithm, in terms of Entropy or Gini Index.

For example, suppose the function returns a cutoff.value of 2.5, a covariate.ind of 4, and a cutoff.node of 23. This tells the user that, if the node under consideration is to be split, the split should occur on the 4th covariate (whose actual name is the name of the 4th column of the original dataset) at the value 2.5. The left child node would then be the group of observations whose 4th covariate value is less than or equal to 2.5, and the right child node would be the group of observations whose 4th covariate value is greater than 2.5. The splitting point corresponds to the 23rd observation of the node.
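The partition described above can be sketched in base R. This is a minimal illustration only: the data frame is made up, and the values of `covariate.ind` and `cutoff.value` are assumed to match the example in the text rather than taken from a real call to the function.

```r
# Toy node with four covariates (made-up values, not from any real dataset).
x.node <- data.frame(
  v1 = c(5.1, 4.9, 6.3, 5.8, 7.1, 6.0),
  v2 = c(3.5, 3.0, 3.3, 2.7, 3.0, 2.2),
  v3 = c(1.4, 1.4, 6.0, 5.1, 5.9, 5.0),
  v4 = c(1.0, 2.5, 3.1, 0.4, 2.6, 2.5)
)

# Values assumed for illustration, mirroring the example in the text.
covariate.ind <- 4
cutoff.value  <- 2.5

# Observations with covariate value <= cutoff go left; the rest go right.
go.left     <- x.node[[covariate.ind]] <= cutoff.value
left.child  <- x.node[go.left, ]
right.child <- x.node[!go.left, ]

names(x.node)[covariate.ind]  # name of the splitting covariate
nrow(left.child)              # size of the left child node
nrow(right.child)             # size of the right child node
```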

This function internally loads the packages partykit and rapportools: partykit is used to generate the object split.record.optimal, and rapportools provides the is.boolean method, which is used to validate one of the stopping criteria.

This function is run internally by the construct.treeRK function.

Usage

 cutoff.node.and.covariate.index.finder(x.node = data.frame(),
                                        y.new.node = c(), entropy = TRUE)

Arguments

`x.node`	a numericized data frame of the covariates of the observations in a particular node prior to the split (can be obtained by applying `x.organizer()`); `x.node` should contain no `NA` or `NaN` values.
`y.new.node`	a vector storing the numericized class types of the observations in a particular node before the split (can be obtained by applying `y.organizer()`); `y.new.node` should contain no `NA` or `NaN` values.
`entropy`	`TRUE` if Entropy is used as the splitting criterion; `FALSE` if Gini Index is used as the splitting criterion. Default is set to `TRUE`.
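To make the two splitting criteria concrete, here is a minimal sketch of the textbook Entropy and Gini Index of a vector of class labels, and of the size-weighted impurity of a candidate split. The helpers `node.impurity()` and `split.impurity()` are hypothetical names introduced for illustration; they are not part of forestRK, and the package's internal formula (for instance the logarithm base or the weighting) may differ.

```r
# Hypothetical helper (not part of forestRK): impurity of a vector of class labels.
node.impurity <- function(y, entropy = TRUE) {
  p <- as.numeric(table(y)) / length(y)  # class proportions
  p <- p[p > 0]                          # drop empty classes so log2() stays finite
  if (entropy) -sum(p * log2(p)) else 1 - sum(p^2)
}

# Hypothetical helper: size-weighted impurity of the two children of a split.
split.impurity <- function(y, go.left, entropy = TRUE) {
  n <- length(y)
  (sum(go.left)  * node.impurity(y[go.left],  entropy) +
   sum(!go.left) * node.impurity(y[!go.left], entropy)) / n
}

y <- c(1, 1, 2, 2)                  # two classes, evenly mixed
node.impurity(y, entropy = TRUE)    # Entropy of the node
node.impurity(y, entropy = FALSE)   # Gini Index of the node
split.impurity(y, go.left = c(TRUE, TRUE, FALSE, FALSE))  # a split into pure children
```

Under these definitions, a lower weighted impurity indicates a better split, and a split into pure children scores zero under either criterion.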

Value

A list containing the following items:

`cutoff.value`	the value at which the optimal split should take place.
`cutoff.node`	the index of the observation (observation number) at which the optimal split should occur.
`covariate.ind`	numeric index of the covariate at which the optimal split should occur.
`split.record.optimal`	the `kidid_split` output of the optimal split.
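One way to picture how the first three returned items relate to each other is a brute-force search over candidate covariates and cutoffs. The sketch below is an assumption-laden illustration, not forestRK's implementation: `best.split.sketch()` and `gini()` are hypothetical helpers, `candidate.cols` stands in for whatever subset of covariates the rk algorithm selects (this page does not describe that step), and `cutoff.node` is interpreted here as the row index of the observation whose covariate value serves as the cutoff.

```r
# Hypothetical helper: Gini Index of a vector of class labels.
gini <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  1 - sum(p^2)
}

# Hypothetical brute-force search (NOT forestRK's implementation).
# Tries every observed value of every candidate covariate as a cutoff and
# keeps the split with the lowest size-weighted Gini Index.
best.split.sketch <- function(x.node, y.node,
                              candidate.cols = seq_along(x.node)) {
  n <- length(y.node)
  best <- list(score = Inf, covariate.ind = NA,
               cutoff.value = NA, cutoff.node = NA)
  for (j in candidate.cols) {
    for (i in seq_len(n)) {
      cut  <- x.node[[j]][i]
      left <- x.node[[j]] <= cut
      if (all(left)) next  # skip cutoffs that leave the right child empty
      score <- (sum(left)  * gini(y.node[left]) +
                sum(!left) * gini(y.node[!left])) / n
      if (score < best$score) {
        best <- list(score = score, covariate.ind = j,
                     cutoff.value = cut, cutoff.node = i)
      }
    }
  }
  best
}

# Made-up node: the second covariate separates the two classes perfectly.
x.node <- data.frame(u = c(5, 1, 4, 2, 3, 6), v = c(1, 2, 3, 10, 11, 12))
y.node <- c(1, 1, 1, 2, 2, 2)

res <- best.split.sketch(x.node, y.node)
res$covariate.ind
res$cutoff.value
res$cutoff.node
```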

Author(s)

Hyunjin Cho, h56cho@uwaterloo.ca; Rebecca Su, y57su@uwaterloo.ca

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  ## find the optimal split of the node, using the Gini Index (entropy = FALSE)
  res <- cutoff.node.and.covariate.index.finder(x.train, y.train,
                                               entropy=FALSE)
  res$cutoff.value
  res$cutoff.node
  res$covariate.ind
  res$split.record.optimal

[Package forestRK version 0.0-5 Index]
