brif.trainpredict {brif} | R Documentation |

## Train a model and use it to predict new cases

### Description

If the model is built to predict just one test data set (`newdata`), use this function instead of the `brif` and `predict.brif` pipeline. Passing the model object between the training and prediction functions by saving and loading the `brif` object takes a substantial amount of time, and `brif.trainpredict` avoids those steps. The `brif` function invokes this function automatically when the `newdata` argument is supplied to it.
If a GPU is used for training (CUDA = 1 or 2), the total execution time of this function includes writing and reading temporary data files. To see the timing of the different steps, use verbose = 1.
Note: Training on a GPU shortens training time only when the training data has an extremely large number of rows, e.g., over 1 million. Even in such cases, CUDA = 2 (hybrid mode) is recommended over CUDA = 1 (force using GPU).
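
The two routes described above can be sketched as follows. This is a rough illustration, assuming the brif package is installed and that `brif()` and `predict()` accept the arguments shown; the variable names (`train`, `test`, `scores1`, `scores2`) are made up for the example.

```r
# Sketch only: the brif calls run only when the package is available.
set.seed(1)
train_rows <- sample(seq_len(nrow(iris)), 75)
test_rows  <- setdiff(seq_len(nrow(iris)), train_rows)

# The target (Species) must be the first column of the training data.
train <- iris[train_rows, c(5, 1:4)]
test  <- iris[test_rows, 1:4]

if (requireNamespace("brif", quietly = TRUE)) {
  # Route 1: train, keep the model object, then predict from it.
  model   <- brif::brif(train)
  scores1 <- predict(model, newdata = test, type = "score")

  # Route 2: train and predict in a single call; no model object is kept.
  scores2 <- brif::brif.trainpredict(train, test, type = "score")
}
```

Route 1 is the better fit when the model will be reused on several data sets; route 2 is intended for the one-off case.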

### Usage

```
## S3 method for class 'trainpredict'
brif(
x,
newdata,
type = c("score", "class"),
n_numeric_cuts = 31,
n_integer_cuts = 31,
max_integer_classes = 20,
max_depth = 20,
min_node_size = 1,
ntrees = 200,
ps = 0,
max_factor_levels = 30,
seed = 0,
bagging_method = 0,
bagging_proportion = 0.9,
vote_method = 1,
split_search = 4,
search_radius = 5,
verbose = 0,
nthreads = 2,
CUDA = 0,
CUDA_blocksize = 128,
CUDA_n_lb_GPU = 20480,
cubrif_main = "cubrif_main.exe",
tmp_file_prefix = "cbf",
...
)
```

### Arguments

`x` |
a data frame containing the training data set. The first column is taken as the target variable and all other columns are used as predictors. |

`newdata` |
a data frame containing the new data to be predicted. All columns in x (except for the first column which is the target variable) must be present in newdata and the data types must match. |

`type` |
a character string specifying the prediction format. Available values include "score" and "class". Default is "score". |

`n_numeric_cuts` |
an integer value indicating the maximum number of split points to generate for each numeric variable. |

`n_integer_cuts` |
an integer value indicating the maximum number of split points to generate for each integer variable. |

`max_integer_classes` |
an integer value. If the target variable is integer and has more than max_integer_classes unique values in the training data, the target variable is grouped into max_integer_classes bins. If the target variable is numeric, it is grouped into a number of bins equal to the smaller of max_integer_classes and the number of unique values, and the regression problem is solved as a classification problem. |

`max_depth` |
an integer specifying the maximum depth of each tree. Maximum is 40. |

`min_node_size` |
an integer specifying the minimum number of training cases a leaf node must contain. |

`ntrees` |
an integer specifying the number of trees in the forest. |

`ps` |
an integer indicating the number of predictors to sample at each node split. Default is 0, meaning to use sqrt(p), where p is the number of predictors in the input. |

`max_factor_levels` |
an integer. If any factor variable has more than max_factor_levels levels, the program stops and prompts the user to increase the value of this parameter if that many levels are indeed intended. |

`seed` |
an integer specifying the seed used by the internal random number generator. Default is 0, meaning not to set a seed but to accept the set seed from the calling environment. |

`bagging_method` |
an integer indicating the bagging sampling method: 0 for sampling without replacement; 1 for sampling with replacement (bootstrapping). |

`bagging_proportion` |
a numeric scalar between 0 and 1, indicating the proportion of training observations to be used in each tree. |

`vote_method` |
an integer (0 or 1) specifying the voting method in prediction. 0: each leaf contributes the raw count and an average is taken on the sum over all leaves; 1: each leaf contributes an intra-node fraction which is then averaged over all leaves with equal weight. |

`split_search` |
an integer indicating the choice of the split search method. 0: randomly pick a split point; 1: do a local search; 2: random pick subject to regulation; 3: local search subject to regulation; 4 or above: a mix of options 0 to 3. |

`search_radius` |
a positive integer indicating the split point search radius. This parameter takes effect only in regulated search (split_search = 2 or above). |

`verbose` |
an integer (0 or 1) specifying the verbose level. |

`nthreads` |
an integer specifying the number of threads used by the program. This parameter takes effect only on systems supporting OpenMP. |

`CUDA` |
an integer (0, 1 or 2). 0: Do not use GPU. 1: Use GPU to build the forest. 2: Hybrid mode: Use GPU to split a node only when the node size is greater than CUDA_n_lb_GPU. |

`CUDA_blocksize` |
a positive integer specifying the CUDA thread block size, must be a multiple of 64 up to 1024. |

`CUDA_n_lb_GPU` |
a positive integer. The number of training cases must be greater than this number to enable GPU computing when CUDA = 2. |

`cubrif_main` |
a string containing the path and name of the cubrif executable (see https://github.com/profyliu/cubrif for how to build it). |

`tmp_file_prefix` |
a string for the path and prefix of temporary files created when CUDA is used. |

`...` |
additional arguments. |
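
The two `vote_method` options can be illustrated with a small base-R sketch. It reflects one reading of the argument description, not the package's internal code, and the leaf counts are hypothetical: each row stands for the leaf that a single test case falls into in one of three trees.

```r
# Hypothetical class counts in the leaf reached by one test case,
# one row per tree.
leaf_counts <- rbind(
  tree1 = c(a = 8, b = 2),
  tree2 = c(a = 1, b = 1),
  tree3 = c(a = 0, b = 4)
)

# vote_method = 0 (as read here): pool the raw counts, then normalise.
score_counts <- colSums(leaf_counts) / sum(leaf_counts)

# vote_method = 1 (as read here): convert each leaf to class fractions,
# then average the fractions with equal weight per leaf.
leaf_fractions  <- leaf_counts / rowSums(leaf_counts)
score_fractions <- colMeans(leaf_fractions)
```

With these numbers the pooled counts favour class `a`, while the equal-weight fractions favour class `b`, because the large leaf in `tree1` dominates the pooled sum.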

### Value

a data frame or a vector containing the prediction results. See `predict.brif` for details.

### Examples

```
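library(brif)
# Split iris 50/50 into training and validation rows
trainset <- sample(1:nrow(iris), 0.5*nrow(iris))
validset <- setdiff(1:nrow(iris), trainset)
# Target (Species, column 5) goes first in the training data
pred_score <- brif.trainpredict(iris[trainset, c(5,1:4)], iris[validset, c(1:4)], type = 'score')
# Predicted label = class with the highest score in each row
pred_label <- colnames(pred_score)[apply(pred_score, 1, which.max)]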
```
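
The score-to-label step in the example can be tried without training a model. The sketch below uses a made-up score table, assuming that `type = "score"` output has one row per case and one column per class; `truth` and `accuracy` are illustrative names, not part of the package.

```r
# Toy score table with the layout assumed above (values are invented).
pred_score <- data.frame(
  setosa     = c(0.9, 0.1, 0.2),
  versicolor = c(0.1, 0.7, 0.3),
  virginica  = c(0.0, 0.2, 0.5)
)
truth <- c("setosa", "versicolor", "versicolor")

# max.col() gives the column index of the largest entry in each row;
# ties.method = "first" mirrors the behaviour of which.max().
best_col   <- max.col(as.matrix(pred_score), ties.method = "first")
pred_label <- colnames(pred_score)[best_col]

# Share of cases whose predicted label matches the true label.
accuracy <- mean(pred_label == truth)
```

`max.col()` is a vectorised alternative to `apply(pred_score, 1, which.max)` and may be preferable on large score tables.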

[Package *brif* version 1.4.1]