R: Regression for one or more samples, given some training data.

bayesRegressAssign {mmb}

R Documentation

Regression for one or more samples, given some training data.

Description

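This method performs a regression for the target feature of every sample given in dfValid. By default (simple = FALSE), it uses full-dependency Bayesian inferencing. Depending on `simple`, the regression value is assigned using either `mmb::bayesRegressSimple()` or `mmb::bayesRegress()`.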

Usage

bayesRegressAssign(
  dfTrain,
  dfValid,
  targetCol,
  selectedFeatureNames = c(),
  shiftAmount = 0.1,
  retainMinValues = 2,
  doEcdf = FALSE,
  online = 0,
  simple = FALSE,
  useParallel = NULL,
  numBuckets = ceiling(log2(nrow(df))),
  sampleFromAllBuckets = TRUE,
  regressor = NULL
)

Arguments

`dfTrain`	data.frame that holds the training data.
`dfValid`	data.frame that holds the validation samples, for each of which a probability is sought. By convention, if you attempt to assign a probability to a numeric value, that value must be present in the target column of this data frame; otherwise, the target column is not required in it.
`targetCol`	character the name of the targeted feature, i.e., the feature to assign a probability to.
`selectedFeatureNames`	character defaults to an empty vector, in which case all available features are used. Use this to select subsets of features and to order features.
`shiftAmount`	numeric an offset value used to increase any one probability (factor) in the fully built equation.
`retainMinValues`	integer the minimum number of data points to require when segmenting the data feature by feature.
`doEcdf`	default FALSE a boolean to indicate whether to use the empirical CDF to return a probability when inferencing a continuous feature.
`online`	default 0 integer to indicate how many rows should be used for inferencing. If zero, only the initially given data.frame dfTrain is used. If > 0, each inferenced sample is appended to it and the resulting data.frame is truncated to this number of rows. Use an integer large enough (i.e., the sum of training and validation rows) to keep all samples during inferencing. A smaller value, e.g., the number of rows in dfTrain, keeps the amount of data restricted by discarding older rows. A value larger than the number of rows in dfTrain is also fine; dfTrain will grow to it and then discard rows.
`simple`	default FALSE boolean to indicate whether to use simple Bayesian inferencing instead of full. This is faster, but the results are worse. If TRUE, uses `mmb::bayesRegressSimple()`. Otherwise, uses `mmb::bayesRegress()`.
`useParallel`	default NULL boolean that is forwarded to the underlying function `mmb::bayesRegress()` (only when simple = FALSE).
`numBuckets`	integer the number of buckets to use for discretization. Buckets are built in an equidistant manner, not as quantiles (i.e., one bucket likely holds a different number of values than another).
`sampleFromAllBuckets`	default TRUE boolean to indicate how to obtain values for regression from the buckets. If TRUE, takes values from those buckets with a non-zero probability, according to their probability. If FALSE, selects all values from the bucket with the highest probability.
`regressor`	Function that is given the collected values for regression and is thus finally used to select a most likely value. Defaults to the built-in estimator for the empirical PDF and returns its argmax. However, any other function can be used, too, such as min, max, median, average etc. You may also use this function to obtain the raw values for further processing.
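
The interplay of `numBuckets`, `sampleFromAllBuckets` and `regressor` may be easier to follow with a small base-R sketch. It does not call into mmb and should not be read as the package's implementation: the bucket probabilities are invented, the two sampling strategies are only one possible reading of the descriptions above, and `pdfArgmax` is a hypothetical stand-in for the built-in estimator.

```r
# Illustration only: plain base R, not the mmb implementation.
x <- iris$Sepal.Length[1:120]            # pretend these are training targets

# Bucket count analogous to the documented default, here for 120 rows.
numBuckets <- ceiling(log2(length(x)))   # 7

# Equidistant buckets: equal width, so the counts per bucket differ.
bucket <- cut(x, breaks = numBuckets, labels = FALSE)
counts <- tabulate(bucket, nbins = numBuckets)

# Hypothetical bucket probabilities, invented for this sketch. Empty
# buckets get probability zero before re-normalizing.
probs <- c(0.00, 0.05, 0.15, 0.40, 0.30, 0.10, 0.00)
probs[counts == 0] <- 0
probs <- probs / sum(probs)

# One reading of sampleFromAllBuckets = TRUE: draw values from all buckets
# with non-zero probability, weighted by the bucket's probability.
set.seed(1)
weights <- probs[bucket] / counts[bucket]
fromAll <- sample(x, size = 50, replace = TRUE, prob = weights)

# One reading of sampleFromAllBuckets = FALSE: take every value from the
# single most probable bucket.
best <- which.max(probs)
fromBest <- x[bucket == best]

# Hypothetical regressor: argmax of a kernel density estimate.
pdfArgmax <- function(values) {
  d <- stats::density(values)
  d$x[which.max(d$y)]
}

estimate <- pdfArgmax(fromAll)
rawValues <- identity(fromAll)   # a regressor may also just return the values
```

A function such as `pdfArgmax`, `median` or `identity` would then be passed via the `regressor` argument.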

Author(s)

Sebastian Hönel sebastian.honel@lnu.se

Examples


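df <- iris[, ]
set.seed(84735)
# Shuffle the row names so that the split is not ordered by Species.
rn <- base::sample(rownames(df), 150)
dfTrain <- df[rn[1:120], ]
dfValid <- df[rn[121:150], ]
# The target column is not required in the validation data.
res <- mmb::bayesRegressAssign(
  dfTrain, dfValid[, !(colnames(dfValid) %in% "Sepal.Length")],
  "Sepal.Length", sampleFromAllBuckets = TRUE, doEcdf = TRUE)
# Compare the regressed values against the held-out true values.
cov(res, dfValid$Sepal.Length)^2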
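
The truncation behaviour described for the `online` argument can be pictured as a rolling window over the training data. The following is a minimal base-R sketch of that idea only; `appendAndTruncate` is a hypothetical helper and not part of mmb, and the rows appended here are plain iris rows rather than inferenced samples.

```r
# Illustration only: a rolling training window, not mmb code.
appendAndTruncate <- function(dfTrain, newSample, online) {
  if (online <= 0) {
    return(dfTrain)               # online = 0: training data stays fixed
  }
  grown <- rbind(dfTrain, newSample)
  utils::tail(grown, online)      # keep only the newest `online` rows
}

train <- iris[1:5, ]
for (i in 6:10) {
  train <- appendAndTruncate(train, iris[i, ], online = 7)
}
nrow(train)   # 7
```

Starting from five rows, the window first grows to seven rows and then discards the oldest row for every new one, which mirrors the "grow to it and then discard rows" description.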