preprocess {zebu}    R Documentation

Preprocess data

Description

Subroutine called by lassie. Discretizes, subsets, and removes missing data from a data.frame.

Usage

preprocess(x, select, continuous, breaks, default_breaks = 4)

Arguments

x

data.frame or matrix.

select

optional vector of column numbers or column names specifying a subset of data to be used. By default, uses all columns.

continuous

optional vector of column numbers or column names specifying continuous variables that should be discretized. By default, assumes that every variable is categorical.

breaks

numeric vector or list passed on to cut to discretize continuous variables. When a numeric vector is specified, the same break points are applied to all continuous variables. To specify variable-specific breaks, use a list whose names are column names (not column numbers) and whose values are the breaks for the corresponding variables. If a continuous variable has no specified breaks, default_breaks is applied.
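
As a rough illustration of the two forms of breaks, the base-R sketch below shows how cut treats a single number versus a vector of break points, and how a named list with a fallback could be resolved per variable. The helper resolve_breaks is hypothetical and written only for this illustration; it is not part of zebu and may differ from what preprocess does internally.

```r
# Base-R sketch only; this is not zebu's internal code.

# A single number asks cut() for that many equal-width intervals.
girth_bins <- cut(trees$Girth, breaks = 3)

# A numeric vector gives explicit break points.
height_bins <- cut(trees$Height, breaks = c(60, 70, 80, 90))

# Hypothetical helper: pick the breaks for one variable.
# A list is looked up by column name, falling back to default_breaks;
# anything else is used as-is for every variable.
resolve_breaks <- function(name, breaks, default_breaks = 4) {
  if (is.list(breaks)) {
    if (name %in% names(breaks)) breaks[[name]] else default_breaks
  } else {
    breaks
  }
}

# 'Girth' has its own breaks; 'Height' falls back to the default.
var_breaks <- list(Girth = c(8, 12, 16, 21))
girth_breaks  <- resolve_breaks("Girth", var_breaks)
height_breaks <- resolve_breaks("Height", var_breaks)
```

Looking breaks up by name rather than by position mirrors the requirement that list names be column names.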

default_breaks

default break points used to discretize continuous variables that have no breaks specified. Same syntax as the breaks argument of cut.

Value

List containing the following values:

Examples

# This is what happens behind the curtains in the 'lassie' function
# Here we compute the association between the 'Girth' and 'Height' variables
# of the 'trees' dataset

# 'select' and 'continuous' take column numbers or names
select <- c('Girth', 'Height') # select subset of trees
continuous <- c(1, 2) # both 'Girth' and 'Height' are continuous

# equal-width discretization with 3 bins
breaks <- 3

# Preprocess data: subset, discretize and remove missing data
pre <- preprocess(trees, select, continuous, breaks)

# Estimate marginal and multivariate probabilities from the preprocessed data.frame
prob <- estimate_prob(pre$pp)

# Compute local and global association using Ducher's Z
lam <- local_association(prob, measure = 'z')

[Package zebu version 0.2.2.0 Index]