dataPreprocess {fscaret}

R Documentation

dataPreprocess

Description

The preprocessing is carried out in two main steps:

Check for near zero variance predictors. A predictor is flagged as near zero variance if:
  1. the percentage of unique values is less than 20
  2. the ratio of the most frequent to the second most frequent value is greater than 20
Check for susceptibility to multicollinearity:
  1. Calculate the correlation matrix
  2. Find variables with a correlation of 0.9 or more and delete them
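
The two steps above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper names `flag_near_zero_var` and `drop_correlated` are hypothetical, the sketch assumes that both near zero variance conditions must hold at once (as in caret's `nearZeroVar`), that the absolute value of the correlation is compared with the cutoff, and that of two highly correlated columns the later one is dropped.

```r
# Hypothetical helper: flag near zero variance columns of a numeric matrix.
# The thresholds (20 and 20) are taken from the description above;
# requiring BOTH conditions is an assumption of this sketch.
flag_near_zero_var <- function(x, unique_cut = 20, freq_cut = 20) {
  apply(x, 2, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    pct_unique <- 100 * length(counts) / length(col)
    # A column with a single distinct value gets an infinite ratio
    freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
    pct_unique < unique_cut && freq_ratio > freq_cut
  })
}

# Hypothetical helper: drop every column whose absolute correlation with an
# earlier, still retained column is at least `cutoff` (a simplification).
drop_correlated <- function(x, cutoff = 0.9) {
  cors <- abs(cor(x))
  keep <- rep(TRUE, ncol(x))
  for (j in seq_len(ncol(x))[-1]) {
    earlier <- which(keep[seq_len(j - 1)])
    if (any(cors[j, earlier] >= cutoff)) keep[j] <- FALSE
  }
  x[, keep, drop = FALSE]
}

# Small demonstration on synthetic data
set.seed(1)
x <- cbind(a = rnorm(100), b = rnorm(100))
x <- cbind(x,
           c = 2 * x[, "a"],          # perfectly correlated with "a"
           d = c(rep(1, 99), 2))      # almost constant column

nzv <- flag_near_zero_var(x)
x_clean <- drop_correlated(x[, !nzv, drop = FALSE])
colnames(x_clean)
```

Near zero variance columns are removed first so that constant columns, for which the correlation is undefined, never reach the correlation step.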

Usage

dataPreprocess(trainMatryca_nr, testMatryca_nr, labelsFrame, lk_col, lk_row, with.labels)

Arguments

`trainMatryca_nr`	Input training data matrix
`testMatryca_nr`	Input testing data matrix
`labelsFrame`	Transposed data frame of column names
`lk_col`	Number of columns in the training data matrix
`lk_row`	Number of rows in the training data matrix
`with.labels`	Logical. If `with.labels=TRUE`, an additional data frame is returned that lists the preprocessed inputs together with their column numbers in the original data set

Author(s)

Jakub Szlek and Aleksander Mendyk

References

Kuhn M. (2008) Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5). http://www.jstatsoft.org/

Examples



library(fscaret)

# Create data sets and labels data frame
trainMatrix <- matrix(rnorm(150*120,mean=10,sd=1), 150, 120)

# Adding some near-zero variance attributes

temp1 <- matrix(runif(150,0.0001,0.0005), 150, 12)

# Adding some highly correlated attributes

sampleColIndex <- sample(ncol(trainMatrix), size=10)

temp2 <- matrix(trainMatrix[,sampleColIndex]*2, 150, 10)

# Output variable

output <- matrix(rnorm(150,mean=10,sd=1), 150, 1)

trainMatrix <- cbind(trainMatrix,temp1,temp2, output)

colnames(trainMatrix) <- paste("X",c(1:ncol(trainMatrix)),sep="")

# Subset test data set

testMatrix <- trainMatrix[sample(nrow(trainMatrix), round(0.1*nrow(trainMatrix))),]

labelsDF <- data.frame("Labels"=paste("X",c(1:(ncol(trainMatrix)-1)),sep=""))

lk_col <- ncol(trainMatrix)
lk_row <- nrow(trainMatrix)

with.labels <- TRUE

testRes <- dataPreprocess(trainMatrix, testMatrix,
                          labelsDF, lk_col, lk_row, with.labels)

summary(testRes)

# Selected attributes after data set preprocessing
testRes$labelsDF

# Training and testing data sets after preprocessing
testRes$trainMatryca
testRes$testMatryca

[Package fscaret version 0.9.4.4 Index]