dataPreprocess {fscaret} | R Documentation
dataPreprocess
Description
The function preprocesses the data in two main steps:
1. Check for near zero variance predictors. A predictor is flagged as near zero variance if:
   - the percentage of unique values is less than 20%, and
   - the ratio of the most frequent to the second most frequent value is greater than 20.
2. Check for susceptibility to multicollinearity:
   - calculate the correlation matrix;
   - find variables with a correlation of 0.9 or more and delete them.
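The two checks described above can be sketched in base R. This is only an illustration of the idea under stated assumptions, not the fscaret implementation: the helper names `is_near_zero_var` and `drop_correlated` are hypothetical, both near zero variance conditions are assumed to be required together, and the choice of which member of a correlated pair to drop (the later column) is an assumption as well.

```r
# Hypothetical helper: flag a column as near zero variance using the two
# thresholds quoted in the description (both conditions must hold here).
is_near_zero_var <- function(x, unique_cut = 20, freq_cut = 20) {
  counts <- sort(table(x), decreasing = TRUE)
  pct_unique <- 100 * length(counts) / length(x)
  # With a single distinct value there is no second most frequent value
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique < unique_cut && freq_ratio > freq_cut
}

# Hypothetical helper: walk the columns left to right and keep a column only
# if its absolute correlation with every column kept so far is below cutoff.
# Assumes no column is constant (cor() would return NA for it).
drop_correlated <- function(m, cutoff = 0.9) {
  cm <- abs(cor(m))
  kept <- integer(0)
  for (j in seq_len(ncol(m))) {
    if (all(cm[j, kept] < cutoff)) kept <- c(kept, j)
  }
  kept
}

# Toy data: b is a multiple of a, c is almost constant, d is independent noise
set.seed(1)
a <- rnorm(100)
m <- cbind(a = a, b = 2 * a, c = c(1, rep(0, 99)), d = rnorm(100))

nzv <- apply(m, 2, is_near_zero_var)   # step 1: near zero variance flags
m_nzv <- m[, !nzv, drop = FALSE]
kept <- drop_correlated(m_nzv)         # step 2: correlation filter
colnames(m_nzv)[kept]
```

In this toy data set, column `c` is removed by the first check and column `b` by the second, because it is perfectly correlated with `a`.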
Usage
dataPreprocess(trainMatryca_nr, testMatryca_nr, labelsFrame, lk_col, lk_row, with.labels)
Arguments
trainMatryca_nr | Input training data matrix
testMatryca_nr | Input testing data matrix
labelsFrame | Transposed data frame of column names
lk_col | Number of columns in the training data matrix
lk_row | Number of rows in the training data matrix
with.labels | If with.labels=TRUE, the output also includes a data frame that maps the preprocessed inputs to their column numbers in the original data set
Author(s)
Jakub Szlek and Aleksander Mendyk
References
Kuhn M. (2008) Building Predictive Models in R Using the caret Package. Journal of Statistical Software 28(5). http://www.jstatsoft.org/
Examples
library(fscaret)
# Create data sets and labels data frame
trainMatrix <- matrix(rnorm(150*120,mean=10,sd=1), 150, 120)
# Adding some near-zero variance attributes
temp1 <- matrix(runif(150,0.0001,0.0005), 150, 12)
# Adding some highly correlated attributes
sampleColIndex <- sample(ncol(trainMatrix), size=10)
temp2 <- matrix(trainMatrix[,sampleColIndex]*2, 150, 10)
# Output variable
output <- matrix(rnorm(150,mean=10,sd=1), 150, 1)
trainMatrix <- cbind(trainMatrix,temp1,temp2, output)
colnames(trainMatrix) <- paste("X",c(1:ncol(trainMatrix)),sep="")
# Subset test data set: draw 10% of the rows at random from the whole matrix
testMatrix <- trainMatrix[sample(nrow(trainMatrix), round(0.1*nrow(trainMatrix))),]
labelsDF <- data.frame("Labels"=paste("X",c(1:(ncol(trainMatrix)-1)),sep=""))
lk_col <- ncol(trainMatrix)
lk_row <- nrow(trainMatrix)
with.labels <- TRUE
testRes <- dataPreprocess(trainMatrix, testMatrix,
                          labelsDF, lk_col, lk_row, with.labels)
summary(testRes)
# Selected attributes after data set preprocessing
testRes$labelsDF
# Training and testing data sets after preprocessing
testRes$trainMatryca
testRes$testMatryca