findFeatures {featurefinder}R Documentation

findFeatures

Description

Perform analysis of residuals grouped by factor to identify features which explain the target variable

Usage

findFeatures(outputPath = "NoPath", fcsv, exclusionVars, factorToNumericList,
  treeGenerationMinBucket = 20, treeSummaryMinBucket = 50,
  treeSummaryResidualThreshold = 0,
  treeSummaryResidualMagnitudeThreshold = 0, doAllFactors = TRUE,
  maxFactorLevels = 20, useSubDir = TRUE, tempDirFolderName = "")

Arguments

outputPath

A string containing the location of the input csv file. Results are also stored in this location. Set to "NoPath" to use tempdir() or leave blank

fcsv

A string containing the name of a csv file

exclusionVars

A string consisting of a list of variable names with double quotes around each variable

factorToNumericList

A list of variable names as strings

treeGenerationMinBucket

Desired minimum number of data points per leaf (default 20)

treeSummaryMinBucket

Minimum number of data points in each leaf for the summary (default 50)

treeSummaryResidualThreshold

Minimum residual in the summary (default 0 for positive residuals)

treeSummaryResidualMagnitudeThreshold

Minimum residual magnitude in the summary (default 0 i.e. no restriction)

doAllFactors

Flag to indicate whether to analyse the levels of all factor variables (default TRUE)

maxFactorLevels

Maximum number of levels per factor before it is converted to numeric (default 20)

useSubDir

Flag to specify whether the partition trees should be saved in the current directory or a subdirectory

tempDirFolderName

specify a subfolder name if writing multiple scans to the temporary directory

Value

outputPath returns the location of the output for reference in addFeatures and for any other purpose. Saves residual CART trees and associated highlighted residuals for each to the path provided.

Examples


require(featurefinder)
data(futuresdata)
data=futuresdata
data$SMIfactor=paste("smi",as.matrix(data$SMIfactor),sep="")
n=length(data$DAX)
nn=floor(length(data$DAX)/2)

# Can we predict the relative movement of DAX and SMI?
data$y=data$DAX*0 # initialise the target to 0
data$y[1:(n-1)]=((data$DAX[2:n])-(data$DAX[1:(n-1)]))/
  (data$DAX[1:(n-1)])-(data$SMI[2:n]-(data$SMI[1:(n-1)]))/(data$SMI[1:(n-1)])

# Fit a simple model
thismodel=lm(formula=y ~ .,data=data)
expected=predict(thismodel,data)
actual=data$y
residual=actual-expected
data=cbind(data,expected, actual, residual)

CSVPath=tempdir()
fcsv=paste(CSVPath,"/futuresdata.csv",sep="")
write.csv(data[(nn+1):(length(data$y)),],file=fcsv,row.names=FALSE)
exclusionVars="\"residual\",\"expected\", \"actual\",\"y\""
factorToNumericList=c()

# Now the dataset is prepared, try to find new features
findFeatures(outputPath="NoPath", fcsv, exclusionVars,factorToNumericList,                     
         treeGenerationMinBucket=50,
         treeSummaryMinBucket=20,
         useSubDir=FALSE)  
         
newfeat1=((data$SMIfactor==0) & (data$CAC < 2253) & (data$CAC< 1998) & (data$CAC>=1882)) * 1.0
newfeat2=((data$SMIfactor==1) & (data$SMI < 7837) & (data$SMI >= 7499)) * 1.0
newfeatures=cbind(newfeat1, newfeat2) # create columns for the newly found features
datanew=cbind(data,newfeatures)
thismodel=lm(formula=y ~ .,data=datanew)
expectednew=predict(thismodel,datanew)

requireNamespace("Metrics")
OriginalRMSE = Metrics::rmse(data$y,expected)
NewRMSE = Metrics::rmse(data$y,expectednew)

print(paste("OriginalRMSE = ",OriginalRMSE))
print(paste("NewRMSE = ",NewRMSE))

[Package featurefinder version 1.1 Index]