R: Exploratory data analysis

model.explore {ModelMap}

R Documentation

Exploratory data analysis

Description

Graphically explores the relationships between the training data and the predictor rasters.

Usage

model.explore(qdata.trainfn = NULL, folder = NULL, predList = NULL, 
predFactor = FALSE, response.name = NULL, response.type = NULL, 
response.colors = NULL, unique.rowname = NULL, OUTPUTfn = NULL, 
device.type = NULL, allow.default.graphics=FALSE, res=NULL, jpeg.res = 72, 
MAXCELL=100000, device.width = NULL, device.height = NULL, units="in", 
pointsize=12, cex=1, rastLUTfn = NULL, create.extrapolation.masks = FALSE, 
na.value = -9999, col.ramp = rainbow(101, start = 0, end = 0.5), 
col.cat = palette()[-1])

Arguments

qdata.trainfn

String. The name (full path or base name with path specified by folder) of the training data file used for building the model (file should include columns for both response and predictor variables). The file must be a comma-delimited file *.csv with column headings. qdata.trainfn can also be an R dataframe. If predictions will be made (predict = TRUE or map=TRUE) the predictor column headers must match the names of the raster layer files, or a rastLUT must be provided to match predictor columns to the appropriate raster and band. If qdata.trainfn = NULL (the default), a GUI interface prompts user to browse to the training data file.

folder

String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If folder = NULL (default), a GUI interface prompts user to browse to a folder. To use the working directory, specify folder = getwd().

predList

String. A character vector of the predictor short names used to build the model. These names must match the column names in the training/test data files and the names in column two of the rastLUT. If predList = NULL (the default), a GUI interface prompts user to select predictors from column 2 of rastLUT.

predFactor

String. A character vector of predictor short names of the predictors from predList that are factors (i.e categorical predictors). These must be a subset of the predictor names given in predList Categorical predictors may have multiple categories.

response.name

String. The name of the response variable used to build the model. If response.name = NULL, a GUI interface prompts user to select a variable from the list of column names from training data file. response.name must be column name from the training/test data files.

response.type

String. Response type: "binary", "categorical" or "continuous". Binary response must be binary 0/1 variable with only 2 categories. All zeros will be treated as one category, and everything else will be treated as the second category.

response.colors

Data frame. A two column data frame. Column names must be:category, the response categories; and, color, the colors associated with each category.

unique.rowname

String. The name of the unique identifier used to identify each row in the training data. If unique.rowname = NULL, a GUI interface prompts user to select a variable from the list of column names from the training data file. If unique.rowname = FALSE, a variable is generated of numbers from 1 to nrow(qdata) to index each row.

OUTPUTfn

String. Filename that ouput file names will be based on.

device.type

String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics.

Current choices:

			`"default"`	default graphics device
			`"jpeg"`	*.jpg files
			`"none"`	no graphics device generated
			`"pdf"`	*.pdf files
			`"png"`	*.png files
			`"postscript"`	*.ps files
			`"tiff"`	*.tif files

Note that the "default" device is disabled unless allow.default.graphics=TRUE. This is because these graphics are slow to produce, and if the onscreen graphics window is moved or closed while the function is in progress there is a risk of crashing the entire R session.

allow.default.graphics

Logical. Should the default on-screen graphics device be allowed. USE WITH CAUTION! These graphics are complicated and slow to produce. If the on-screen default graphics device is moved or closed before the plot is completed it can crash the entire R session.

res

Integer. Model validation. Pixels per inch for jpeg, png, and tiff plots. The default is 72dpi, good for on screen viewing. For printing, suggested setting is 300dpi.

jpeg.res

Integer. Graphical output. Deprecated. Ignored unless res not provided.

MAXCELL

Integer. Graphical output. The maximum number of raster cells used to create the graphical output. Rasters larger than this value will be subsampled for the graphical maps and figures. The default value of MAXCELL=100000 is generally a good resolution for onscreen viewing with the default jpeg resolution of 72dpi. Publication quality qraphics may require higher MAXCELL. Higher values require more memory and are slower to process.

Note: MAXCELL only affects graphical figures. Output rasters generated when create.extrapolation.masks=TRUE are always done on full resolution rasters.

device.width

Integer. Model validation. The device width for diagnostic plots in inches.

device.height

Integer. Model validation. The device height for diagnostic plots in inches.

units

Model validation. The units in which device.height and device.width are given. Can be "px" (pixels), "in" (inches, the default), "cm" or "mm".

pointsize

Integer. Model validation. The default pointsize of plotted text, interpreted as big points (1/72 inch) at res ppi

cex

Integer. Model validation. The cex for diagnostic plots.

rastLUTfn

String. The file name (full path or base name with path specified by folder) of a .csv file for a rastLUT. Alternatively, a dataframe containing the same information. The rastLUT must include 3 columns: (1) the full path and name of the raster file; (2) the shortname of each predictor / raster layer (band); (3) the layer (band) number. The shortname (column 2) must match the names predList, the predictor column names in training/test data set (qdata.trainfn and qdata.testfn, and the predictor names in model.obj.

Example of comma-delimited file:

`C:/button_test/tc99_2727subset.img,`	`tc99_2727subsetb1,`	`1`
`C:/button_test/tc99_2727subset.img,`	`tc99_2727subsetb2,`	`2`
`C:/button_test/tc99_2727subset.img,`	`tc99_2727subsetb3,`	`3`

create.extrapolation.masks

Logical. If TRUE then the raster brick containing the masks for all predictors from predList is saved as image file. The layers in this file will be in the same order as the predictors in predList

na.value

Value used in rasters to indicate NA. Note this value is only used for NA values in the predictor rasters. Note: all predictor rasters must use the same value for NA. NA values in the training data should be indicated with NA.

col.ramp

Color ramp to use for continuous predictors

col.cat

Vector. Vector of colors to use for categorical predictors.

Details

The model.explore function is intended to aid with preliminary data exploration before model building. It includes graphical tools to explore the relationships between the training data (both predictors and responses) as well as the predictor rasters. It uses the corrplot package to create a correlation plot of the continuous predictor. This can aid in interpreting the model.importance.plot output from the models, as Random Forest models divide importance between correlated predictors, while Stochastic Gradient Boosting models assing the majority of the importance to the correlated predictor that is used earlies in the model.

The model.explore function also can aid in identifying if additional training data is needed. For example, the maps of the extrapolation masks for the predictor rasters help spot areas of the map where the pixels lie outside the range of the training data, and therefore any model predictions will be extrapolations, and possibly unreliable. The user can decide to either collect additional training data, or mask out these areas of the final prediction output of model.mapmake.

To increase speed, the default behavior for large predictor rasters is to create the graphics from subsampled rasters. (Note: for categorical predictors, the full raster is always used to identify all categories found in the map area.) If create.extrapolation.masks=TRUE, then the full rasters are used for the extrapolation masks, regardless of size of the reasters. This option runs much slower, as large rasters need to be read into R a block at a time.

Value

Function does not return a value, but does create files.

Graphical files are created for each predictor variable, with file type determined by device.type. In addition, if create.extrapolation.masks, an extrapolation mask raster is created for each predictor as well as an overall extrapolation mask, with the value 1 for pixels with predictor values within the range of the training data, or categories found in the training data, and the value 0 for pixels outside the range of the training data, categories not found in the training data, or NA value. The overall extrapolation mask has 0 if any of the predictors for that pixel are extrapolated. Note that this option is much slower to run.

Note

The default graphics device is disabled unless allow.default.graphics is set to TRUE. These graphics can be slow to produce, and if the on screen graphics device is moved or closed while the graphic is in progress, it can crash R. It is recomended that graphics be written to a file by using jpeg, pdf, etc... device.type.

Author(s)

Elizabeth Freeman

Examples


## Not run: 

###########################################################################
############################# Run this set up code: #######################
###########################################################################

###Define training and test files:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")

###Define folder for all output:
folder=getwd()	

###identifier for individual training and test data points
unique.rowname="ID"			

###predictors:
predList=c("TCB","TCG","TCW","NLCD")

###define which predictors are categorical:
predFactor=c("NLCD")

###Create a the filename (including path) for the rast Look up Tables ###
rastLUTfn.2001 <- system.file(	"extdata",
				"helpexamples",
				"LUT_2001.csv",
				package="ModelMap")

###Load rast LUT table, and add path to the predictor raster filenames in column 1 ###
rastLUT.2001 <- read.table(rastLUTfn.2001,header=FALSE,sep=",",stringsAsFactors=FALSE)

for(i in 1:nrow(rastLUT.2001)){
	rastLUT.2001[i,1] <- system.file("extdata",
					"helpexamples",
					rastLUT.2001[i,1],
					package="ModelMap")
}                                 

#################Continuous Response###################

###Response name and type:
response.name="BIO"
response.type="continuous"

###file name to store model:
OUTPUTfn="BIO_TCandNLCD.img"

###run model.explore

model.explore(	qdata.trainfn=qdata.trainfn,
		folder=folder,		
		predList=predList,
		predFactor=predFactor,

		response.name=response.name,
		response.type=response.type,
	
		unique.rowname=unique.rowname,

		OUTPUTfn=OUTPUTfn,
		device.type="jpeg",
		jpeg.res=144,


		# Raster arguments
		rastLUTfn=rastLUT.2001,
		na.value=-9999,

		# colors for continuous predictors
		col.ramp=rainbow(101,start=0,end=.5), 
		# colors for categorical predictors
		col.cat=c("wheat1","springgreen2","darkolivegreen4",
			  "darkolivegreen2","yellow","thistle2",
			  "brown2","brown4")
)

## End(Not run) # end dontrun

[Package ModelMap version 3.4.0.4 Index]