roc {analogue} R Documentation

## ROC curve analysis

### Description

Fits Receiver Operator Characteristic (ROC) curves to training set data. Used to determine the critical value of a dissimilarity coefficient that best descriminate between assemblage-types in palaeoecological data sets, whilst minimising the false positive error rate (FPF).

### Usage

```roc(object, groups, k = 1, ...)

## Default S3 method:
roc(object, groups, k = 1, thin = FALSE,
max.len = 10000, ...)

## S3 method for class 'mat'
roc(object, groups, k = 1, ...)

## S3 method for class 'analog'
roc(object, groups, k = 1, ...)
```

### Arguments

 `object` an R object. `groups` a vector of group memberships, one entry per sample in the training set data. Can be a factor, and will be coerced to one if supplied vecvtor is not a factor. `k` numeric; the `k` closest analogues to use to calculate ROC curves. `thin` logical; should the points on the ROC curve be thinned? See Details, below. `max.len` numeric; length of analolgue and non-analogue vectors. Used as limit to thin points on ROC curve to. `...` arguments passed to/from other methods.

### Details

A ROC curve is generated from the within-group and between-group dissimilarities.

For each level of the grouping vector (`groups`) the dissimilarity between each group member and it's k closest analogues within that group are compared with the k closest dissimilarities between the non-group member and group member samples.

If one is able to discriminate between members of different group on the basis of assemblage dissimilarity, then the dissimilarities between samples within a group will be small compared to the dissimilarities between group members and non group members.

`thin` is useful for large problems, where the number of analogue and non-analogue distances can conceivably be large and thus overflow the largest number R can work with. This option is also useful to speed up computations for large problems. If `thin == TRUE`, then the larger of the analogue or non-analogue distances is thinned to a maximum length of `max.len`. The smaller set of distances is scaled proportionally. In thinning, we approximate the distribution of distances by taking `max.len` (or a fraction of `max.len` for the smaller set of distances) equally-spaced probability quantiles of the distribution as a new set of distances.

### Value

A list with two components; i, `statistics`, a summary of ROC statistics for each level of `groups` and a combined ROC analysis, and ii, `roc`, a list of ROC objects, one per level of `groups`. For the latter, each ROC object is a list, with the following components:

 `TPF` The true positive fraction. `FPE` The false positive error
 `optimal` The optimal dissimilarity value, asessed where the slope of the ROC curve is maximal. `AUC` The area under the ROC curve. `se.fit` Standard error of the AUC estimate. `n.in` numeric; the number of samples within the current group. `n.out` numeric; the number of samples not in the current group. `p.value` The p-value of a Wilcoxon rank sum test on the two sets of dissimilarities. This is also known as a Mann-Whitney test. `roc.points` The unique dissimilarities at which the ROC curve was evaluated `max.roc` numeric; the position along the ROC curve at which the slope of the ROC curve is maximal. This is the index of this point on the curve. `prior` numeric, length 2. Vector of observed prior probabilities of true analogue and true non-analogues in the group. `analogue` a list with components `yes` and `no` containing the dissimilarities for true analogue and true non-analogues in the group.

### Author(s)

Gavin L. Simpson, based on code from Thomas Lumley to optimise the calculation of the ROC curve.

### References

Brown, C.D., and Davis, H.T. (2006) Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems 80, 24–38.

Gavin, D.G., Oswald, W.W., Wahl, E.R. and Williams, J.W. (2003) A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60, 356–367.

Henderson, A.R. (1993) Assessing test accuracy and its clinical consequences: a primer for receiver operating characteristic curve analysis. Annals of Clinical Biochemistry 30, 834–846.

`mat` for fitting of MAT models. `bootstrap.mat` and `mcarlo` for alternative methods for selecting critical values of analogue-ness for dissimilarity coefficients.

### Examples

```## load the example data
data(swapdiat, swappH, rlgh)

## merge training and test set on columns
dat <- join(swapdiat, rlgh, verbose = TRUE)

## extract the merged data sets and convert to proportions
swapdiat <- dat[] / 100
rlgh <- dat[] / 100

## fit an analogue matching (AM) model using the squared chord distance
## measure - need to keep the training set dissimilarities
swap.ana <- analog(swapdiat, rlgh, method = "SQchord",
keep.train = TRUE)

## fit the ROC curve to the SWAP diatom data using the AM results
## Generate a grouping for the SWAP lakes
METHOD <- if (getRversion() < "3.1.0") {"ward"} else {"ward.D"}
clust <- hclust(as.dist(swap.ana\$train), method = METHOD)
grps <- cutree(clust, 12)

## fit the ROC curve
swap.roc <- roc(swap.ana, groups = grps)
swap.roc

## draw the ROC curve
plot(swap.roc, 1)

## draw the four default diagnostic plots
layout(matrix(1:4, ncol = 2))
plot(swap.roc)
layout(1)
```

[Package analogue version 0.17-6 Index]