R: Optimal Number of Factor Levels

optilevels {opticut}

R Documentation

Optimal Number of Factor Levels

Description

Finds the optimal number of factor levels given the data and a model using a likelihood-based agglomerative algorithm.

Usage

optilevels(y, x, z = NULL, alpha = 0, dist = "gaussian", ...)

## S3 method for class 'optilevels'
bestmodel(object, ...)

Arguments

`y`	vector of observations.
`x`	a factor or a matrix of proportions (i.e. the values 0 and 1 should have consistent meaning across the columns, often through a unit sum constraint). It is the user's responsibility to ensure that values supplied for `x` are sensible. `x` is not expected to include an intercept.
`z`	a design matrix with predictor variables besides the one(s) defined via the argument `x`. It is the user's responsibility to ensure that values supplied for `z` are sensible and it also makes sense to bind `x` and `z` together. Variables in `z` should be centered (mean 0) (and possibly normalized by SD), because the design matrix from `x` is not expected to include an intercept.
`alpha`	numeric [0-1], weighting factor for calculating information criteria for model selection (i.e. IC = (1-alpha)AIC + alphaBIC, also referred to as CAIC: consistent AIC).
`dist`	character, distribution argument passed to underlying functions, see listed on the help page of `opticut` (except for `dist = "zip2"`, `dist = "zinb2"` `dist = "rsf"`, and `dist = "rspf"`).
`object`	fitted object.
`...`	other arguments passed to the underlying functions, see `opticut1`.

Value

An object of class 'optilevels' that is a list with the following elements:

"delta": delta IC values along the selection path considering best models.
"ic": IC values along the selection path considering best models.
"coef": matrix of coefficients (linear predictor scale) corresponding to argument x along the selection path considering best models.
"zcoef": matrix of coefficients (linear predictor scale) corresponding to argument z when not NULL along the selection path considering best models, or NULL.
"rank": matrix ranks based on the coefficients along the selection path considering best models. Ranking uses the default ties.method = "average" in rank.
"deltalist": delta IC values along the selection path considering all competing models.
"iclist": IC values along the selection path considering all competing models.
"coeflist": matrix of coefficients (linear predictor scale) corresponding to argument x along the selection path considering all competing models.
"zcoeflist": matrix of coefficients (linear predictor scale) corresponding to argument z when not NULL along the selection path considering all competing models, or NULL.
"ranklist": matrix ranks based on the coefficients along the selection path considering all competing models.
"levels": list of (merged) factor levels along the selection path considering best models.
"Y": vector of observations (argument y).
"X": design matrix component corresponding to argument x.
"Z": design matrix component corresponding to argument z.
"alpha": weighting argument.
"dist": distribution argument.
"factor": logical, indicating if argument x is a factor (TRUE) or a matrix (FALSE).

bestmodel returns the best supported model for further manipulation (e.g. prediction).

Author(s)

Peter Solymos <psolymos@gmail.com>

Examples

## --- Factor levels with Gaussian distribution
## simple example from Legendre 2013
## Indicator Species: Computation, in
## Encyclopedia of Biodiversity, Volume 4
## https://dx.doi.org/10.1016/B978-0-12-384719-5.00430-5
gr <- as.factor(paste0("X", rep(1:5, each=5)))
spp <- cbind(Species1=rep(c(4,6,5,3,2), each=5),
    Species2=c(rep(c(8,4,6), each=5), 4,4,2, rep(0,7)),
    Species3=rep(c(18,2,0,0,0), each=5))
rownames(spp) <- gr
## must add some noise to avoid perfect fit
spp[6, "Species1"] <- 7
spp[1, "Species3"] <- 17
spp

ol <- optilevels(spp[,"Species3"], gr)
ol[c("delta", "coef", "rank", "levels")]

## get the final factor level
gr1 <- gr
levels(gr1) <- ol$level[[length(ol$level)]]
table(gr, gr1)

## compare the models
o0 <- lm(spp[,"Species3"] ~ gr - 1)
o1 <- lm(spp[,"Species3"] ~ gr1 - 1)
data.frame(AIC(o0, o1), delta=AIC(o0, o1)$AIC - AIC(o0))
ol$delta # should be identical

## --- Proportions with Poisson distribution
## simulation
set.seed(123)
n <- 500 # number of observations
k <- 5 # number of habitat types
b <- c(-1, -0.2, -0.2, 0.5, 1)
names(b) <- LETTERS[1:k]
x <- replicate(k, exp(rnorm(n)))
x <- x / rowSums(x) # proportions
X <- model.matrix(~.-1, data=data.frame(x))
lam <- exp(drop(crossprod(t(X), b)))
y <- rpois(n, lam)

z <- optilevels(y, x, dist="poisson")

## best model refit
bestmodel(z)

## estimates
plogis(z$coef)
plogis(b)
## optimal classification
z$rank

## get the final matrix
x1 <- mefa4::groupSums(x, 2, z$levels[[length(z$levels)]])
head(x)
head(x1)

## compare the models
m0 <- glm(y ~ x - 1, family="poisson")
m1 <- glm(y ~ x1 - 1, family="poisson")
data.frame(AIC(m0, m1), delta=AIC(m0, m1)$AIC - AIC(m0))
z$delta # should be identical

## Not run: 
## dolina example with factor
data(dolina)
dolina$samp$stratum <- as.integer(dolina$samp$stratum)
y <- dolina$xtab[dolina$samp$method == "Q", "ppyg"]
x <- dolina$samp$mhab[dolina$samp$method == "Q"]
z <- scale(model.matrix(~ stratum + lmoist - 1,
    dolina$samp[dolina$samp$method == "Q",]))

## without additional covariates
dol1 <- optilevels(y, x, z=NULL, dist="poisson")
dol1$rank
summary(bestmodel(dol1))

## with additional covariates
dol2 <- optilevels(y, x, z, dist="poisson")
dol2$rank
summary(bestmodel(dol2))

## compare the two models
AIC(bestmodel(dol1), bestmodel(dol2))

## End(Not run)

[Package opticut version 0.1-3 Index]