aglm {aglm}    R Documentation

Fit an AGLM model with no cross-validation

Description

A basic fitting function with given \alpha and \lambda values. See aglm-package for more details on \alpha and \lambda.

Usage

aglm(
  x,
  y,
  qualitative_vars_UD_only = NULL,
  qualitative_vars_both = NULL,
  qualitative_vars_OD_only = NULL,
  quantitative_vars = NULL,
  use_LVar = FALSE,
  extrapolation = "default",
  add_linear_columns = TRUE,
  add_OD_columns_of_qualitatives = TRUE,
  add_interaction_columns = FALSE,
  OD_type_of_quantitatives = "C",
  nbin.max = NULL,
  bins_list = NULL,
  bins_names = NULL,
  family = c("gaussian", "binomial", "poisson"),
  ...
)

Arguments

x

A design matrix. Usually a data.frame object is expected, but a matrix object is also acceptable if all columns are of the same class. Each column may have one of the following classes, and aglm automatically determines how to handle it:

  • numeric: interpreted as a quantitative variable. aglm performs discretization by binning, and creates dummy variables suitable for ordered values (named O-dummies/L-variables).

  • factor (unordered) or logical: interpreted as a qualitative variable without order. aglm creates dummy variables suitable for unordered values (named U-dummies).

  • ordered: interpreted as a qualitative variable with order. aglm creates both O-dummies and U-dummies.

These dummy variables are added to x and form a larger matrix, which is used internally as an actual design matrix. See our paper for more details on O-dummies, U-dummies, and L-variables.

If you need to change the default behavior, use the following options: qualitative_vars_UD_only, qualitative_vars_both, qualitative_vars_OD_only, and quantitative_vars.
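
For illustration, the sketch below builds a small data.frame that mixes the three supported column classes and fits a model; all names and values are made up and not part of the package.

## A minimal sketch of the column-class conventions (hypothetical data).
set.seed(1)
n <- 50
df <- data.frame(
  temp   = rnorm(n),                                          # numeric -> O-dummies (or L-variables)
  region = factor(sample(c("A", "B", "C"), n, replace=TRUE)),  # unordered factor -> U-dummies
  grade  = ordered(sample(c("low", "mid", "high"), n, replace=TRUE),
                   levels=c("low", "mid", "high"))             # ordered -> O-dummies and U-dummies
)
y <- rnorm(n)
model <- aglm(df, y)  # each column is encoded according to its class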

y

A response variable.

qualitative_vars_UD_only

Used to change the default behavior of aglm for the given variables. Variables specified by this parameter are treated as qualitative variables, and only U-dummies are created as auxiliary columns. This parameter may have one of the following classes (see the example below the list):

  • integer: specifying variables by index.

  • character: specifying variables by name.
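
For example (a sketch; the column name and index below are hypothetical), a numeric column that actually encodes category codes can be forced to U-dummies only:

## Treat the numeric column "district_code" as qualitative with U-dummies only:
# model <- aglm(x, y, qualitative_vars_UD_only="district_code")  # by name
# model <- aglm(x, y, qualitative_vars_UD_only=3)                # by index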

qualitative_vars_both

Same as qualitative_vars_UD_only, except that both O-dummies and U-dummies are created for specified variables.

qualitative_vars_OD_only

Same as qualitative_vars_UD_only, except that only O-dummies are created for the specified variables.

quantitative_vars

Same as qualitative_vars_UD_only, except that specified variables are considered as quantitative variables.

use_LVar

Set to TRUE to use L-variables. By default, aglm uses O-dummies as the representation of a quantitative variable. If use_LVar=TRUE, L-variables are used instead.

extrapolation

Used to control the values of the linear combination for quantitative variables outside the region where the data exist. By default, the values of the linear combination outside the data are extended based on the slopes at the edges of the region covered by the data. You can set extrapolation="flat" to get constant values outside the data instead.
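
A sketch of how the two settings are passed (the full comparison, together with use_LVar, appears in the Examples section below):

# model_default <- aglm(x, y, extrapolation="default")  # linear extension beyond the data
# model_flat    <- aglm(x, y, extrapolation="flat")     # constant values beyond the data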

add_linear_columns

By default, for quantitative variables, aglm expands them by adding dummies while the original columns, i.e. the linear effects, are retained in the resulting model. You can set add_linear_columns=FALSE to drop the linear effects.

add_OD_columns_of_qualitatives

Set to FALSE if you do not want to use O-dummies for qualitative variables with order (usually, columns with ordered class).

add_interaction_columns

If this parameter is set to TRUE, aglm creates an additional auxiliary variable x_i * x_j for each pair (x_i, x_j) of variables.
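
For instance (a sketch), with p predictor columns this adds one product column for each of the p*(p-1)/2 pairs:

# model <- aglm(x, y, add_interaction_columns=TRUE)  # adds x_i * x_j for every pair of columns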

OD_type_of_quantitatives

Used to control the shape of linear combinations obtained by O-dummies for quantitative variables (deprecated).

nbin.max

An integer representing the maximum number of bins used when aglm performs binning for quantitative variables.

bins_list

Used to set custom bins for variables with O-dummies.

bins_names

Used together with bins_list to specify, by index or name, the variables to which the custom bins are applied.
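
A sketch of custom binning, assuming that each element of bins_list is a numeric vector of breaks and that bins_names identifies the corresponding variable (the column name "age" and the breaks are hypothetical):

## Assumed usage: supply breaks for "age" and keep the defaults for the other columns.
# my_bins <- list(c(0, 20, 40, 60, 80, 100))
# model <- aglm(x, y, bins_list=my_bins, bins_names=c("age"))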

family

A family object or a string representing the type of the error distribution. Currently aglm supports gaussian, binomial, and poisson.
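
For example, both forms below request a Poisson model (a sketch):

# model <- aglm(x, y, family="poisson")  # by string
# model <- aglm(x, y, family=poisson())  # by family object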

...

Other arguments are passed directly to glmnet().
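
For instance, glmnet() arguments such as alpha, nlambda, or standardize can be supplied here (a sketch):

# model <- aglm(x, y, alpha=0.5, nlambda=50, standardize=FALSE)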

Value

A model object fitted to the data. Functions such as predict and plot can be applied to the returned object. See AccurateGLM-class for more details.
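
With a fitted model as in the Examples below, prediction and plotting work along these lines (a sketch; see AccurateGLM-class and the methods' own help pages for exact signatures):

# y_pred <- predict(model, newx=newx, s=0.1)  # predictions at lambda=0.1
# plot(model, s=0.1)                          # plots of the fitted model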

Author(s)

References

Suguru Fujita, Toyoto Tanaka, Kenji Kondo and Hirokazu Iwasawa (2020). AGLM: A Hybrid Modeling Method of GLM and Data Science Techniques. Actuarial Colloquium Paris 2020.
https://www.institutdesactuaires.com/global/gene/link.php?doc_id=16273&fg=1

Examples


#################### Gaussian case ####################

library(MASS) # For Boston
library(aglm)

## Read data
xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.

## Split data into train and test
n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/4)) # ID numbers for the test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[-ncol(xy)]
y <- train$y
newx <- test[-ncol(xy)]
y_true <- test$y

## Fit the model
model <- aglm(x, y)  # alpha=1 (the default value)

## Predict for various alpha and lambda
lambda <- 0.1
y_pred <- predict(model, newx=newx, s=lambda)
rmse <- sqrt(mean((y_true - y_pred)^2))
cat(sprintf("RMSE for lambda=%.2f: %.5f \n\n", lambda, rmse))

lambda <- 1.0
y_pred <- predict(model, newx=newx, s=lambda)
rmse <- sqrt(mean((y_true - y_pred)^2))
cat(sprintf("RMSE for lambda=%.2f: %.5f \n\n", lambda, rmse))

alpha <- 0
model <- aglm(x, y, alpha=alpha)

lambda <- 0.1
y_pred <- predict(model, newx=newx, s=lambda)
rmse <- sqrt(mean((y_true - y_pred)^2))
cat(sprintf("RMSE for alpha=%.2f and lambda=%.2f: %.5f \n\n", alpha, lambda, rmse))

#################### Binomial case ####################

library(aglm)
library(faraway)

## Read data
xy <- nes96

## Split data into train and test
n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/5)) # ID numbers for the test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[, c("popul", "TVnews", "selfLR", "ClinLR", "DoleLR", "PID", "age", "educ", "income")]
y <- train$vote
newx <- test[, c("popul", "TVnews", "selfLR", "ClinLR", "DoleLR", "PID", "age", "educ", "income")]

## Fit the model
model <- aglm(x, y, family="binomial")

## Make the confusion matrix
lambda <- 0.1
y_true <- test$vote
y_pred <- levels(y_true)[as.integer(predict(model, newx, s=lambda, type="class"))]

print(table(y_true, y_pred))

#################### use_LVar and extrapolation ####################

library(MASS) # For Boston
library(aglm)

## Randomly created train and test data
set.seed(2021)
sd <- 0.2
x <- 2 * runif(1000) + 1
f <- function(x){x^3 - 6 * x^2 + 13 * x}
y <- f(x) + rnorm(1000, sd = sd)
xy <- data.frame(x=x, y=y)
x_test <- seq(0.75, 3.25, length.out=101)
y_test <- f(x_test) + rnorm(101, sd=sd)
xy_test <- data.frame(x=x_test, y=y_test)

## Plot
nbin.max <- 10
models <- c(cv.aglm(x, y, use_LVar=FALSE, extrapolation="default", nbin.max=nbin.max),
            cv.aglm(x, y, use_LVar=FALSE, extrapolation="flat", nbin.max=nbin.max),
            cv.aglm(x, y, use_LVar=TRUE, extrapolation="default", nbin.max=nbin.max),
            cv.aglm(x, y, use_LVar=TRUE, extrapolation="flat", nbin.max=nbin.max))

titles <- c("O-Dummies with extrapolation=\"default\"",
            "O-Dummies with extrapolation=\"flat\"",
            "L-Variables with extrapolation=\"default\"",
            "L-Variables with extrapolation=\"flat\"")

par.old <- par(mfrow=c(2, 2))
for (i in 1:4) {
  model <- models[[i]]
  title <- titles[[i]]

  pred <- predict(model, newx=x_test, s=model@lambda.min, type="response")

  plot(x_test, y_test, pch=20, col="grey", main=title)
  lines(x_test, f(x_test), lty="dashed", lwd=2)  # the theoretical line
  lines(x_test, pred, col="blue", lwd=3)  # the smoothed line by the model
}
par(par.old)
