fsrfan {fsdaR} | R Documentation |
Robust transformations for regression
Description
The transformations for negative and positive responses were determined
by Yeo and Johnson (2000) by imposing the smoothness condition that the second
derivative of zYJ (\lambda
) with respect to y be smooth at y = 0. However some authors,
for example Weisberg (2005), query the physical interpretability of this constraint
which is oftern violated in data analysis. Accordingly, Atkinson et al. (2019) and (2020)
extend the Yeo-Johnson transformation to allow two values of the transformations
parameter: \lambda_N
for negative observations and \lambda_P
for non-negative ones.
FSRfan monitors:
the t test associated with the constructed variable computed assuming the same transformation parameter for positive and negative observations fixed. In short we call this test, "global score test for positive observations".
the t test associated with the constructed variable computed assuming a different transformation for positive observations keeping the value of the transformation parameter for negative observations fixed. In short we call this test, "test for positive observations".
the t test associated with the constructed variable computed assuming a different transformation for negative observations keeping the value of the transformation parameter for positive observations fixed. In short we call this test, "test for negative observations".
the F test for the joint presence of the two constructed variables described in points 2) and 3).
the F likelihood ratio test based on the MLE of
\lambda_P
and\lambda_N
. In this case the residual sum of squares of the null model bsaed on a single trasnformation parameter\lambda
is compared with the residual sum of squares of the model based on data transformed data using MLE of\lambda_P
and\lambda_N
.
Usage
fsrfan(x, ...)
## S3 method for class 'formula'
fsrfan(
formula,
data,
subset,
weights,
na.action,
model = TRUE,
x.ret = FALSE,
y.ret = FALSE,
contrasts = NULL,
offset,
...
)
## Default S3 method:
fsrfan(
x,
y,
intercept = TRUE,
plot = FALSE,
family = c("BoxCox", "YJ", "YJpn", "YJall"),
la = c(-1, -0.5, 0, 0.5, 1),
lms,
alpha = 0.75,
h,
init,
msg = FALSE,
nocheck = FALSE,
nsamp = 1000,
conflev = 0.99,
xlab,
ylab,
main,
xlim,
ylim,
lwd = 2,
lwd.env = 1,
trace = FALSE,
...
)
## S3 method for class 'fsrfan'
plot(
x,
conflev = 0.99,
xlim,
ylim,
xlab = "Subset of size m",
ylab = "Score test statistic",
main = "Fan plot",
col,
lty,
lwd = 2.5,
lwd.env = 1,
...
)
Arguments
x |
An Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
... |
potential further arguments passed to lower level functions. |
formula |
a |
data |
data frame from which variables specified in
|
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
weights |
an optional vector of weights to be used NOT USED YET. |
na.action |
a function which indicates what should happen
when the data contain |
model |
|
x.ret |
|
y.ret |
|
contrasts |
an optional list. See the |
offset |
this can be used to specify an a priori
known component to be included in the linear predictor
during fitting. An |
y |
Response variable. A vector with |
intercept |
wheather to use constant term (default is |
plot |
If |
family |
string which identifies the family of transformations which must be used. Possible values are
|
la |
values of the transformation parameter for which it is necessary
to compute the score test. Default value of lambda is
|
lms |
how to find the initlal subset to initialize the search. If |
alpha |
the percentage (roughly) of squared residuals whose sum will be minimized,
by default |
h |
The number of observations that have determined the least trimmed squares
estimator, scalar. |
init |
Search initialization. It specifies the initial subset size to start
monitoring the value of the score test. If |
msg |
Controls whether to display or not messages on the screen. If |
nocheck |
Whether to check input arguments. If |
nsamp |
number of subsamples which will be extracted to find the robust estimator. If Remark: if the number of all possible subset is <1000 the default is to extract all subsets
otherwise just 1000. If |
conflev |
Confidence level for the bands (default is 0.99, that is, we plot two horizontal lines corresponding to values -2.58 and 2.58). |
xlab |
A label for the X-axis, default is 'Subset size m' |
ylab |
A label for the Y-axis, default is 'Score test statistic' |
main |
A label for the title, default is 'Fan plot' |
xlim |
Minimum and maximum for the X-axis |
ylim |
Minimum and maximum for the Y-axis |
lwd |
The line width of the curves which contain the score test, a positive number, default is |
lwd.env |
The line width of the lines associated with the envelopes, a positive number, default is |
trace |
Whether to print intermediate results. Default is |
col |
a vector specifying the colors for the lines, each
one corresponding to a |
lty |
a vector specifying the line types for the lines, each
one corresponding to a |
Value
An S3 object of class fsrfan.object
will be returned which is basically a list
containing the following elements:
-
la
: vector containing the values of lambda for which fan plot is constructed -
bs
: matrix of sizep X length(la)
containing the units forming the initial subset for each value of lambda -
Score
: a matrix containing the values of the score test for each value of the transformation parameter:1st col = fwd search index;
2nd col = value of the score test in each step of the fwd search for la[1]
...
-
Scorep
: matrix containing the values of the score test for positive observations for each value of the transformation parameter.Note: this output is present only if input option
family='YJpn'
orfamily='YJall'
. -
Scoren
: matrix containing the values of the score test for negative observations for each value of the transformation parameter.Note: this output is present only if input option 'family' is 'YJpn' or 'YJall'.
-
Scoreb
: matrix containing the values of the score test for the joint presence of both constructed variables (associated with positive and negative observations) for each value of the transformation parameter. In this case the reference distribution is theF
with 2 andsubset_size - p
degrees of freedom.Note: this output is present only if input option
family='YJall'
. -
Un
: a three-dimensional array containinglength(la)
matrices of sizeretnUn=(n-init) X retpUn=11
. Each matrix contains the unit(s) included in the subset at each step in the search associated with the corresponding element ofla
.REMARK: at each step the new subset is compared with the old subset.
Un
contains the unit(s) present in the new subset but not in the old one.
Author(s)
FSDA team, valentin.todorov@chello.at
References
Atkinson, A.C. and Riani, M. (2000), Robust Diagnostic Regression Analysis Springer Verlag, New York.
Atkinson, A.C. and Riani, M. (2002), Tests in the fan plot for robust, diagnostic transformations in regression, Chemometrics and Intelligent Laboratory Systems, 60, pp. 87–100.
Atkinson, A.C. Riani, M. and Corbellini A. (2019), The analysis of transformations for profit-and-loss data, Journal of the Royal Statistical Society, Series C, "Applied Statistics", 69, pp. 251–275. doi:10.1111/rssc.12389
Atkinson, A.C. Riani, M. and Corbellini A. (2021), The Box-Cox Transformation: Review and Extensions, Statistical Science, 36(2), pp. 239–255. doi:10.1214/20-STS778.
Examples
## Not run:
data(wool)
XX <- wool
y <- XX[, ncol(XX)]
X <- XX[, 1:(ncol(XX)-1), drop=FALSE]
out <- fsrfan(X, y) # call 'fsrfan' with all default parameters
out <- fsrfan(cycles~., data=wool) # use the formula interface
set.seed(10)
out <- fsrfan(cycles~., data=wool, plot=TRUE) # call 'fsrfan' and produce the plot
plot(out) # use the plot method on the fsrfan object
plot(out, conflev=c(0.9, 0.95, 0.99)) # change the confidence leel in the plot method
##======================
##
## fsrfan() with all default options.
## Store values of the score test statistic for the five most common
## values of $\lambda$. Produce also a fan plot and display it on the screen.
## Common part to all examples: load 'wool' data set.
data(wool)
head(wool)
dim(wool)
## The function fsrfan() stores the score test statistic.
## In this case we use the five most common values of lambda are considered
out <- fsrfan(cycles~., data=wool)
plot(out)
## fanplot(out) # Not yet implemented in fsdaR
## The fan plot shows the log transformation is diffused throughout the data
## and does not depend on the presence of particular observations.
##======================
##
## Example specifying 'lambda'.
## Produce a fan plot for each value of 'lambda' in the vector 'la'.
## Extract in matrix 'Un' the units which entered the search in each step
data(wool)
out <- fsrfan(cycles~., data=wool, la=c(-1, -0.5, 0, 0.5), plot=TRUE)
plot(out)
out$Un[,2,]
##======================
## Example specifying the confidence level and the initial starting point for monitoring.
## Construct the fan plot specifying the confidence level and the initial starting point
## for monitoring.
data(wool)
out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, conflev=0.95, plots=TRUE)
plot(out, conflev=0.95)
##=====================
## Example with starting point based on LTS.
## Extract all subsamples, construct a fan plot specifying the confidence level
## and the initial starting point for monitoring based on p+2 observations,
## strong line width for lines associated with the confidence bands.
data(wool)
out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, lms=0,
lwd.env=3, plot=TRUE)
plot(out, lwd.env=3)
##=====================
## Fan plot using the loyalty cards data.
## In this example, 'la' is the vector contanining the most common values
## of the transformation parameter.
## Store the score test statistics for the specified values of lambda
## and automatically produce the fan plot
data(loyalty)
head(loyalty)
dim(loyalty)
## la is a vector contanining the most common values of the transformation parameter
out <- fsrfan(amount_spent~., data=loyalty, la=c(0, 1/3, 0.4, 0.5),
init=ncol(loyalty)+1, plot=TRUE, lwd=3)
plot(out, lwd=3)
## The fan plot shows that even if the third root is the best value of the transformation
## parameter at the end of the search, in earlier steps it lies very close to the upper
## rejection region. The best value of the transformation parameter seems to be the one
## associated with la=0.4, which is always the confidence bands but at the end of search,
## due to the presence of particular observations it goes below the lower rejection line.
##=====================
## Compare BoxCox with Yeo and Johnson transformation.
## Store values of the score test statistic for the five most common
## values of lambda. Produce also a fan plot and display it on the screen.
## Common part to all examples: load wool dataset.
data(wool)
## Store the score test statistic using Box Cox transformation.
outBC <- fsrfan(cycles~., data=wool, nsamp=0)
## Store the score test statistic using Yeo and Johnson transformation.
outYJ <- fsrfan(cycles~., data=wool, family="YJ", nsamp=0)
## Not yet fully implemented
## fanplot(outBC, main="Box Cox")
## fanplot(outYJ,main="Yeo and Johnson")
plot(outBC, main="Box Cox")
plot(outYJ, main="Yeo and Johnson")
cat("\nMaximum difference in absolute value: ",
max(max(abs(outYJ$Score - outBC$Score), na.rm=TRUE)), "\n")
##======================
## Call 'fsrfan' with Yeo-Johnson (YJ) transformation
out <- fsrfan(cycles~., data=wool, family="YJ")
plot(out)
## End(Not run)