mfp2 {mfp2} | R Documentation |
Multivariable Fractional Polynomial Models with Extensions
Description
Selects the multivariable fractional polynomial (MFP) model that best predicts
the outcome variable. It also has the ability to model a sigmoid relationship
between x
and an outcome variable y
using the approximate cumulative
distribution (ACD) transformation proposed by Royston (2014).
This function provides two interfaces for input data: one for inputting
data matrix x
and outcome vector y
directly and the other for using a
formula
object together with a dataframe data
. Both interfaces are
equivalent in terms of functionality.
Usage
mfp2(x, ...)
## Default S3 method:
mfp2(
x,
y,
weights = NULL,
offset = NULL,
cycles = 5,
scale = NULL,
shift = NULL,
df = 4,
center = TRUE,
subset = NULL,
family = c("gaussian", "poisson", "binomial", "cox"),
criterion = c("pvalue", "aic", "bic"),
select = 0.05,
alpha = 0.05,
keep = NULL,
xorder = c("ascending", "descending", "original"),
powers = NULL,
ties = c("breslow", "efron", "exact"),
strata = NULL,
nocenter = NULL,
acdx = NULL,
ftest = FALSE,
control = NULL,
verbose = TRUE,
...
)
## S3 method for class 'formula'
mfp2(
formula,
data,
weights = NULL,
offset = NULL,
cycles = 5,
scale = NULL,
shift = NULL,
df = 4,
center = TRUE,
subset = NULL,
family = c("gaussian", "poisson", "binomial", "cox"),
criterion = c("pvalue", "aic", "bic"),
select = 0.05,
alpha = 0.05,
keep = NULL,
xorder = c("ascending", "descending", "original"),
powers = NULL,
ties = c("breslow", "efron", "exact"),
strata = NULL,
nocenter = NULL,
ftest = FALSE,
control = NULL,
verbose = TRUE,
...
)
Arguments
x |
for |
... |
not used. |
y |
for |
weights |
a vector of observation weights of length nobs.
Default is |
offset |
a vector of length nobs that is included in the linear
predictor. Useful for the poisson family (e.g. log of exposure time).
Default is |
cycles |
an integer, specifying the maximum number of iteration cycles. Default is 5. |
scale |
a numeric vector of length nvars or single numeric specifying
scaling factors. If a single numeric, then the value will be replicated as
necessary. The formula interface |
shift |
a numeric vector of length nvars or a single numeric specifying
shift terms. If a single numeric, then the value will be replicated as
necessary. The formula interface |
df |
a numeric vector of length nvars or a single numeric that sets the
(default) degrees of freedom (df) for each predictor. If a single numeric,
then the value will be replicated as necessary. The formula interface
|
center |
a logical determining whether variables are centered before
final model fitting. The default |
subset |
an optional vector specifying a subset of observations
to be used in the fitting process. Default is |
family |
a character string representing a |
criterion |
a character string specifying the criterion used to select
variables and FP models of different degrees.
Default is to use p-values in which case the user can specify
the nominal significance level (or use default level of 0.05) for variable and
functional form selection (see |
select |
a numeric vector of length nvars or a single numeric that
sets the nominal significance levels for variable selection on each predictor
by backward elimination. If a single numeric, then the value will be replicated
as necessary. The formula interface |
alpha |
a numeric vector of length nvars or a single numeric that
sets the significance levels for testing between FP models of
different degrees. If a single numeric, then the value will be replicated
as necessary. The formula interface |
keep |
a character vector with names of variables to be kept
in the model. In case that |
xorder |
a string determining the order of entry of the covariates
into the model-selection algorithm. The default is |
powers |
a named list of numeric values that sets the permitted FP
powers for each covariate. The default is NULL, and each covariate is assigned
|
ties |
a character string specifying the method for tie handling in
Cox regression. If there are no tied death times all the methods are
equivalent. Default is the Breslow method. This argument is used for Cox
models only and has no effect on other model families.
See |
strata |
a numeric vector or matrix of variables that define strata
to be used for stratification in a Cox model. A new factor, whose levels are
all possible combinations of the variables supplied will be created.
Default is |
nocenter |
a numeric vector with a list of values for fitting Cox
models. See |
acdx |
a numeric vector of names of continuous variables to undergo
the approximate cumulative distribution (ACD) transformation.
It also invokes the function-selection procedure to determine the
best-fitting FP1(p1, p2) model (see Details section). Not present in the
formula interface |
ftest |
a logical; for normal error models with small samples, critical
points from the F-distribution can be used instead of Chi-Square
distribution. Default |
control |
a list object with parameters controlling model fit details.
Returned by either |
verbose |
a logical; run in verbose mode. |
formula |
for |
data |
for |
Value
mfp2()
returns an object of class inheriting from glm
or copxh
,
depending on the family
parameter.
The function summary()
(i.e. summary.mfp2()
) can be used to obtain or
print a summary of the results.
The generic accessor function coef()
can be used to extract the vector of
coefficients from the fitted model object.
The generic predict()
can be used to obtain predictions from the fitted
model object.
An object of class mfp2
is a list containing all entries as for glm
or coxph
, and in addition the following entries:
convergence_mfp: logical value indicating convergence of mfp algorithm.
fp_terms: a data.frame with information on fractional polynomial terms.
transformations: a data.frame with information on shifting, scaling and centering for all variables.
fp_powers: a list with all powers of fractional polynomial terms. Each entry of the list is named according to the transformation of the variable.
acd: a vector with information for which variables the acd transformation was applied.
x_original: the scaled and shifted input matrix but without transformations.
y: the original outcome variable.
x: the final transformed input matrix used to fit the final model.
call_mfp: the call to the
mfp2()
function.family_string: the family stored as character string.
The mfp2
object may contain further information depending on family.
Methods (by class)
-
mfp2(default)
: Default method using input matrixx
and outcome vectory
. -
mfp2(formula)
: Provides formula interface formfp2
.
Brief summary of FPs
In the following we denote fractional polynomials for a variable x
by
increasing complexity as either FP1 or FP2. In this example,
FP2(p1, p2)
for p1\neq p2
is the most flexible FP transformation,
where
FP2(p1, p2) = \beta_1 x^{p1} + \beta_2 x^{p2}.
When p1 = p2
(repeated powers), the FP2 model is given by
FP2(p1, p2) = \beta_1 x^{p1} + \beta_2 x^{p1}log(x).
The powers p1
and p2
are usually chosen from a predefined set
of powers S = {(-2, -1, -0.5, 0, 0.5, 1, 2, 3)}
where the power
of 0 indicates the natural logarithm. The best FP2 is then estimated by using a
closed testing procedure that seeks the best combination from all 36 pairs of
powers (p1, p2)
. Functions that only involve a single power of
the variable are denoted as FP1, i.e.
FP1(p1) = \beta_1 x^{p1}.
For details see e.g. Sauerbrei et al (2006).
Details on family
option
mfp2()
supports the family object as used by stats::glm()
. The built in
families are specified via a character string. mfp2(..., family = "binomial")
fits a logistic regression model, while mfp2(..., family = "gaussian")
fits a linear regression (ordinary least squares) model.
For Cox models, the response should preferably be a Surv
object,
created by the survival::Surv()
function, and the family = "cox"
.
Only right-censored data are currently supported. To fit stratified Cox
models, the strata
option can be used, or alternatively strata
terms
can be included in the model formula when using the formula interface
mfp2.formula
.
Details on shifting, scaling, centering
Fractional polynomials are defined only for positive variables due to the
use of logarithms and other powers. Thus, mfp2()
estimates shifts for
each variables to ensure positivity or assumes that the variables are
already positive when computing fractional powers of the input variables
in case that shifting is disabled manually.
If the values of the variables are too large or too small, it is important to
conduct variable scaling to reduce the chances of numerical underflow or
overflow which can lead to inaccuracies and difficulties in estimating the
model. Scaling can be done automatically or by directly specifying the
scaling values so that the magnitude of the x
values are not too extreme.
By default scaling factors are estimated by the program as follows.
After adjusting the location of x
so that its minimum value is positive,
creating x'
, automatic scaling will divide each value of x'
by
10^p
where the exponent p
is given by
p = sign(k) \times floor(|k|) \quad \text{where} \quad k = log_{10} (max(x')- min(x'))
The FP transformation of x'
is centered on the mean of the observed
values of x'
. For example, for the FP1 model \beta_0 + \beta_1x^p
,
the actual model fitted by the software would be
\beta'_0 + \beta'_1(x'^p-mean(x'^p))
. This approach ensures that
the revised constant \beta'_0
or baseline hazard function in a Cox
model retains a meaningful interpretation.
So in brief: shifting is required to make input values positive, scaling
helps to bring the values to a reasonable range. Both operations are
conducted before estimating the FP powers for an input variable.
Centering, however, is done after estimating the FP functions for each
variable.
Centering before estimating the FP powers may result in different powers and
should be avoided. Also see transform_vector_fp()
for some more details.
Details on the subset
argument
Note that subsetting occurs after data pre-processing (shifting and scaling),
but before model selection and fitting. In detail, when the option subset
is
used and scale, shift or centering values are to be estimated, then mfp2()
first estimates these parameters using the full dataset (no subsetting).
It then conduct subsetting before proceeding to perform model selection and
fitting on the specified subset of the data.
Therefore, subsetting in mfp2()
is not equivalent to subsetting the data
before passing it to mfp2()
, and thus cannot be used to implement, for example,
cross-validation or to remove NA
. These tasks should be done by the caller
beforehand. However, it does allow to use the same data pre-processing
for different subsets of the data. An example use case is when separate
models are to be estimated for women and men in the dataset, but a common
data pre-processing should be applied. In this case the subset
option
can be used to restrict model selection to either women or men, but the
data processing (e.g. shifting factors) will be shared between the two models.
Details on approximate cumulative distribution transformation
The approximate cumulative distribution (ACD) transformation (Royston 2014)
converts each predictor, x
, smoothly to an approximation, acd(x)
,
of its empirical cumulative distribution function.
This is done by smoothing a probit transformation of
the scaled ranks of x
. acd(x)
could be used instead of x
as a covariate. This has the advantage of providing sigmoid curves, something
that regular FP functions cannot achieve.
Details of the precise definition and some possible uses of the ACD
transformation in a univariate context are given by Royston (2014).
Royston and Sauerbrei (2016) describes how one could go further and replace FP2
functions with a pair of FP1 functions, one in x
and the other in
acd(x)
.
This alternative class of four-parameter functions provides about
the same flexibility as the standard FP2 family, but the ACD component offers
the additional possibility of sigmoid functions.
Royston (2014) discusses how the extended class of functions known as
FP1(p1, p2)
, namely
FP1(p1, p2) = \beta_1 x^{p1} + \beta_2 acd(x)^{p2}
can be fitted optimally by seeking the best combination of all 64 pairs of
powers (p1, p2). The optimisation is invoked by use of the acdx
parameter.
Royston (2014) also described simplification of the chosen function through
model reduction by applying significance testing to six sub-families of
functions,M1-M6, giving models M1 (most complex) through M6 (null):
M1: FP1(p1, p2) (no simplification)
M2: FP1(p1, .) (regular FP1 function of
x
)M3: FP1(., p2) (regular FP1 function of
acd(x)
)M4: FP1(1, .) (linear function of
x
)M5: FP1(., 1) (linear function of
acd(x)
)M6: Null (
x
omitted entirely)
Selection among these six sub-functions is performed by a closed test
procedure known as the function-selection pocedure FSPA.
It maintains the family-wise type 1 error
probability for selecting x
at the value determined by the
select
parameter. To obtain a 'final' model, a structured sequence of up
to five tests is carried out, the first at the significance level specified
by the select
parameter, and the remainder at the significance level
provided by the alpha
option.
The sequence of tests is as follows:
Test 1: Compare the deviances of models 6 and 1 on 4 d.f. If not significant then stop and omit
x
, otherwise continue to step 2.Test 2: Compare the deviances of models 4 and 1 on 3 d.f. If not significant then accept model 4 and stop. Otherwise, continue to step 3.
Test 3: Compare the deviance of models 2 and 1 on 2 d.f. If not significant then accept model 2 and stop. Otherwise continue to step 4.
Test 4: Compare the deviance of models 3 and 1 on 2 d.f. If significant then model 1 cannot be simplified; accept model 1 and stop. Otherwise continue to step 5.
Test 5: Compare the deviances of models 5 and 3 on 1 d.f. If significant then model 3 cannot be simplified; accept model 3. Otherwise, accept model 5. End of procedure.
The result is the selection of one of the six models.
Details on model specification using a formula
mfp2
supports model specifications using two different interfaces: one
which allows passing of the data matrix x
and outcome vector y
directly
(as done in e.g. stats::glm.fit()
or glmnet
) and another which conforms
to the formula interface used by many commonly used R modelling functions
such as stats::glm()
or survival::coxph()
.
Both interfaces are equivalent in terms of possible fitted models, only the
details of specification differ. In the standard interface all details
regarding FP-transformations are given as vectors. In the formula interface
all details are specified using special fp()
function. These support the
specification of degrees of freedom (df
), nominal significance level for
variable selection (select
), nominal significance level for functional form
selection (alpha
), shift values (shift
), scale values (scale
),
centering (center
) and the ACD-transformation (acd
). Values specified
through fp()
function override the values specified as defaults and passed to
the mfp2()
function.
The formula may also contain strata
terms to fit stratified Cox models, or
an offset
term to specify a model offset.
Note that for a formula using .
, such as y ~ .
the mfp2()
function may not
fit a linear model, but may perform variable and functional form selection
using FP-transformations, depending on the default settings of df
,
select
and alpha
passed as arguments to mfp2()
.
For example, using y ~ .
with default settings means that mfp2()
will
apply FP transformation with 4 df to all continuous variables and use alpha
equal to 0.05 to select functional forms, along with the selection algorithm
with a significance level of 0.05 for all variables.
Compatibility with mfp
package
mfp2
is an extension of the mfp
package and can be used to reproduce
the results from a model fitted by mfp
. Since both packages implement the
MFP algorithm, they use functions with the same names (e.g fp()
). Therefore,
if you load both packages using a call to library
, there will
be namespace conflicts and only the functions from the package loaded last
will work properly.
Convergence and Troubleshooting
Typically, mfp2
requires two to four cycles to achieve convergence. Lack of
convergence involves oscillation between two or more models and is extremely
rare. If the model does not converge, you can try changing the nominal
significance levels for variable (select
) or function selection (alpha
).
References
Royston, P. and Sauerbrei, W., 2008. Multivariable Model - Building:
A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials
for Modelling Continuous Variables. John Wiley & Sons.
Sauerbrei, W., Meier-Hirmer, C., Benner, A. and Royston, P., 2006.
Multivariable regression model building by using fractional
polynomials: Description of SAS, STATA and R programs.
Comput Stat Data Anal, 50(12): 3464-85.
Royston, P. 2014. A smooth covariate rank transformation for use in
regression models with a sigmoid dose-response function.
Stata Journal 14(2): 329-341.
Royston, P. and Sauerbrei, W., 2016. mfpa: Extension of mfp using the
ACD covariate transformation for enhanced parametric multivariable modeling.
The Stata Journal, 16(1), pp.72-87.
Sauerbrei, W. and Royston, P., 1999. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J Roy Stat Soc a Sta, 162:71-94.
See Also
summary.mfp2()
, coef.mfp2()
, predict.mfp2()
, fp()
Examples
# Gaussian model
data("prostate")
x = as.matrix(prostate[,2:8])
y = as.numeric(prostate$lpsa)
# default interface
fit1 = mfp2(x, y, verbose = FALSE)
fit1$fp_terms
fracplot(fit1) # generate plots
summary(fit1)
# formula interface
fit1b = mfp2(lpsa ~ fp(age) + fp(svi, df = 1) + fp(pgg45) + fp(cavol) + fp(weight) +
fp(bph) + fp(cp), data = prostate)
# logistic regression model
data("pima")
xx <- as.matrix(pima[, 2:9])
yy <- as.vector(pima$y)
fit2 <- mfp2(xx, yy, family = "binomial", verbose = FALSE)
fit2$fp_terms
# Cox regression model
data("gbsg")
# create dummy variable for grade using ordinal coding
gbsg <- create_dummy_variables(gbsg, var_ordinal = "grade", drop_variables = TRUE)
xd <- as.matrix(gbsg[, -c(1, 6, 10, 11)])
yd <- survival::Surv(gbsg$rectime, gbsg$censrec)
# fit mfp and keep hormon in the model
fit3 <- mfp2(xd, yd, family = "cox", keep = "hormon", verbose = FALSE)
fit3$fp_terms