PSFormula {PStrata}    R Documentation

Set up a model formula for use in PStrata

Description

Set up a model formula for use in the PStrata package, allowing users to specify the treatment indicator, the post-randomization confounding variables, the outcome variable, and, optionally, the covariates. For a survival outcome, a censoring indicator is also specified. Users can also define (potentially non-linear) transformations of the covariates and include random effects for clusters.

Usage

PSFormula(formula, data)

Arguments

formula

an object of class formula (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given in 'Details'.

data

a data frame containing the variables named in formula.

Details

Two models are required for the principal stratification analysis: the principal stratum model and the outcome model.

General formula structure

For the principal stratum model, the formula argument accepts formulas of the following syntax:

treatment + postrand ~ terms

The treatment variable refers to the name of the binary treatment indicator. The postrand variable refers to the name of the binary post-randomization confounding variable. The terms part includes all of the predictors used for the principal stratum model.

For the outcome model, the formula argument accepts formulas of a similar syntax:

response [+ observed] ~ terms

The response variable refers to the name of the outcome variable. The terms part includes all of the predictors used for the outcome model. The observed variable should be omitted for an ordinary (uncensored) response. When the true response is subject to right censoring (often called a survival outcome in the literature), the response variable should refer to the observed or censored response, and the observed variable should be an indicator of whether the true response is observed. For example, suppose the true event time is T and the censoring time is C. Then the response variable should refer to min(T, C), the time of the event or of censoring, whichever comes first, and the indicator observed is 1 if T < C and 0 otherwise.
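As a minimal sketch of the censoring convention described above, the observed response and the indicator can be built in base R as follows. The vectors event_time and censor_time hold made-up values for illustration and are not data from the package.

```r
# Hypothetical true event times T and censoring times C (illustrative values only)
event_time  <- c(5, 8, 3, 10)
censor_time <- c(7, 6, 4, 12)

# Observed response: min(T, C), taken element-wise
time <- pmin(event_time, censor_time)

# Indicator: 1 if the true event time is observed (T < C), 0 otherwise
observed <- as.integer(event_time < censor_time)

surv_df <- data.frame(time = time, observed = observed)
```

An outcome formula for such data would then take the form time + observed ~ terms.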

The terms specified in the principal stratum model and the outcome model can be different.
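The formula shapes above can be written as ordinary R formula objects. The sketch below uses base R only; the variable names Z, D, Y, delta, X1 and X2 are placeholders chosen for illustration.

```r
# Principal stratum model: treatment Z, post-randomization variable D
s_formula <- Z + D ~ X1 + X2

# Outcome model for an ordinary response Y (its terms may differ from those above)
y_formula <- Y ~ X1

# Outcome model for a right-censored response: observed time Y plus indicator delta
y_surv_formula <- Y + delta ~ X1

# Variables on the left-hand side, in the order they were written
s_lhs      <- all.vars(s_formula[[2]])
y_lhs      <- all.vars(y_formula[[2]])
y_surv_lhs <- all.vars(y_surv_formula[[2]])
```

Here s_lhs contains the treatment followed by the post-randomization variable, and y_surv_lhs contains the response followed by the censoring indicator.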

Multiple post-randomization confounding variables

If multiple post-randomization confounding variables exist, one can specify all of them using the following syntax:

treatment + postrand_1 + postrand_2 + ... + postrand_n ~ terms

The post-randomization confounding variables are provided in place of postrand_1 to postrand_n. In the current version, all of these variables should be binary indicators. Note that the order of these post-randomization confounding variables does not affect the parameter estimates, but it matters when specifying other arguments, such as strata and ER (see PStrata).
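Since every left-hand-side variable is expected to be a binary indicator, it can be useful to check the coding before building the formula. The following is a sketch in base R; the helper is_binary_lhs and the data frame dat are hypothetical and not part of PStrata.

```r
# Hypothetical data: treatment Z and two post-randomization variables D1, D2
dat <- data.frame(
  Z  = c(0, 0, 1, 1, 1, 0),
  D1 = c(0, 1, 1, 0, 1, 0),
  D2 = c(0, 0, 1, 1, 0, 2),   # contains a 2, so it is not a binary indicator
  X  = c(1.2, 0.7, 3.1, 2.2, 0.5, 1.9)
)

s_formula <- Z + D1 + D2 ~ X

# Hypothetical helper: for each left-hand-side variable, is it coded 0/1 only?
is_binary_lhs <- function(formula, data) {
  lhs <- all.vars(formula[[2]])
  vapply(lhs, function(v) all(data[[v]] %in% c(0, 1)), logical(1))
}

binary_ok <- is_binary_lhs(s_formula, dat)
```

The result is a named logical vector in the order the variables appear in the formula, which is also the order that matters for the other arguments mentioned above.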

Non-linear transformation of the predictors

The syntax for the predictors follows the conventions used in formula. The terms part consists of a series of terms concatenated by +, each term being the name of a variable or the interaction of several variables separated by :.

Apart from + and :, a number of other operators are also useful. The * operator is a shorthand for factor crossing: a*b is interpreted as a + b + a:b. The ^ operator means factor crossing to a specific degree. For example, (a + b + c)^2 is interpreted as (a + b + c) * (a + b + c), which is identical to a + b + c + a:b + a:c + b:c. The - operator removes specified terms, so that (a + b + c)^2 - a:b is identical to a + b + c + a:c + b:c. The - operator can also be used to remove the intercept term, as in x - 1. One can also use x + 0 to remove the intercept term.
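The expansions described above can be checked with terms() from base R. The sketch below is independent of PStrata; term_labels is a small helper defined here for illustration.

```r
# Labels of the terms that a formula expands to
term_labels <- function(f) attr(terms(f), "term.labels")

star  <- term_labels(~ a * b)                 # crossing
power <- term_labels(~ (a + b + c)^2)         # crossing to degree 2
minus <- term_labels(~ (a + b + c)^2 - a:b)   # same, with a:b removed

# Intercept flag: 1 if an intercept is present, 0 if it was removed
int_default <- attr(terms(~ x), "intercept")
int_minus1  <- attr(terms(~ x - 1), "intercept")
int_plus0   <- attr(terms(~ x + 0), "intercept")
```

Inspecting star, power and minus shows the term lists given in the text, and both x - 1 and x + 0 set the intercept flag to 0.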

Arithmetic expressions such as a + log(b) are also legal. However, arithmetic expressions may contain symbols that have a special meaning in formulas, such as +, *, ^ and -. To avoid confusion, the function I() can be used to bracket portions where the operators should be interpreted in their arithmetic sense. For example, in x + I(y + z), the term y + z is interpreted as the sum of y and z.
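The effect of I() can be seen by comparing design matrices built with model.matrix() from base R. The data frame d below contains made-up values for illustration.

```r
d <- data.frame(x = c(1, 2, 3, 4), y = c(2, 0, 1, 5), z = c(10, 20, 30, 40))

# Without I(): + is a formula operator, so y and z enter as separate predictors
mm_plain <- model.matrix(~ x + y + z, data = d)

# With I(): y + z is evaluated arithmetically and enters as a single predictor
mm_sum <- model.matrix(~ x + I(y + z), data = d)

colnames(mm_plain)
colnames(mm_sum)
```

The second matrix has a single column named I(y + z) whose entries are the row-wise sums of y and z.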

Group level random effect

Effects that are assumed to vary across groups can be specified by adding terms of the form gterms | group, where group refers to the group indicator (usually a factor), and gterms specifies the terms whose coefficients are group-specific, drawn from a population-level normal distribution.

The most common use of group-level random effects is to include group-specific intercepts to account for unmeasured confounding. For example, x + y + (1 | g) specifies a model with population-level predictors x and y, as well as a random intercept for each level of g.
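To make the random-intercept term more concrete: a term such as (1 | g) conceptually corresponds to one intercept per level of g, which can be pictured as an indicator matrix of group membership. The base R sketch below builds such a matrix for made-up data. It illustrates the idea only and is not necessarily how PStrata constructs or stores its random-effect design matrices.

```r
# Hypothetical grouping factor with three levels
g <- factor(c("a", "a", "b", "b", "b", "c"))

# One column per level of g; each row has a single 1 marking its group
group_indicator <- model.matrix(~ 0 + g)

# Number of observations in each group
group_sizes <- colSums(group_indicator)
```

Each row of group_indicator sums to 1, so every observation receives exactly one group-specific intercept.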

For more complex random effect structures, refer to lme4::lmer. However, structures other than simple random intercepts and slopes may lead to unexpected behavior.

Value

PSFormula returns an object of class PSFormula, which is a list containing the following components.

full_formula

input formula as is

data

input data frame

fixed_eff_formula

input formula with only fixed effects

response_names

character vector with names of variables that appear on the left hand side of input formula

has_random_effect

logical indicating whether random effects are specified in the input formula

has_intercept

logical indicating whether the input formula has an intercept

fixed_eff_names

character vector with names of all variables included as fixed effects

fixed_eff_count

integer indicating the number of variables (factors are converted to and counted as dummy variables)

fixed_eff_matrix

fixed-effect design matrix

random_eff_list

a list containing information for each random effect. Such information is a list with the corresponding design matrix, the term names and the factor levels.

See Also

formula, lmer.

Examples

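# Toy data: covariate X, treatment Z, post-randomization variable D, cluster R
df <- data.frame(
  X = 1:10,
  Z = c(0,0,0,0,0,1,1,1,1,1),
  D = c(0,0,0,1,1,1,0,0,1,1),
  R = c(1,1,1,1,2,2,2,3,3,3)
)
# Principal stratum model with a quadratic term in X and a random intercept per cluster
PSFormula(Z + D ~ X + I(X^2) + (1 | R), df)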


[Package PStrata version 0.0.5 Index]