R: Simulate a Node Using Logistic Regression

node_binomial {simDAG}

R Documentation

Simulate a Node Using Logistic Regression

Description

Data from the parents is used to generate the node using logistic regression: the covariate-specific probability of 1 is predicted for each observation, and the node is then sampled from a Bernoulli distribution with that probability.

Usage

node_binomial(data, parents, formula=NULL, betas, intercept,
              return_prob=FALSE, coerce2factor=FALSE,
              coerce2numeric=FALSE, labels=NULL)

Arguments

`data`	A `data.table` (or something that can be coerced to a `data.table`) containing all columns specified by `parents`.
`parents`	A character vector specifying the names of the parents that this particular child node has. If non-linear combinations or interaction effects should be included, the user may specify the `formula` argument instead.
`formula`	An optional `formula` object to describe how the node should be generated or `NULL` (default). If supplied it should start with `~`, having nothing else on the left hand side. The right hand side may contain any valid formula syntax, such as `A + B` or `A + B + I(A^2)`, allowing non-linear effects. If this argument is defined, there is no need to define the `parents` argument. For example, using `parents=c("A", "B")` is equal to using `formula= ~ A + B`.
`betas`	A numeric vector with the same length as `parents`, specifying the causal beta coefficients used to generate the node.
`intercept`	A single number specifying the intercept that should be used when generating the node.
`return_prob`	Either `TRUE` or `FALSE` (default). If `TRUE`, the calculated probability is returned instead of the results of the Bernoulli trials.
`coerce2factor`	Either `TRUE` or `FALSE` (default). If `TRUE`, the resulting vector is coerced to a factor variable. Levels of this factor can be set using the `labels` argument.
`coerce2numeric`	Either `TRUE` or `FALSE` (default). If `TRUE`, the resulting vector is coerced to a numeric variable (0/1).
`labels`	A character vector of length 2 or `NULL` (default). If `NULL`, the resulting vector is returned as is. If a character vector is supplied, all `TRUE` values are replaced by the first entry of this vector and all `FALSE` values are replaced by the second entry of this vector. The output will then be a character variable, unless `coerce2factor` is set to `TRUE`, in which case it will be a factor variable.
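
The relationship between `formula` and `betas` can be illustrated with a small base R sketch. It uses `model.matrix()` to expand a one-sided formula into columns; whether `node_binomial` expands formulas in exactly this way internally is an assumption, as is the requirement of one coefficient per resulting column. The data and coefficient values are made up for illustration.

```r
# hypothetical parent data (not part of simDAG)
dat <- data.frame(A = c(1, 2, 3), B = c(0, 1, 0))

# expand the one-sided formula into a design matrix and drop the
# intercept column, since the intercept is supplied separately
X <- model.matrix(~ A + B + I(A^2), data = dat)
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

# three terms on the right-hand side give three columns
term_names <- colnames(X)

# one (made-up) coefficient per column, plus a made-up intercept
betas <- c(0.2, 1.3, -0.05)
intercept <- -1

# linear predictor for each observation
lin_pred <- intercept + as.vector(X %*% betas)
```

Under these assumptions, a formula with a non-linear term such as `I(A^2)` needs one more coefficient than the equivalent `parents=c("A", "B")` specification.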

Details

Using the normal form of a logistic regression model, the observation-specific event probability is calculated for every observation in the dataset. Using the rbernoulli function, this probability is then used to draw one Bernoulli sample for each observation in the dataset. If only the probability should be returned, return_prob should be set to TRUE.

Formal Description:

Formally, the data generation can be described as:

Y \sim Bernoulli(expit(\texttt{intercept} + \texttt{parents}_1 \cdot \texttt{betas}_1 + ... + \texttt{parents}_n \cdot \texttt{betas}_n)),

where Bernoulli(p) denotes one Bernoulli trial with success probability p, n is the number of parents (length(parents)) and expit(x) is the inverse of the logit function, which maps the linear predictor to a probability between 0 and 1:

expit(x) = \frac{1}{1 + e^{-x}}.

For example, given intercept=-15, parents=c("A", "B") and betas=c(0.2, 1.3) the data generation process is defined as:

Y \sim Bernoulli(expit(-15 + A \cdot 0.2 + B \cdot 1.3)).
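
The example above can be sketched in base R without simDAG. This is a minimal illustration, not the package's implementation: it assumes that the linear predictor is mapped to a probability with the inverse logit (`stats::plogis`), as is standard in logistic regression, and the distributions used for the parents A and B are made up.

```r
set.seed(42)
n <- 1000

# hypothetical parent nodes (the distributions are illustrative assumptions)
A <- rnorm(n, mean = 50, sd = 4)
B <- rbinom(n, size = 1, prob = 0.5)

intercept <- -15
betas <- c(0.2, 1.3)

# linear predictor: intercept plus the weighted sum of the parents
lin_pred <- intercept + A * betas[1] + B * betas[2]

# inverse logit turns the linear predictor into a probability in (0, 1)
prob <- plogis(lin_pred)

# one Bernoulli trial per observation, stored as TRUE/FALSE
Y <- rbinom(n, size = 1, prob = prob) == 1
```

Here `prob` corresponds to what would be returned with return_prob=TRUE, and `Y` to the default logical output.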

Output Format:

By default, this function returns a logical vector containing only TRUE and FALSE entries, where TRUE corresponds to an event and FALSE to no event. If those should be coded as 0/1 instead, the user can use the coerce2numeric argument. If they should be coded as a character variable with specific labels, the user can use the labels argument. To additionally output the result as a factor, the user may use the coerce2factor argument. If both coerce2factor and coerce2numeric are set to TRUE, the result will be a factor. The last three arguments of this function (coerce2factor, coerce2numeric and labels) are ignored if return_prob is set to TRUE.
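
The output codings described above correspond to ordinary base R conversions of a logical vector. The following sketch only illustrates what each representation looks like; it does not call simDAG, and the label values are made up.

```r
# a logical vector such as the default output (TRUE = event)
y <- c(TRUE, FALSE, TRUE, TRUE)

# 0/1 coding, comparable to coerce2numeric=TRUE
y_num <- as.numeric(y)  # 1 0 1 1

# character coding with hypothetical labels: the first label replaces TRUE,
# the second replaces FALSE, comparable to labels=c("smoker", "non-smoker")
lab <- c("smoker", "non-smoker")
y_chr <- ifelse(y, lab[1], lab[2])

# factor coding of the labelled values; the level order chosen here is an
# assumption and may differ from what the package produces
y_fac <- factor(y_chr, levels = lab)
```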

Value

Returns a logical vector (or numeric vector if return_prob=TRUE) of length nrow(data).

Author(s)

Robin Denz

Examples

library(simDAG)

set.seed(5425)

# define needed DAG
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("smoking", type="binomial", parents=c("age", "sex"),
       betas=c(1.1, 0.4), intercept=-2)

# simulate data from it
sim_dat <- sim_from_dag(dag=dag, n_sim=100)

# returning only the estimated probability instead
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("smoking", type="binomial", parents=c("age", "sex"),
       betas=c(1.1, 0.4), intercept=-2, return_prob=TRUE)

sim_dat <- sim_from_dag(dag=dag, n_sim=100)

[Package simDAG version 0.1.2 Index]