R: Generate familial competing risks data

simfam_cmp {FamEvent}

R Documentation

Generate familial competing risks data

Description

Generates familial competing risks data for a specified study design, genetic model and source of residual familial correlation. The generated data frame has the same family structure as that produced by the simfam function, including each individual's ID, father ID, mother ID, relationship to proband, generation, gender, current age, and genotypes of the major or second genes.

Usage

simfam_cmp(N.fam, design = "pop+", variation = "none", interaction = FALSE, 
         depend = NULL, base.dist = c("Weibull", "Weibull"), frailty.dist = "none", 
         base.parms = list(c(0.016, 3), c(0.016, 3)), 
         vbeta = list(c(-1.13, 2.35), c(-1, 2)), allelefreq = 0.02, dominant.m = TRUE, 
         dominant.s = TRUE, mrate = 0, hr = 0, probandage = c(45, 2), 
         agemin = 20, agemax = 100)

Arguments

`N.fam`	Number of families to generate.
`design`	Family based study design used in the simulations. Possible choices are: `"pop"`, `"pop+"`, `"cli"`, `"cli+"` or `"twostage"`, where `"pop"` is for the population-based design that families are ascertained by affected probands, `"pop+"` is similar to `"pop"` but with mutation carrier probands, `"cli"` is for the clinic-based design that includes affected probands with at least one parent and one sib affected, `"cli+"` is similar to `"cli"` but with mutation carrier probands and `"twostage"` for two-stage design that randomly samples families from the population in the first stage and oversamples high risk families in the second stage that include at least two affected members in the family. Default is `"pop+"`.
`variation`	Source of residual familial correlation. Possible choices are: `"frailty"` for frailty shared within families, `"secondgene"` for second gene variation, or `"none"` for no residual familial correlation. Default is `"none"`.
`interaction`	Logical; if `TRUE`, allows the interaction between gender and mutation status. Two logical values, one for each competing event, should be specified; if only one logical value is provided, the same value is assumed for both events. Default is `FALSE`.
`depend`	Frailty parameters. Two values, one for each competing event, should be specified when `frailty.dist = "gamma"` or `frailty.dist = "lognormal"`; three values should be specified when `frailty.dist = "cgamma"` or `frailty.dist = "clognormal"`. The first two values represent the inverse of the variance for each competing event and the third value represents the correlation between the two events.
`base.dist`	Choice of baseline hazard distribution. Possible choices are: `"Weibull"`, `"loglogistic"`, `"Gompertz"`, `"lognormal"`, `"gamma"`, `"logBurr"`. Default is `"Weibull"`. Two distributions, one for each competing event, should be specified. If only one distribution is specified, the same distribution is assumed for both events.
`frailty.dist`	Choice of frailty distribution. Possible choices are `"gamma"` for independent gamma, `"lognormal"` for independent lognormal, `"cgamma"` for correlated gamma, or `"clognormal"` for correlated lognormal distribution. Default is `"none"`.
`base.parms`	List of two vectors of baseline parameters, one vector for each event. For example, `base.parms=list(c(lambda1, rho1), c(lambda2, rho2))` should be specified for `base.dist=c("Weibull", "Weibull")`. Each vector should contain two parameters, `c(lambda, rho)`, for `base.dist="Weibull"`, `"loglogistic"`, `"Gompertz"`, `"gamma"`, and `"lognormal"`, and three parameters, `c(lambda, rho, eta)`, for `base.dist="logBurr"`.
`vbeta`	List of two vectors of regression coefficients for each event should be specified. Each vector contains regression coefficients for gender, majorgene, interaction between gender and majorgene (if `interaction = TRUE`), and secondgene (if `variation = "secondgene"`).
`allelefreq`	Population allele frequencies of major disease gene. Value should be between 0 and 1. Vector of population allele frequencies for major and second disease genes should be provided when `variation = "secondgene"`. Default value is `allelefreq = 0.02`.
`dominant.m`	Logical; if `TRUE`, the genetic model of major gene is dominant, otherwise recessive.
`dominant.s`	Logical; if `TRUE`, the genetic model of second gene is dominant, otherwise recessive.
`mrate`	Proportion of missing genotypes, value between 0 and 1. Default value is 0.
`hr`	Proportion of high-risk families, which include at least two affected members, to be sampled in the two-stage sampling. This value should be specified when `design="twostage"` and should lie between 0 and 1. Default value is 0.
`probandage`	Vector of mean and standard deviation for the proband age. Default values are mean of 45 years and standard deviation of 2 years, `probandage = c(45, 2)`.
`agemin`	Minimum age of disease onset or minimum age. Default is 20 years of age.
`agemax`	Maximum age of disease onset or maximum age. Default is 100 years of age.

Details

Competing risk model

Event 1:

h₁(t|X,Z) = h₀₁(t - t₀) Z₁ exp(β_s1 * x_s + β_g1 * x_g),

Event 2:

h₂(t|X,Z) = h₀₂(t - t₀) Z₂ exp(β_s2 * x_s + β_g2 * x_g),

where h₀₁(t) and h₀₂(t) are the baseline hazard functions for event 1 and event 2, respectively, t₀ is the minimum age of disease onset, Z₁ and Z₂ are frailties shared within families for each event and follow either a gamma, log-normal, correlated gamma, or correlated log-normal distribution, and x_s and x_g indicate male (1) or female (0) and carrier (1) or non-carrier (0) of a main gene of interest, respectively.
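As an illustration of the model above, the following base R sketch draws latent onset ages from two cause-specific Weibull hazards and records whichever event occurs first. It assumes the Weibull cumulative baseline hazard has the form H₀(t) = (λt)^ρ and fixes the frailties at 1; `r_weibull_ph` is a hypothetical helper written for this example, not a FamEvent function, and the package's internal algorithm may differ.

```r
# Sketch only: assumes cumulative baseline hazard H0(t) = (lambda * t)^rho.
# r_weibull_ph is a hypothetical helper, not part of FamEvent.
r_weibull_ph <- function(n, lambda, rho, lp, z = 1, t0 = 20) {
  u <- runif(n)
  # Solve z * exp(lp) * (lambda * (t - t0))^rho = -log(u) for t
  t0 + (-log(u) / (z * exp(lp)))^(1 / rho) / lambda
}

set.seed(1)
n   <- 1000
x_s <- rbinom(n, 1, 0.5)   # 1 = male, 0 = female
x_g <- rbinom(n, 1, 0.5)   # 1 = carrier, 0 = non-carrier (illustrative frequency)

# Linear predictors built from the default vbeta values shown in Usage
lp1 <- -1.13 * x_s + 2.35 * x_g
lp2 <- -1.00 * x_s + 2.00 * x_g

t1 <- r_weibull_ph(n, lambda = 0.016, rho = 3, lp = lp1)
t2 <- r_weibull_ph(n, lambda = 0.016, rho = 3, lp = lp2)

onset <- pmin(t1, t2)               # age at first event
event <- ifelse(t1 <= t2, 1L, 2L)   # which event occurred first
table(event, carrier = x_g)
```

Drawing the two latent times independently and keeping the minimum reproduces the stated cause-specific hazards; censoring by current age is left out of this sketch.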

Choice of frailty distributions for competing risk models

frailty.dist = "gamma" shares the frailties within families generated from a gamma distribution independently for each competing event, where Z_j follows Gamma(k_j, 1/k_j).

frailty.dist = "lognormal" shares the frailties within families generated from a log-normal distribution independently for each competing event, where Z_j follows a log-normal distribution with mean 0 and variance 1/k_j.

frailty.dist = "cgamma" shares the frailties within families generated from a correlated gamma distribution to allow the frailties between two events to be correlated, where the correlated gamma frailties (Z₁, Z₂) are generated with three independent gamma frailties (Y₀, Y₁, Y₂) as follows:

Z₁ = k₀/(k₀ + k₁) Y₀ + Y₁

Z₂ = k₀/(k₀ + k₂) Y₀ + Y₂

where Y₀ is from Gamma(k₀, 1/k₀); Y₁ is from Gamma(k₁, 1/(k₀ + k₁)); Y₂ is from Gamma(k₂, 1/(k₀ + k₂)).
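The correlated gamma construction above can be sketched in base R as follows. This is a minimal illustration that assumes Gamma(a, b) denotes shape a and scale b; the parameter values are illustrative and the code is not taken from the package.

```r
# Sketch only: assumes Gamma(a, b) means shape = a, scale = b.
set.seed(2)
n_fam <- 1e5
k1 <- 1; k2 <- 2; k0 <- 0.5   # illustrative values, ordered as in depend = c(k1, k2, k0)

# Three independent gamma components, one shared (y0) and one per event
y0 <- rgamma(n_fam, shape = k0, scale = 1 / k0)
y1 <- rgamma(n_fam, shape = k1, scale = 1 / (k0 + k1))
y2 <- rgamma(n_fam, shape = k2, scale = 1 / (k0 + k2))

# Family-level frailties for event 1 and event 2
z1 <- k0 / (k0 + k1) * y0 + y1
z2 <- k0 / (k0 + k2) * y0 + y2

c(mean_z1 = mean(z1), mean_z2 = mean(z2), cor = cor(z1, z2))
```

Under this parametrization each frailty has mean 1, and the shared component Y₀ induces a positive correlation between Z₁ and Z₂.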

frailty.dist = "clognormal" shares the frailties within families generated from a correlated log-normal distribution where log(Z_j) follows a normal distribution with mean 0, variance 1/k_j and correlation between two events k₀.

depend should specify the values of related frailty parameters: c(k1, k2) with frailty.dist = "gamma" or frailty.dist = "lognormal"; c(k1, k2, k0) for frailty.dist = "cgamma" or frailty.dist = "clognormal".

The current ages for each generation are simulated assuming normal distributions. However, the probands' ages are generated using a left-truncated normal distribution, as their ages cannot be less than the minimum age of onset. The average age difference between each generation and their parents is set to 20 years.
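One common way to sample from a left-truncated normal distribution is the inverse-CDF method, sketched below in base R. `r_trunc_norm` is a hypothetical helper written for this example, and the parent-age line simply shifts the mean by 20 years with the same standard deviation as an assumption; the package may generate ages differently.

```r
# Sketch only: left-truncated normal via the inverse-CDF method.
# r_trunc_norm is a hypothetical helper, not part of FamEvent.
r_trunc_norm <- function(n, mean, sd, lower) {
  p_lower <- pnorm(lower, mean = mean, sd = sd)   # probability mass below the bound
  u <- runif(n, min = p_lower, max = 1)           # uniform draws on the allowed tail
  qnorm(u, mean = mean, sd = sd)
}

set.seed(3)
# Proband ages with probandage = c(45, 2) and agemin = 20
proband_age <- r_trunc_norm(1000, mean = 45, sd = 2, lower = 20)

# Parents' ages centred 20 years above the probands (sd assumed equal for illustration)
parent_age <- rnorm(1000, mean = 45 + 20, sd = 2)

range(proband_age)
```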

The design argument defines the type of family-based design to be simulated. Two variants of the population-based and clinic-based designs can be chosen: "pop" when the proband is affected, "pop+" when the proband is an affected mutation carrier, "cli" when the proband is affected and at least one parent and one sibling are affected, "cli+" when the proband is an affected mutation carrier and at least one parent and one sibling are affected. The two-stage design, "twostage", is used to oversample high-risk families, where the proportion of high-risk families to include in the sample is specified by hr. High-risk families are those that include multiple (at least two) affected members.
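To make the "cli" criterion concrete, the sketch below checks it on the rows of a single family using the documented proband, relation and status columns. It assumes that "affected" means status is non-zero (either event), which may not match the package's ascertainment rule; `meets_cli` and the toy data are hypothetical.

```r
# Sketch only: checks the "cli" criterion on one family's rows.
# Assumes "affected" means status != 0 (either event); meets_cli is hypothetical.
meets_cli <- function(fam) {
  affected <- fam$status != 0
  proband_affected <- any(affected & fam$proband == 1)
  parent_affected  <- any(affected & fam$relation == 4)  # 4 = parent
  sibling_affected <- any(affected & fam$relation == 2)  # 2 = brother or sister
  proband_affected && parent_affected && sibling_affected
}

# Toy family: two parents, the proband, one sibling, one child
toy <- data.frame(
  indID    = 1:5,
  proband  = c(0, 0, 1, 0, 0),
  relation = c(4, 4, 1, 2, 3),
  status   = c(1, 0, 2, 1, 0)
)
meets_cli(toy)   # TRUE: proband, one parent and one sibling are affected
```

A rejection-sampling loop that keeps only families passing such a check would explain why stricter designs need more iterations.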

Note that simulating family data under the clinic-based designs ("cli" or "cli+") or the two-stage design can be slower, since the ascertainment criteria for high-risk families are difficult to meet in such settings. In particular, the "cli" design can be slower than the "cli+" design: in "cli", the proband's mutation status is randomly selected from the disease population, so family members are less likely to be mutation carriers and less likely to be affected, whereas in "cli+" the probands are all mutation carriers, so their family members are more likely to be carriers and affected by disease. Therefore, the "cli" design requires more iterations to sample high-risk families than the "cli+" design.

Value

Returns an object of class 'simfam', a data frame which contains:

famID

Family identification (ID) numbers.

indID

Individual ID numbers.

gender

Gender indicators: 1 for males, 0 for females.

motherID

Mother ID numbers.

fatherID

Father ID numbers.

proband

Proband indicators: 1 if the individual is the proband, 0 otherwise.

generation

Individual's generation: 1 = parents of probands, 2 = probands and siblings, 3 = children of probands and siblings.

majorgene

Genotypes of major gene: 1=AA, 2=Aa, 3=aa where A is disease gene.

secondgene

Genotypes of second gene: 1=BB, 2=Bb, 3=bb where B is disease gene.

ageonset

Ages at disease onset in years.

currentage

Current ages in years.

time

Ages at disease onset for the affected or ages of last follow-up for the unaffected.

status

Disease statuses: 1 for affected by event 1, 2 for affected by event 2, 0 for unaffected (censored).

mgene

Major gene mutation indicators: 1 for mutated gene carriers, 0 for mutated gene noncarriers, or NA if missing.

relation

Family members' relationship with the proband:

1	Proband (self)
2	Brother or sister
3	Son or daughter
4	Parent
5	Nephew or niece
6	Spouse
7	Brother or sister in law

fsize

Family size including parents, siblings and children of the proband and the siblings.

naff

Number of affected members by either event 1 or 2 within family.

df1

Number of affected members by event 1 within family.

df2

Number of affected members by event 2 within family.

weight

Sampling weights.
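The relationship between status and the family-level counts naff, df1 and df2 described above can be sketched on a toy data frame. The data here are made up for illustration and use only the documented column names; the code is not the package's own.

```r
# Sketch only: toy data using the documented famID and status columns.
toy <- data.frame(
  famID  = c(1, 1, 1, 1, 2, 2, 2),
  status = c(0, 1, 2, 0, 1, 1, 0)   # 0 = unaffected, 1 = event 1, 2 = event 2
)

df1  <- tapply(toy$status == 1, toy$famID, sum)  # affected by event 1, per family
df2  <- tapply(toy$status == 2, toy$famID, sum)  # affected by event 2, per family
naff <- df1 + df2                                # affected by either event

data.frame(famID = names(naff),
           naff  = as.vector(naff),
           df1   = as.vector(df1),
           df2   = as.vector(df2))
```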

Author(s)

Yun-Hee Choi

References

Choi, Y.-H., Briollais, L., He, W. and Kopciuk, K. (2021) FamEvent: An R Package for Generating and Modeling Time-to-Event Data in Family Designs, Journal of Statistical Software 97 (7), 1-30. doi:10.18637/jss.v097.i07.

Choi, Y.-H., Jung, H., Buys, S., Daly, M., John, E.M., Hopper, J., Andrulis, I., Terry, M.B., Briollais, L. (2021) A Competing Risks Model with Binary Time Varying Covariates for Estimation of Breast Cancer Risks in BRCA1 Families, Statistical Methods in Medical Research 30 (9), 2165-2183. https://doi.org/10.1177/09622802211008945.

Choi, Y.-H., Kopciuk, K. and Briollais, L. (2008) Estimating Disease Risk Associated with Mutated Genes in Family-Based Designs, Human Heredity 66, 238-251.

Choi, Y.-H. and Briollais, L. (2011) An EM Composite Likelihood Approach for Multistage Sampling of Family Data with Missing Genetic Covariates, Statistica Sinica 21, 231-253.

Examples


## Example 1: simulate competing risk family data from pop+ design using
#  Weibull distribution for both baseline hazards and inducing 
#  residual familial correlation through a correlated gamma frailty.

set.seed(4321)
fam <- simfam_cmp(N.fam = 10, design = "pop+", variation = "frailty", 
       base.dist = "Weibull", frailty.dist = "cgamma", depend=c(1, 2, 0.5), 
       allelefreq = 0.02, base.parms = list(c(0.01, 3), c(0.01, 3)), 
       vbeta = list(c(-1.13, 2.35), c(-1, 2)))


head(fam) 

## Not run: 
  famID indID gender motherID fatherID proband generation majorgene secondgene  ageonset
1     1     1      1        0        0       0          1         3          0 124.23752
2     1     2      0        0        0       0          1         2          0  54.66936
3     1     3      0        2        1       1          2         2          0  32.75208
4     1     4      1        0        0       0          0         3          0 136.44926
5     1    11      1        3        4       0          3         3          0  71.53672
6     1    12      1        3        4       0          3         3          0 152.47073
  currentage     time status true_status mgene relation fsize naff df1 df2 weight
1   65.30602 65.30602      0           2     0        4    25    2   1   1      1
2   68.62107 54.66936      1           1     1        4    25    2   1   1      1
3   47.07842 32.75208      2           2     1        1    25    2   1   1      1
4   45.09295 45.09295      0           2     0        6    25    2   1   1      1
5   25.32819 25.32819      0           1     0        3    25    2   1   1      1
6   22.95059 22.95059      0           2     0        3    25    2   1   1      1

## End(Not run)

summary(fam)

plot(fam, famid = 1) # pedigree plots for family with ID = 1

[Package FamEvent version 3.2 Index]