R: Approximate a linear model using PCSS

pcsslm {pcsstools}

R Documentation

Approximate a linear model using PCSS

Description

pcsslm approximates a linear model of a combination of variables using precomputed summary statistics.

Usage

pcsslm(formula, pcss = list(), ...)

Arguments

`formula`	an object of class formula whose dependent variable is a combination of variables joined by arithmetic (+, *) or logical (|, &) operators. All model terms must have appropriate PCSS in `pcss`.
`pcss`	a list of precomputed summary statistics. In all cases, this should include `n`: the sample size, `means`: a named vector of predictor and response means, and `covs`: a named covariance matrix including all predictors and responses. See Details for more information.
`...`	additional arguments. See Details for more information.

Details

pcsslm parses the input formula's dependent variable for functions such as sums (+), products (*), or logical operators (| and &). It then models the combination of variables using one of model_combo, model_product, model_or, model_and, or model_prcomp.

Different precomputed summary statistics are needed inside pcss depending on the function that combines the dependent variable.

For linear combinations (and principal component analysis), only n, means, and covs are required.
For products and logical combinations, the additional items predictors and responses are required. These are named lists of objects of class predictor generated by new_predictor, with one predictor object for each independent variable in predictors and for each dependent variable in responses. However, when modeling the product or logical combination of only two variables, responses can be NULL without consequence.

If modeling a principal component score of a set of variables, include the argument comp, an integer indicating which principal component score to analyze. Optional logical arguments center and standardize determine whether responses should be centered and standardized before principal components are calculated.
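
As an illustration of why a covariance matrix is enough for the principal component case, the following base R sketch recovers the weights and variance of a principal component score from the covariance matrix alone and compares them with prcomp on the raw data. The simulated matrix Y and all object names are assumptions made for this example; the sketch does not call pcsstools and is not a description of how model_prcomp is implemented.

```r
# Simulated responses; Y and every name below are illustrative only.
set.seed(2)
n <- 150
Y <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, c("y1", "y2", "y3")))
Y[, 2] <- Y[, 2] + 0.6 * Y[, 1]
Y[, 3] <- Y[, 3] - 0.4 * Y[, 1]

# The only summary statistic used: the covariance matrix of the responses
covs <- cov(Y)

# Weights of the chosen component are an eigenvector of the covariance matrix
comp <- 1
w <- eigen(covs, symmetric = TRUE)$vectors[, comp]

# Reference: principal components computed from the individual-level data
pc <- prcomp(Y, center = TRUE, scale. = FALSE)
w_ref <- pc$rotation[, comp]

# Eigenvectors are only defined up to sign, so compare absolute values
same_weights <- isTRUE(all.equal(abs(w), abs(unname(w_ref))))

# The variance of the score is w' S w, which matches prcomp's sdev^2
var_score <- drop(t(w) %*% covs %*% w)
same_variance <- isTRUE(all.equal(var_score, pc$sdev[comp]^2))
```

Because the score is a linear combination of the responses with weights w, it can then be treated like any other linear combination of variables.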

If modeling a linear combination, include the argument phi, a named vector giving the linear weight of each variable on the left-hand side of formula.
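
The reason n, means, and covs suffice for a linear combination can be sketched in base R: the least-squares coefficients of a weighted sum of responses depend on the data only through means and covariances. The simulated data frame dat and all object names below are assumptions for illustration; the code does not use pcsstools and is not a description of how model_combo is implemented.

```r
# Simulated individual-level data; names are illustrative only.
set.seed(1)
n  <- 200
g1 <- rbinom(n, 2, 0.3)
g2 <- rbinom(n, 2, 0.5)
y1 <- 0.5 * g1 + rnorm(n)
y2 <- -0.2 * g2 + rnorm(n)
dat <- data.frame(g1, g2, y1, y2)

# Summary statistics that would be shared instead of the raw data
means <- colMeans(dat)
covs  <- cov(dat)

# Weights defining the combination z = y1 - y2
phi <- c(y1 = 1, y2 = -1)
xs  <- c("g1", "g2")
ys  <- names(phi)

# Covariance of each predictor with z, by linearity of covariance
cov_xz <- covs[xs, ys] %*% phi

# Slopes solve Sxx b = Sxz; the intercept follows from the means
slopes    <- solve(covs[xs, xs], cov_xz)
mean_z    <- sum(phi * means[ys])
intercept <- mean_z - sum(means[xs] * slopes)
est <- c(intercept, slopes)

# Reference fit using the individual-level data
ref <- coef(lm(y1 - y2 ~ g1 + g2, data = dat))
coef_match <- isTRUE(all.equal(unname(est), unname(ref)))
```

The two sets of coefficients agree up to numerical tolerance, since the sample covariance matrix carries the same cross-products that least squares uses.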

If modeling a product, include the argument response, a character string equal to either "continuous" or "binary". If "binary", specialized approximations are used to estimate means and variances.
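
For binary (0/1) variables, products and logical combinations are closely related: a product equals the logical AND, and an OR can be written as a product through De Morgan's law. The base R sketch below checks both identities on simulated data. The variable names mirror the examples but the values are simulated here, and nothing in it is a claim about how model_product, model_and, or model_or work internally.

```r
# Simulated binary variables; values are illustrative only.
set.seed(3)
y4 <- rbinom(100, 1, 0.4)
y5 <- rbinom(100, 1, 0.6)

# The product of 0/1 variables is 1 exactly when both are 1 (logical AND)
and_via_product <- y4 * y5
and_ok <- identical(and_via_product == 1, y4 & y5)

# De Morgan: (y4 OR y5) = NOT (NOT y4 AND NOT y5)
or_via_product <- 1 - (1 - y4) * (1 - y5)
or_ok <- identical(or_via_product == 1, y4 | y5)
```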

Value

an object of class "pcsslm".

An object of class "pcsslm" is a list containing at least the following components:

`call`	the matched call
`terms`	the `terms` object used
`coefficients`	a `p x 4` matrix with columns for the estimated coefficient, its standard error, t-statistic and corresponding (two-sided) p-value.
`sigma`	the square root of the estimated variance of the random error.
`df`	degrees of freedom, a 3-vector `p, n-p, p*`, the first being the number of non-aliased coefficients, the last being the total number of coefficients.
`fstatistic`	a 3-vector with the value of the F-statistic with its numerator and denominator degrees of freedom.
`r.squared`	`R^2`, the 'fraction of variance explained by the model'.
`adj.r.squared`	the above `R^2` statistic 'adjusted', penalizing for higher `p`.
`cov.unscaled`	a `p x p` matrix of (unscaled) covariances of the `coef[j], j = 1, ..., p`.
`Sum Sq`	a 3-vector with the model's Sum of Squares Regression (SSR), Sum of Squares Error (SSE), and Sum of Squares Total (SST).

References

Wolf JM, Westra J, Tintle N (2021). “Using Summary Statistics to Model Multiplicative Combinations of Initially Analyzed Phenotypes With a Flexible Choice of Covariates.” Frontiers in Genetics, 12, 1962. ISSN 1664-8021, doi:10.3389/fgene.2021.745901, https://www.frontiersin.org/articles/10.3389/fgene.2021.745901/full.

Wolf JM, Barnard M, Xia X, Ryder N, Westra J, Tintle N (2020). “Computationally efficient, exact, covariate-adjusted genetic principal component analysis by leveraging individual marker summary statistics from large biobanks.” Pacific Symposium on Biocomputing, 25, 719–730. ISSN 2335-6928, doi:10.1142/9789811215636_0063, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6907735/.

Gasdaska A, Friend D, Chen R, Westra J, Zawistowski M, Lindsey W, Tintle N (2019). “Leveraging summary statistics to make inferences about complex phenotypes in large biobanks.” Pacific Symposium on Biocomputing, 24, 391–402. ISSN 2335-6928, doi:10.1142/9789813279827_0036, https://pubmed.ncbi.nlm.nih.gov/30963077/.

Examples

## Principal Component Analysis
ex_data <- pcsstools_example[c("g1", "x1", "y1", "y2", "y3")]
pcss <- list(
  means = colMeans(ex_data),
  covs = cov(ex_data),
  n = nrow(ex_data)
)

pcsslm(y1 + y2 + y3 ~ g1 + x1, pcss = pcss, comp = 1)

## Linear combination of variables
ex_data <- pcsstools_example[c("g1", "g2", "y1", "y2")]
pcss <- list(
  means = colMeans(ex_data),
  covs = cov(ex_data),
  n = nrow(ex_data)
)

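# phi supplies the weight of each response, so this models y1 - y2
pcsslm(y1 + y2 ~ g1 + g2, pcss = pcss, phi = c(1, -1))
# Reference fit on the individual-level data
summary(lm(y1 - y2 ~ g1 + g2, data = ex_data))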

## Product of variables
ex_data <- pcsstools_example[c("g1", "x1", "y4", "y5", "y6")]

pcss <- list(
  means = colMeans(ex_data),
  covs = cov(ex_data),
  n = nrow(ex_data),
  predictors = list(
    g1 = new_predictor_snp(maf = mean(ex_data$g1) / 2),
    x1 = new_predictor_normal(mean = mean(ex_data$x1), sd = sd(ex_data$x1))
  ),
  # One binary predictor object per response, built from its mean
  responses = lapply(
    colMeans(ex_data)[c("y4", "y5", "y6")],
    new_predictor_binary
  )
)

pcsslm(y4 * y5 * y6 ~ g1 + x1, pcss = pcss, response = "binary")
summary(lm(y4 * y5 * y6 ~ g1 + x1, data = ex_data))

## Disjunction (OR statement) of variables
ex_data <- pcsstools_example[c("g1", "x1", "y4", "y5")]

pcss <- list(
  means = colMeans(ex_data),
  covs = cov(ex_data),
  n = nrow(ex_data),
  predictors = list(
    g1 = new_predictor_snp(maf = mean(ex_data$g1) / 2),
    x1 = new_predictor_normal(mean = mean(ex_data$x1), sd = sd(ex_data$x1))
  )
)
# responses can be omitted because only two variables are combined
pcsslm(y4 | y5 ~ g1 + x1, pcss = pcss)
summary(lm(y4 | y5 ~ g1 + x1, data = ex_data))

[Package pcsstools version 0.1.2 Index]