site_outliers {bulkQC} | R Documentation |
Identifies site-level outliers
Description
Identifies potential site-level outliers using unadjusted and adjusted regression models and standardized difference calculations.
Usage
site_outliers(d0, exclude = c("pid"), siteID = "site", covs = c("age"), threshG = 0.001,
thresh2 = 0.05, threshS = 0.5, n_uniq = 10, n_dec = 4, n_decS = 2)
Arguments
d0 |
A data frame with columns as variables and rows as observations |
exclude |
A vector of names of variables to exclude in outlier identification |
siteID |
The name of the variable in the data frame that identifies sites |
covs |
A vector of covariates to adjust for in the adjusted regression models |
threshG |
P-value threshold for the global test of equal means across sites |
thresh2 |
P-value threshold for comparison of reference site vs. all other sites |
threshS |
Standardized difference threshold above which a site difference is deemed meaningfully large |
n_uniq |
Number of unique values a variable must have for outlier identification to be performed on it |
n_dec |
Number of decimals to display for p-values in output |
n_decS |
Number of decimals to display for standardized differences in output |
Details
The function compares the distribution of each variable across sites, starting with a global test of equal means, run both without and with adjustment for the covariates of interest. For variables where the null hypothesis of equal means across sites is rejected, the function then compares each site (the reference site) with all other sites combined, using unadjusted and adjusted comparisons. The unadjusted comparisons are a two-sample t-test assuming equal variances and a standardized difference calculation. The adjusted comparisons are a linear regression model containing an indicator variable for the reference site plus the user-specified covariates, and an adjusted standardized difference calculated as the model coefficient for the site indicator divided by the model-estimated root mean squared error.
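The comparisons described above can be sketched in base R. This is an illustrative sketch under stated assumptions, not the package's implementation: the helper `compare_site` is hypothetical, the global tests are assumed to be F-tests from `lm` and `anova`, and the unadjusted standardized difference is assumed to be the difference in means divided by the square root of the average of the two group variances.

```r
# Illustrative sketch only; compare_site() is a hypothetical helper, not part of bulkQC.
set.seed(1)
d <- iris
d$temp <- rnorm(nrow(d))  # simulated covariate for adjustment

# Global tests of equal means across sites (assumed here to be F-tests)
p_global_unadj <- anova(lm(Sepal.Length ~ Species, data = d))[["Pr(>F)"]][1]
fit0 <- lm(Sepal.Length ~ temp, data = d)            # covariates only
fit1 <- lm(Sepal.Length ~ Species + temp, data = d)  # covariates plus site
p_global_adj <- anova(fit0, fit1)[["Pr(>F)"]][2]

compare_site <- function(d, y, site_var, site, covs) {
  ref <- as.integer(d[[site_var]] == site)  # 1 = reference site, 0 = all other sites
  yv <- d[[y]]
  # Unadjusted: two-sample t-test with equal variance
  p_unadj <- t.test(yv[ref == 1], yv[ref == 0], var.equal = TRUE)$p.value
  # Unadjusted standardized difference (assumed form: mean difference / pooled SD)
  sd_pooled <- sqrt((var(yv[ref == 1]) + var(yv[ref == 0])) / 2)
  stdf <- (mean(yv[ref == 1]) - mean(yv[ref == 0])) / sd_pooled
  # Adjusted: linear model with the site indicator plus covariates
  dat <- data.frame(y = yv, ref = ref, d[covs])
  fit <- lm(y ~ ., data = dat)
  co <- summary(fit)$coefficients
  p_adj <- co["ref", "Pr(>|t|)"]
  # Adjusted standardized difference: site coefficient / residual standard error
  stdf_adj <- co["ref", "Estimate"] / summary(fit)$sigma
  c(p_unadj = p_unadj, stdf = stdf, p_adj = p_adj, stdf_adj = stdf_adj)
}

res <- compare_site(d, "Sepal.Length", "Species", "setosa", "temp")
```

Looping `compare_site` over every site and every variable that passes the global test would give one cell per site-variable pair, which is the shape of the sitewise matrices described under Value.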
Value
overall |
A matrix whose rows are the variables for which the global test of equal means is rejected and whose columns are the corresponding p-values from the unadjusted and adjusted statistical tests |
sitewise_P |
For the variables identified by the global tests (columns), the unadjusted p-values (from the two-sample t-test) comparing each site to all other sites (rows). Values above the threshold (thresh2) are printed as missing. |
sitewise_P_adj |
For the variables identified by the global tests (columns), the adjusted p-values (from the linear regression model) comparing each site to all other sites (rows). Values above the threshold (thresh2) are printed as missing. |
sitewise_StDf |
For the variables identified by the global tests (columns), the unadjusted standardized differences comparing each site to all other sites (rows). Values below the threshold (threshS) are printed as missing. |
sitewise_StDf_adj |
For the variables identified by the global tests (columns), the adjusted standardized differences comparing each site to all other sites (rows). Values below the threshold (threshS) are printed as missing. |
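The thresholding and rounding of the sitewise matrices can be pictured with a small base R sketch. The matrices below are made-up values for illustration, and two details are assumptions rather than documented behavior: that "missing" means `NA`, and that standardized differences are compared to `threshS` in absolute value.

```r
# Hypothetical p-values and standardized differences (sites in rows, variables in columns)
p <- matrix(c(0.001, 0.20, 0.03, 0.60), nrow = 2,
            dimnames = list(c("siteA", "siteB"), c("var1", "var2")))
stdf <- matrix(c(0.80, -0.10, -0.55, 0.30), nrow = 2, dimnames = dimnames(p))

thresh2 <- 0.05  # sitewise p-value threshold
threshS <- 0.5   # standardized difference threshold
n_dec <- 4       # decimals for p-values
n_decS <- 2      # decimals for standardized differences

# Mask cells that do not meet the threshold (assumed to be shown as NA), then round
p_shown <- round(ifelse(p > thresh2, NA, p), n_dec)
stdf_shown <- round(ifelse(abs(stdf) < threshS, NA, stdf), n_decS)
```

With these made-up inputs, only the siteA row keeps its values in both matrices, so a reader scanning the output would see siteA flagged on both variables.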
References
Yang D, Dalton JE. A unified approach to measuring the effect size between two groups using SAS. 2012;6
Examples
data(iris)
iris2 = iris
iris2$temp = rnorm(nrow(iris2)) # simulated covariate for adjustment
site_outliers(iris2, siteID = "Species", covs = c("temp"))