longdat_cont {LongDat} | R Documentation |
Longitudinal analysis with time as continuous variable
Description
longdat_cont calculates the p values and effect sizes of time variables and discovers covariate effects from longitudinal data.
Usage
longdat_cont(
input,
data_type,
test_var,
variable_col,
fac_var,
not_used = NULL,
adjustMethod = "fdr",
model_q = 0.1,
posthoc_q = 0.05,
theta_cutoff = 2^20,
nonzero_count_cutoff1 = 9,
nonzero_count_cutoff2 = 5,
verbose = TRUE
)
Arguments
input |
A data frame with the first column as "Individual" and all the columns of dependent variables (features, e.g. bacteria) at the end of the table. The time variable here should be continuous; if time is discrete, please use longdat_disc() instead. Please use only printable ASCII characters in the names of potential covariates (covariates are any columns apart from individual, test_var and dependent variables). |
data_type |
The data type of the dependent variables (features). Can be one of "proportion", "measurement", "count", "binary", "ordinal" or "others". Proportion (or ratio) data range from 0 to 1. Measurement data are continuous and can be measured at finer and finer scale (e.g. weight). Count data consist of discrete non-negative integers resulting from counting. Binary data sort things into one of two mutually exclusive categories. Ordinal data consist of ranks. Any data that don't belong to the previous categories should be classified as "others". |
test_var |
The name of the independent variable you are testing for. It should be a string (e.g. "Time") identical to its column name, and it must not contain spaces. |
variable_col |
The column number of the position where the dependent variable columns (features, e.g. bacteria) start in the table. |
fac_var |
The column numbers of the columns that aren't numerical (e.g. characters, categorical numbers, ordinal numbers). This should be a numerical vector (e.g. c(1, 2, 5:7)). |
not_used |
The column positions of the columns that are irrelevant and can be ignored in the analysis. This should be a numerical vector, and the default is NULL. |
adjustMethod |
The method for multiple testing p value correction. Choices are the ones in p.adjust(), including 'holm', 'hochberg', 'hommel', 'bonferroni', 'BH', 'BY' and 'fdr'. The default is 'fdr'. |
model_q |
The threshold for significance of model test after multiple testing correction. The default is 0.1. |
posthoc_q |
The threshold for significance of post-hoc test after multiple testing correction. The default is 0.05. |
theta_cutoff |
Required when the data type is set as "count". Variables with a theta value from negative binomial regression larger than or equal to the cutoff will be filtered out if they also don't meet the non-zero count threshold. Users can use the function "theta_plot()" to help with specifying the value for theta_cutoff. The default is 2^20. |
nonzero_count_cutoff1 |
Required when the data type is set as "count". Variables with non-zero counts lower than or equal to this value will be filtered out if they don't meet the theta threshold either. Users can use the function "theta_plot()" to help with specifying the value for nonzero_count_cutoff1. The default is 9. |
nonzero_count_cutoff2 |
Required when the data type is set as "count". Variables with non-zero counts lower than or equal to this value will be filtered out. Users can use the function "theta_plot()" to help with specifying the value for nonzero_count_cutoff2. The default is 5. |
verbose |
A logical value indicating whether to print detailed messages. The default is TRUE. |
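Read together, the descriptions of theta_cutoff, nonzero_count_cutoff1 and nonzero_count_cutoff2 suggest the following rule for count data: a feature is dropped when its non-zero count is at or below nonzero_count_cutoff2, or when its theta is at or above theta_cutoff and its non-zero count is at or below nonzero_count_cutoff1. The base R sketch below only restates that reading. The helper is_filtered_out() and the example values are hypothetical and are not part of LongDat.

```r
# Hypothetical helper restating the documented count-mode filter rule.
# Illustration only; this is not LongDat's internal code.
is_filtered_out <- function(theta, nonzero_count,
                            theta_cutoff = 2^20,
                            nonzero_count_cutoff1 = 9,
                            nonzero_count_cutoff2 = 5) {
  # Rule 1: too few non-zero counts, regardless of theta
  too_sparse <- nonzero_count <= nonzero_count_cutoff2
  # Rule 2: very large theta combined with a low non-zero count
  high_theta_and_sparse <- theta >= theta_cutoff &
    nonzero_count <= nonzero_count_cutoff1
  too_sparse | high_theta_and_sparse
}

# Made-up theta values and non-zero counts for four features
theta <- c(1.5, 2^21, 2^21, 0.8)
nonzero_count <- c(30, 8, 12, 4)
filtered <- is_filtered_out(theta, nonzero_count)
```

Here the second feature fails the combined theta and non-zero count rule, and the fourth fails the stricter non-zero count rule on its own.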
Details
The brief workflow of longdat_cont() is as below:
When there are no potential covariates in the input data (covariates are anything apart from individual, test_var and dependent variables): First, the model test tests the significance of test_var on the dependent variables. A different generalized linear mixed-effects model is used for each type of dependent variable: a negative binomial mixed model for "count", a linear mixed model (dependent variables normalized first) for "measurement", a beta mixed model for "proportion", a binary logistic mixed model for "binary", and a proportional odds logistic mixed model for "ordinal". Then, the post-hoc test (Spearman's correlation test) is performed. When the data type is "count", a control model test is also run on randomized data (the rows are shuffled). If there are false positive signals in this control model test, users will get a warning at the end of the run.
When there are potential covariates in the input data: After the model test and post-hoc test described above, a covariate model test is added to the workflow. The potential covariates are added to the model one by one and tested for their significance on each dependent variable. The rest is the same as described above.
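The post-hoc step and the multiple testing correction can be illustrated with base R alone. The sketch below runs Spearman's correlation test between a time variable and each feature, adjusts the p values with p.adjust(), and labels the direction of the effect. The toy data and all object names are made up, the mixed-model test is skipped entirely, and this is not LongDat's internal implementation.

```r
set.seed(1)

# Toy data: 3 individuals measured at 10 time points each (values invented)
day <- rep(1:10, times = 3)
features <- data.frame(
  feat_up   = day + rnorm(30, sd = 0.5),  # rises with time
  feat_flat = rnorm(30)                   # no built-in time trend
)

# Spearman's correlation test per feature; ties in `day` trigger a
# warning about exact p values, which is suppressed here
posthoc <- lapply(features, function(y) {
  suppressWarnings(cor.test(day, y, method = "spearman"))
})

rho   <- vapply(posthoc, function(x) unname(x$estimate), numeric(1))
p_raw <- vapply(posthoc, function(x) x$p.value, numeric(1))

# Multiple testing correction, using the same default method name as adjustMethod
q <- p.adjust(p_raw, method = "fdr")

# Direction label, using 0.05 as the significance threshold (as in posthoc_q)
effect <- ifelse(q >= 0.05, "NS", ifelse(rho > 0, "increase", "decrease"))
```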
Also, when your data type is "count", please call set.seed() before running longdat_cont() so that the randomized negative control is reproducible.
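The seed matters because shuffling rows draws random numbers. A minimal sketch of the idea, assuming the shuffle is a random permutation of rows drawn with sample() (how LongDat shuffles internally is not stated here); the helper shuffle_rows() and the toy table are hypothetical.

```r
# Hypothetical helper: return the rows of a data frame in random order
shuffle_rows <- function(df) df[sample(nrow(df)), , drop = FALSE]

# Made-up toy table
toy <- data.frame(
  Individual = rep(1:3, each = 2),
  Day        = rep(1:2, times = 3),
  Count      = c(0, 4, 2, 7, 1, 3)
)

# Setting the same seed before each shuffle yields the same permutation
set.seed(100)
first_run <- shuffle_rows(toy)
set.seed(100)
second_run <- shuffle_rows(toy)

same_with_seed <- identical(first_run, second_run)
```

In practice this means calling set.seed() with a fixed number immediately before longdat_cont() whenever data_type = "count".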
Value
longdat_cont() returns a list which contains a "Result_table", and if there are covariates in the input data frame, there will be another table called "Covariate_table". For count mode, if there is any false positive in the randomized control result, then another table named "Randomized_control_table" will also be generated. The detailed description is as below.
Result_table
1. The first column: The dependent variables in the input data. This can be used as row name when being imported into R.
2. Prevalence_percentage: The percentage of each dependent variable present across individuals and time points
3. Mean_abundance: The mean value of each dependent variable across individuals and time points
4. Signal: The final decision of the significance of the test_var (independent variable) on each dependent variable.
NS: This represents "Non-significant", which means that there’s no effect of time.
OK_nc: This represents "OK and no covariate". There’s an effect of time and there’s no potential covariate.
OK_d: This represents "OK but doubtful". There’s an effect of time and there’s no potential covariate; however, the confidence interval of the test_var estimate in the model test covers zero, and thus this signal is doubtful.
OK_nrc: This represents "OK and not reducible to covariate". There are potential covariates, however there’s an effect of time and it is independent of those of covariates.
EC: This represents "Entangled with covariate". There are potential covariates, and it isn’t possible to conclude whether the effect results from time or from covariates.
RC: This represents "Effect reducible to covariate". There’s an effect of time, but it can be reduced to the covariate effects.
5. Effect: This column shows whether the value of each dependent variable decreases, increases or is NS (non-significant) over time. A positive correlation between time and the dependent variable value yields "increase", while a negative correlation yields "decrease". NS means no significant correlation.
6. 'EffectSize': This column reports the correlation coefficient (Spearman's rho) between each dependent variable value and time.
7. Null_time_model_q: This column shows the multiple-comparison-adjusted p values (Wald test) of the significance of test_var in the models.
8. Post-hoc_q: These are the multiple-comparison-adjusted p values from the post-hoc test (Spearman's correlation test) of the model.
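As a rough illustration of working with this table, the sketch below builds a mock object with the documented list structure and keeps the rows whose Signal is "OK_nc" or "OK_nrc". Every value is invented, the first column name "Feature" is a placeholder (the actual name is not given above), and which signals to keep is a judgment call rather than a recommendation from the package.

```r
# Mock object mimicking the documented return structure; all values invented
mock_result <- list(
  Result_table = data.frame(
    Feature    = c("feat_A", "feat_B", "feat_C"),  # placeholder column name
    Signal     = c("OK_nrc", "NS", "RC"),
    Effect     = c("increase", "NS", "decrease"),
    EffectSize = c(0.42, 0.03, -0.35),
    stringsAsFactors = FALSE
  )
)

# The tables are elements of the returned list
result_table <- mock_result$Result_table

# Keep features whose time effect is not attributed to a covariate
kept_signals <- c("OK_nc", "OK_nrc")
time_hits <- result_table[result_table$Signal %in% kept_signals, ]
```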
Covariate_table
The first column contains the dependent variables in the input data. This can be used as row name when being imported into R. After that, every 3 columns form a group: the Covariate column shows the covariate's name; the Covariate_type column shows how the effect is affected by the covariate; the Effect_size column shows the effect size of the dependent variable value between different values of the covariate. Because the number of covariates differs between dependent variables, there may be NAs in the table, and they can simply be ignored. If the covariate table is totally empty, no covariates were detected.
Randomized_control_table (for user's reference)
We assume that there shouldn't be positive results in the randomized control test, because all the rows in the original dataset are shuffled randomly. Therefore, any signal that shows significance here is regarded as a false positive. If there are false positives in this randomized control result, longdat_cont will warn the user at the end of the run. The Randomized_control_table is only generated when there are false positives in the randomized control test. It is intended as a reference for users to see the effect sizes of false positive features.
1. The first column "Model_q" shows the multiple-comparison-adjusted p values (Wald test) of the significance of test_var in the negative binomial models in the randomized dataset. Only the features with Model_q lower than the defined model_q (default = 0.1) will be listed in this table.
2. Signal: This column describes if test_var is significant on each dependent variable based on the post-hoc test p values (Spearman's correlation test). "False positive" indicates that test_var is significant, while "Negative" indicates non-significance.
3. 'Posthoc_q': This column describes the multiple-comparison-adjusted p values from the post-hoc test (Spearman's correlation test) of the model in the randomized control dataset.
4. Effect_size: This column describes the correlation coefficient (Spearman's rho) between each dependent variable value and time.
Normalize_method (for user's reference)
When data_type is either "measurement" or "others", this table shows the normalization method used for each feature. Please refer to "Using the bestNormalize Package" on the Internet for the details of each method. "NA" indicates that there are too few data points to interpolate, and thus no normalization was done.
Examples
test_cont <- suppressWarnings(longdat_cont(input = LongDat_cont_master_table,
data_type = "count", test_var = "Day",
variable_col = 7, fac_var = c(1, 3)))
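
The position arguments in the call above (variable_col and fac_var) depend on the column layout of the input. The sketch below uses a made-up table, not the packaged LongDat_cont_master_table, to show how those positions could be looked up by column name instead of counted by hand. All column names are hypothetical.

```r
# Made-up input following the documented layout: "Individual" first,
# dependent variables (features) in the last columns
toy_input <- data.frame(
  Individual = rep(c("p1", "p2"), each = 3),
  Day        = rep(c(0, 7, 14), times = 2),
  Sex        = rep(c("F", "M"), each = 3),
  DrugDose   = c(0, 5, 10, 0, 10, 20),
  Bacteria_A = c(3, 0, 8, 1, 2, 0),
  Bacteria_B = c(0, 0, 4, 6, 9, 12),
  stringsAsFactors = FALSE
)

# Position where the dependent variable columns start
variable_col <- which(names(toy_input) == "Bacteria_A")

# Positions of the non-numerical columns
fac_var <- which(names(toy_input) %in% c("Individual", "Sex"))
```

With this layout, variable_col is 5 and fac_var is c(1, 3). A table this small only illustrates the layout and is not large enough for a meaningful model fit.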