longdat_disc {LongDat} | R Documentation |
Longitudinal analysis with time as discrete variable
Description
longdat_disc calculates the p values and effect sizes, and discovers covariate effects, of time variables from longitudinal data.
Usage
longdat_disc(
input,
data_type,
test_var,
variable_col,
fac_var,
not_used = NULL,
adjustMethod = "fdr",
model_q = 0.1,
posthoc_q = 0.05,
theta_cutoff = 2^20,
nonzero_count_cutoff1 = 9,
nonzero_count_cutoff2 = 5,
verbose = TRUE
)
Arguments
input |
A data frame with the first column as "Individual" and all the columns of dependent variables (features, e.g. bacteria) at the end of the table. The time variable here should be discrete; if time is continuous, please apply longdat_cont() instead. Please use only printable ASCII characters in the names of potential covariates (covariates are any column apart from individual, test_var and dependent variables). |
data_type |
The data type of the dependent variables (features). Can be either "proportion", "measurement", "count", "binary", "ordinal" or "others". Proportion (or ratio) data range from 0 to 1. Measurement data are continuous and can be measured at finer and finer scale (e.g. weight). Count data consist of discrete non-negative integers resulting from counting. Binary data sort things into one of two mutually exclusive categories. Ordinal data consist of ranks. Any data that don't belong to the previous categories should be classified as "others". |
test_var |
The name of the independent variable you are testing for. It should be a string (e.g. "Time") identical to its column name, and it must not contain spaces. |
variable_col |
The column number of the position where the dependent variable columns (features, e.g. bacteria) start in the table. |
fac_var |
The column numbers of the columns that aren't numerical (e.g. characters, categorical numbers, ordinal numbers). This should be a numerical vector (e.g. c(1, 2, 5:7)). |
not_used |
The column positions of the columns that are irrelevant and can be ignored in the analysis. This should be a numerical vector, and the default is NULL. |
adjustMethod |
Multiple testing p value correction. Choices are the ones in p.adjust(), including 'holm', 'hochberg', 'hommel', 'bonferroni', 'BH', 'BY' and 'fdr'. The default is 'fdr'. |
model_q |
The threshold for significance of model test after multiple testing correction. The default is 0.1. |
posthoc_q |
The threshold for significance of post-hoc test of the model after multiple testing correction. The default is 0.05. |
theta_cutoff |
Required when the data type is set as "count". Variables with a theta value from negative binomial regression larger than or equal to the cutoff will be filtered out if they also don't meet the non-zero count threshold. Users can use the function "theta_plot()" to help with specifying the value for theta_cutoff. The default is 2^20. |
nonzero_count_cutoff1 |
Required when the data type is set as "count". Variables with non-zero counts lower than or equal to this value will be filtered out if they don't meet the theta threshold either. Users can use the function "theta_plot()" to help with specifying the value for nonzero_count_cutoff1. The default is 9. |
nonzero_count_cutoff2 |
Required when the data type is set as "count". Variables with non-zero counts lower than or equal to this value will be filtered out. Users can use the function "theta_plot()" to help with specifying the value for nonzero_count_cutoff2. The default is 5. |
verbose |
A boolean vector indicating whether to print detailed messages. The default is TRUE. |
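The way the three count-mode cutoffs combine can be sketched as follows. This is one reading of the argument descriptions above, not the package's internal code; the toy data frame and the names `features`, `drop1`, `drop2` and `kept` are hypothetical.

```r
# Hypothetical per-feature summaries (illustration only, not LongDat output)
features <- data.frame(
  feature = c("A", "B", "C", "D"),
  theta = c(1.5, 2^21, 2^21, 3.0),
  nonzero_count = c(30, 8, 20, 4),
  stringsAsFactors = FALSE
)

# Defaults taken from the Usage section
theta_cutoff <- 2^20
nonzero_count_cutoff1 <- 9
nonzero_count_cutoff2 <- 5

# Rule 1: theta at or above the cutoff AND few non-zero counts
drop1 <- features$theta >= theta_cutoff &
  features$nonzero_count <= nonzero_count_cutoff1

# Rule 2: very few non-zero counts, regardless of theta
drop2 <- features$nonzero_count <= nonzero_count_cutoff2

# Features that survive both rules
kept <- features$feature[!(drop1 | drop2)]
```

Under this reading, a feature with a very large theta is still kept when it has enough non-zero counts, while a feature with very few non-zero counts is dropped whatever its theta.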
Details
The brief workflow of longdat_disc() is as below:
When there are no potential covariates in the input data (covariates are anything apart from individual, test_var and dependent variables): First, the model test tests the significance of test_var on dependent variables. Different generalized linear mixed effect models are implemented for different types of dependent variable: a negative binomial mixed model for "count", a linear mixed model (dependent variables normalized first) for "measurement", a beta mixed model for "proportion", a binary logistic mixed model for "binary", and a proportional odds logistic mixed model for "ordinal". Then, a post-hoc test ('emmeans') on the model is done. When the data type is "count", a control model test will be run on randomized data (the rows are shuffled). If there are false positive signals in this control model test, then an additional Wilcoxon post-hoc test will be done because it is more conservative.
When there are potential covariates in the input data: After the model test and post-hoc test described above, a covariate model test will be added to the workflow. The potential covariates will be added to the model one by one and tested for their significance on each dependent variable. The rest is the same as described above.
Also, when your data type is count data, please call set.seed() before running longdat_disc() so that the randomized negative control is reproducible.
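The reason for calling set.seed() can be illustrated with a minimal base R sketch. The toy data frame and the helper `shuffle_counts()` are hypothetical, and the exact way longdat_disc() shuffles rows internally is not shown here; the sketch only demonstrates that a random shuffle becomes repeatable once the seed is fixed.

```r
# Hypothetical toy data: 3 individuals x 2 time points (illustration only)
toy <- data.frame(
  Individual = rep(c("i1", "i2", "i3"), each = 2),
  Time_point = rep(c("t1", "t2"), times = 3),
  count = c(5, 12, 0, 7, 3, 9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: permute the feature values across rows,
# which breaks any real link between the counts and the time points
shuffle_counts <- function(df) {
  df$count <- sample(df$count)
  df
}

# The same seed gives the same shuffle on every run
set.seed(100)
run1 <- shuffle_counts(toy)
set.seed(100)
run2 <- shuffle_counts(toy)

same_shuffle <- identical(run1, run2)
```

Without the set.seed() calls, `run1` and `run2` would generally differ, and so could the set of features flagged as false positives in the randomized control.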
Value
longdat_disc() returns a list which contains a "Result_table", and if there are covariates in the input data frame, there will be another table called "Covariate_table". For count mode, if there is any false positive in the randomized control result, then another table named "Randomized_control_table" will also be generated. The detailed description is as below.
Result_table
1. The first column: The dependent variables in the input data. This can be used as row name when being imported into R.
2. Prevalence_percentage: The percentage of each dependent variable present across individuals and time points.
3. Mean_abundance: The mean value of each dependent variable across individuals and time points.
4. Signal: The final decision of the significance of the test_var (independent variable) on each dependent variable.
NS: This represents "Non-significant", which means that there’s no effect of time.
OK_nc: This represents "OK and no covariate". There’s an effect of time and there’s no potential covariate.
OK_d: This represents "OK but doubtful". There’s an effect of time and there’s no potential covariate; however, the confidence interval of the test_var estimate in the model test covers zero, and thus this signal is doubtful.
OK_nrc: This represents "OK and not reducible to covariate". There are potential covariates; however, there’s an effect of time and it is independent of those of the covariates.
EC: This represents "Entangled with covariate". There are potential covariates, and it isn’t possible to conclude whether the effect results from time or from the covariates.
RC: This represents "Effect reducible to covariate". There’s an effect of time, but it can be reduced to the covariate effects.
5. 'Effect_a_b': The "a" and "b" here are the names of the time points. These columns describe whether the value of each dependent variable decreases, increases or is NS (non-significant) at time point b compared with time point a. The number of Effect columns depends on how many combinations of time points there are in the input data.
6. 'EffectSize_a_b': The "a" and "b" here are the names of the time points. These columns describe the effect size (Cliff's delta) of each dependent variable between time point b and a. The number of 'EffectSize' columns depends on how many combinations of time points there are in the input data.
7. 'Null_time_model_q': This column shows the multiple-comparison-adjusted p values (Wald test) of the significance of test_var in the models.
8. 'Post-hoc_q_a_b': The "a" and "b" here are the names of the time points. These are the multiple-comparison-adjusted p values from the post-hoc test of the model. The number of Post-hoc_q columns depends on how many combinations of time points there are in the input data.
9. 'Wilcox_p_a_b': The "a" and "b" here are the names of the time points. These columns only appear when the data type is "count" and there are false positives in the model test on randomized data. The Wilcoxon test is more conservative than the default post-hoc test ('emmeans'), and thus it is a good reference for getting a more conservative result of the significant outcomes.
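For readers unfamiliar with Cliff's delta, the effect size reported in the 'EffectSize_a_b' columns, the standard definition can be sketched in base R as below. The function `cliffs_delta()` and the example values are hypothetical and written for illustration; the sign convention and any handling of paired samples inside the package are not confirmed by this page.

```r
# Cliff's delta: (#pairs with x > y - #pairs with x < y) / (n_x * n_y)
# Hypothetical helper written for illustration, not a LongDat function
cliffs_delta <- function(x, y) {
  diffs <- outer(x, y, "-")  # all pairwise differences x_i - y_j
  (sum(diffs > 0) - sum(diffs < 0)) / (length(x) * length(y))
}

# Made-up values of one feature at two time points
time_a <- c(1, 2, 3, 4)
time_b <- c(3, 4, 5, 6)

# Positive when values at time point b tend to be larger than at a
delta_b_vs_a <- cliffs_delta(time_b, time_a)  # 0.75
```

The statistic ranges from -1 to 1, and swapping the two arguments flips its sign.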
Covariate_table
The first column contains the dependent variables in the input data. This can be used as row name when being imported into R. After that, every 3 columns form a group: the Covariate column shows the covariate's name; the Covariate_type column shows how the effect is affected by the covariate; the Effect_size column shows the effect size of the dependent variable value between different values of the covariate. Because the number of covariates differs between dependent variables, there may be NAs in the table, and they can simply be ignored. If the covariate table is totally empty, this means that no covariates were detected.
Randomized_control_table (for user's reference)
We assume that there shouldn't be positive results in the randomized control test, because all the rows in the original dataset are shuffled randomly. Therefore, any signal that shows significance here will be regarded as a false positive. If there is a false positive in this randomized control result, longdat_disc() will warn the user at the end of the run. This Randomized_control_table is only generated when there is a false positive in the randomized control test. It is intended to be a reference for users to see the effect size of false positive features.
1. "Model_q": It shows the multiple-comparison-adjusted p values (Wald test) of the significance of test_var in the negative-binomial models in the randomized dataset. Only the features with Model_q lower than the defined model_q (default = 0.1) will be listed in this table.
2. Final_signal: It shows whether the overall signal is false positive or negative. "False positive" indicates that test_var is significant, while "Negative" indicates non-significance.
3. 'Signal_a_b': The "a" and "b" here are the names of the time points. These columns describe whether test_var is significant on each dependent variable between each pair of time points, based on the post-hoc test p values (listed to the right of Signal_a_b). "False positive" indicates that test_var is significant, while "Negative" indicates non-significance. The number of Signal_a_b columns depends on how many combinations of time points there are in the input data.
4. 'Posthoc_q_a_b': The "a" and "b" here are the names of the time points. These columns describe the multiple-comparison-adjusted p values from the post-hoc test of the model between time point b and a in the randomized control dataset. The number of 'Posthoc_q_a_b' columns depends on how many combinations of time points there are in the input data.
5. 'Effect_size_a_b': The "a" and "b" here are the names of the time points. These columns describe the effect size (Cliff's delta) of each dependent variable between time point b and a in the randomized control dataset. The number of Effect_size_a_b columns depends on how many combinations of time points there are in the input data.
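The relation between adjusted p values and a threshold such as model_q can be sketched with base R's p.adjust(). The raw p values below are made up, and the variable names are hypothetical; this is an illustration of the correction step, not the package's internal code.

```r
# Hypothetical raw p values for five features (illustration only)
raw_p <- c(0.001, 0.02, 0.04, 0.30, 0.75)

# In p.adjust(), "fdr" is an alias for "BH" (Benjamini-Hochberg)
q_fdr <- p.adjust(raw_p, method = "fdr")
q_bh <- p.adjust(raw_p, method = "BH")

# Features whose adjusted p value is below a model_q-style threshold
model_q <- 0.1
passes <- q_fdr < model_q
```

With these made-up values, the first three features fall below the 0.1 threshold after correction and the last two do not.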
Normalize_method (for user's reference)
When data_type is either "measurement" or "others", this table shows the normalization method used for each feature. Please refer to "Using the bestNormalize Package" on the Internet for the details of each method. "NA" indicates that there are too few data points to interpolate, and thus no normalization was done.
Examples
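library(LongDat)
# Count data: set a seed so the randomized control is reproducible
# (the seed value itself is arbitrary)
set.seed(100)
test_disc <- longdat_disc(input = LongDat_disc_master_table,
data_type = "count", test_var = "Time_point",
variable_col = 7, fac_var = c(1:3))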