R: Univariate analysis of features

univariateRankVariables {FRESA.CAD}

R Documentation

Univariate analysis of features

Description

This function reports the mean and standard deviation of each feature in a model, and ranks the features according to a user-specified score. It also performs a Kolmogorov-Smirnov (KS) test on the raw and z-standardized data, and reports the raw and z-standardized t-test score, the p-value of the Wilcoxon rank-sum test, the integrated discrimination improvement (IDI), the net reclassification improvement (NRI), the net residual improvement (NeRI), and the area under the ROC curve (AUC). Finally, it reports the z-value of the significance of each variable in the fitted model.

Usage

	univariateRankVariables(variableList,
	                        formula,
	                        Outcome,
	                        data, 
	                        categorizationType = c("Raw",
	                                               "Categorical",
	                                               "ZCategorical",
	                                               "RawZCategorical",
	                                               "RawTail",
	                                               "RawZTail",
	                                               "Tail",
	                                               "RawRaw"), 
	                        type = c("LOGIT", "LM", "COX"), 
	                        rankingTest = c("zIDI",
	                                        "zNRI",
	                                        "IDI",
	                                        "NRI",
	                                        "NeRI",
	                                        "Ztest",
	                                        "AUC",
	                                        "CStat",
	                                        "Kendall"), 
	                        cateGroups = c(0.1, 0.9),
	                        raw.dataFrame = NULL,
	                        description = ".",
	                        uniType = c("Binary","Regression"),
	                        FullAnalysis=TRUE,
	                        acovariates = NULL,
	                        timeOutcome = NULL
)

Arguments

`variableList`	A data frame with the candidate variables to be ranked
`formula`	An object of class `formula` with the formula to be fitted
`Outcome`	The name of the column in `data` that stores the variable to be predicted by the model
`data`	A data frame where all variables are stored in different columns
`categorizationType`	How variables will be analyzed: As given in `data` ("Raw"); broken into the p-value categories given by `cateGroups` ("Categorical"); broken into the p-value categories given by `cateGroups`, and weighted by the z-score ("ZCategorical"); broken into the p-value categories given by `cateGroups`, weighted by the z-score, plus the raw values ("RawZCategorical"); raw values, plus the tails ("RawTail"); or raw values, weighted by the z-score, plus the tails ("RawZTail")
`type`	Fit type: Logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")
`rankingTest`	Variables will be ranked based on: The z-score of the IDI ("zIDI"), the z-score of the NRI ("zNRI"), the IDI ("IDI"), the NRI ("NRI"), the NeRI ("NeRI"), the z-score of the model fit ("Ztest"), the AUC ("AUC"), Somers' rank correlation ("CStat"), or the Kendall rank correlation ("Kendall")
`cateGroups`	A vector of percentiles to be used for the categorization procedure
`raw.dataFrame`	A data frame similar to `data`, but with unadjusted data, used to get the means and variances of the unadjusted data
`description`	The name of the column in `variableList` that stores the variable description
`uniType`	Type of univariate analysis: Binary classification ("Binary") or regression ("Regression")
`FullAnalysis`	If FALSE, the features are only ordered according to their z-statistics from the linear model
`acovariates`	The list of covariates
`timeOutcome`	The name of the time-to-event feature

Details

This function creates valid dummy categorical variables if, and only if, `data` has been z-standardized. The p-values provided in `cateGroups` are converted to their corresponding z-scores, which are then used to create the categories. If the data are not z-standardized, the categorization analysis returns incorrect results.
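
As a rough illustration of the conversion described above, the following base R sketch turns the default `cateGroups` values into z-score thresholds with `qnorm` and builds low-tail and high-tail dummy variables on z-standardized data. The variable names and the exact thresholding rule are assumptions made for illustration; this is not the package's internal implementation.

```r
# Hypothetical illustration; not FRESA.CAD's internal code.
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)   # a raw feature (simulated)
z <- as.numeric(scale(x))             # z-standardized copy: mean 0, sd 1

cateGroups  <- c(0.1, 0.9)            # default values shown in Usage
zThresholds <- qnorm(cateGroups)      # corresponding z-scores

# Assumed dummy coding: 1 if the subject falls in the tail, 0 otherwise
lowTail  <- as.integer(z < zThresholds[1])
highTail <- as.integer(z > zThresholds[2])

# Fraction of subjects flagged in each tail of the z-standardized feature
mean(lowTail)
mean(highTail)

# Applied to the raw feature, the same thresholds flag no low-tail subjects,
# which is why non-standardized data gives misleading categories
sum(x < zThresholds[1])
```

The last line shows the failure mode: the thresholds are on the z scale, so comparing them against raw values on a different scale produces empty or saturated categories.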

Value

A sorted data frame. In the case of a binary classification analysis, the data frame will have the following columns:

`Name`	Name of the raw variable or of the dummy variable if the data has been categorized
`parent`	Name of the raw variable from which the dummy variable was created
`descrip`	Description of the parent variable, as defined in `description`
`cohortMean`	Mean value of the variable
`cohortStd`	Standard deviation of the variable
`cohortKSD`	D statistic of the KS test when comparing a normal distribution and the distribution of the variable
`cohortKSP`	Associated p-value to the `cohortKSD`
`caseMean`	Mean value of cases (subjects with `Outcome` equal to 1)
`caseStd`	Standard deviation of cases
`caseKSD`	D statistic of the KS test when comparing a normal distribution and the distribution of the variable only for cases
`caseKSP`	Associated p-value to the `caseKSD`
`caseZKSD`	D statistic of the KS test when comparing a normal distribution and the distribution of the z-standardized variable only for cases
`caseZKSP`	Associated p-value to the `caseZKSD`
`controlMean`	Mean value of controls (subjects with `Outcome` equal to 0)
`controlStd`	Standard deviation of controls
`controlKSD`	D statistic of the KS test when comparing a normal distribution and the distribution of the variable only for controls
`controlKSP`	Associated p-value to the `controlKSD`
`controlZKSD`	D statistic of the KS test when comparing a normal distribution and the distribution of the z-standardized variable only for controls
`controlZKSP`	Associated p-value to the `controlZKSD`
`t.Rawvalue`	Normal inverse p-value (z-value) of the t-test performed on `raw.dataFrame`
`t.Zvalue`	z-value of the t-test performed on `data`
`wilcox.Zvalue`	z-value of the Wilcoxon rank-sum test performed on `data`
`ZGLM`	z-value returned by the `lm`, `glm`, or `coxph` functions for the z-standardized variable
`zNRI`	z-value returned by the `improveProb` function (`Hmisc` package) when evaluating the NRI
`zIDI`	z-value returned by the `improveProb` function (`Hmisc` package) when evaluating the IDI
`zNeRI`	z-value returned by the `improvedResiduals` function when evaluating the NeRI
`ROCAUC`	Area under the ROC curve returned by the `roc` function (`pROC` package)
`cStatCorr`	c index of Somers' rank correlation returned by the `rcorr.cens` function (`Hmisc` package)
`NRI`	NRI returned by the `improveProb` function (`Hmisc` package)
`IDI`	IDI returned by the `improveProb` function (`Hmisc` package)
`NeRI`	NeRI returned by the `improvedResiduals` function
`kendall.r`	Kendall τ rank correlation coefficient between the variable and the binary outcome
`kendall.p`	Associated p-value to the `kendall.r`
`TstudentRes.p`	p-value of the improvement in residuals, as evaluated by the paired t-test
`WilcoxRes.p`	p-value of the improvement in residuals, as evaluated by the paired Wilcoxon signed-rank test
`FRes.p`	p-value of the improvement in residual variance, as evaluated by the F-test
`caseN_Z_Low_Tail`	Number of cases in the low tail
`caseN_Z_Hi_Tail`	Number of cases in the top tail
`controlN_Z_Low_Tail`	Number of controls in the low tail
`controlN_Z_Hi_Tail`	Number of controls in the top tail
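
To make several of the columns above more concrete, the sketch below computes rough base R analogues on simulated data: the case mean and standard deviation, a KS test against a normal distribution, t-test and Wilcoxon rank-sum z-values, a rank-based AUC, and the Kendall correlation. The simulated data, the helper `pToZ`, and the two-sided p-value to z-value conversion are assumptions for illustration only; the package may compute these quantities differently.

```r
# Hypothetical illustration with simulated data; not FRESA.CAD's internal code.
set.seed(42)
outcome <- rep(c(0, 1), each = 100)                  # 0 = control, 1 = case
x <- rnorm(200, mean = ifelse(outcome == 1, 1, 0))   # feature shifted in cases

cases    <- x[outcome == 1]
controls <- x[outcome == 0]

# Analogue of caseMean / caseStd
caseMean <- mean(cases)
caseStd  <- sd(cases)

# Analogue of caseKSD / caseKSP: KS test against a normal distribution
ksCase  <- ks.test(cases, "pnorm", mean = caseMean, sd = caseStd)
caseKSD <- unname(ksCase$statistic)
caseKSP <- ksCase$p.value

# t-test and Wilcoxon rank-sum test between cases and controls
tP <- t.test(cases, controls)$p.value
wP <- wilcox.test(cases, controls)$p.value

# Assumed conversion of a two-sided p-value into a z-value
pToZ <- function(p) qnorm(p / 2, lower.tail = FALSE)
tZ <- pToZ(tP)
wZ <- pToZ(wP)

# Rank-based AUC: Mann-Whitney U statistic divided by n1 * n0
r  <- rank(x)
n1 <- length(cases)
n0 <- length(controls)
auc <- (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Kendall rank correlation between the feature and the binary outcome
kendall <- cor.test(x, outcome, method = "kendall")
```

The rank-based AUC is used here only to keep the sketch free of package dependencies; the `ROCAUC` column itself is documented as coming from the `roc` function of the `pROC` package.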

In the case of regression analysis, the data frame will have the following columns:

`Name`	Name of the raw variable or of the dummy variable if the data has been categorized
`parent`	Name of the raw variable from which the dummy variable was created
`descrip`	Description of the parent variable, as defined in `description`
`cohortMean`	Mean value of the variable
`cohortStd`	Standard deviation of the variable
`cohortKSD`	D statistic of the KS test when comparing a normal distribution and the distribution of the variable
`cohortKSP`	Associated p-value to the `cohortKSD`
`cohortZKSD`	D statistic of the KS test when comparing a normal distribution and the distribution of the z-standardized variable
`cohortZKSP`	Associated p-value to the `cohortZKSD`
`ZGLM`	z-value returned by the glm or Cox procedure for the z-standardized variable
`zNRI`	z-value returned by the `improveProb` function (`Hmisc` package) when evaluating the NRI
`NeRI`	NeRI returned by the `improvedResiduals` function
`cStatCorr`	c index of Somers' rank correlation returned by the `rcorr.cens` function (`Hmisc` package)
`spearman.r`	Spearman ρ rank correlation coefficient between the variable and the outcome
`pearson.r`	Pearson r product-moment correlation coefficient between the variable and the outcome
`kendall.r`	Kendall τ rank correlation coefficient between the variable and the outcome
`kendall.p`	Associated p-value to the `kendall.r`
`TstudentRes.p`	p-value of the improvement in residuals, as evaluated by the paired t-test
`WilcoxRes.p`	p-value of the improvement in residuals, as evaluated by the paired Wilcoxon signed-rank test
`FRes.p`	p-value of the improvement in residual variance, as evaluated by the F-test
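
The correlation and residual-improvement columns above can be approximated with base R as in the following sketch. The simulated data, the intercept-only baseline model, the use of absolute residuals, and the one-sided alternatives are all assumptions chosen for illustration; they are not taken from the package's implementation.

```r
# Hypothetical illustration with simulated data; not FRESA.CAD's internal code.
set.seed(7)
n <- 150
x <- rnorm(n)
y <- 2 + 0.8 * x + rnorm(n)           # continuous outcome related to x

# Correlation coefficients between the variable and the outcome
pearsonR  <- cor(x, y, method = "pearson")
spearmanR <- cor(x, y, method = "spearman")
kendallT  <- cor.test(x, y, method = "kendall")   # estimate and p-value

# Residuals of a baseline model and of a model that adds x
baseFit <- lm(y ~ 1)
fullFit <- lm(y ~ x)
baseRes <- abs(residuals(baseFit))
fullRes <- abs(residuals(fullFit))

# Paired tests of whether adding x reduces the absolute residuals
tResP <- t.test(baseRes, fullRes, paired = TRUE,
                alternative = "greater")$p.value
wResP <- wilcox.test(baseRes, fullRes, paired = TRUE,
                     alternative = "greater")$p.value

# F-test of whether adding x reduces the residual variance
fResP <- var.test(residuals(baseFit), residuals(fullFit),
                  alternative = "greater")$p.value
```

Small values of `tResP`, `wResP`, and `fResP` in this sketch indicate that the model including `x` leaves smaller residuals than the baseline.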

Author(s)

Jose G. Tamez-Pena

References

Pencina, M. J., D'Agostino, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine, 27(2), 157-172.

[Package FRESA.CAD version 3.4.8 Index]