visstat {visStatistics}    R Documentation

Visualization of statistical hypothesis testing based on decision tree

Description

visstat() visualizes the statistical hypothesis test between the dependent variable (or response) varsample and the independent variable varfactor. varfactor can have more than two levels. visstat() runs a decision tree that selects the statistical hypothesis test with the highest statistical power among those whose assumptions are fulfilled. For each test, visstat() returns a graph displaying the data with the main test statistics in the title, and a list with the complete test statistics, including any post-hoc analysis. The automated workflow is especially suited for browser-based interfaces to server-based deployments of R.

Implemented tests: lm(), t.test(), wilcox.test(), aov(), kruskal.test(), fisher.test(), chisq.test().

Implemented tests for normal distribution of standardized residuals: shapiro.test() and ad.test().

Implemented post-hoc tests: TukeyHSD() for aov() and pairwise.wilcox.test() for kruskal.test().

Usage

visstat(
  dataframe,
  varsample,
  varfactor,
  conf.level = 0.95,
  numbers = TRUE,
  minpercent = 0.05,
  graphicsoutput = NULL,
  plotName = NULL,
  plotDirectory = getwd()
)

Arguments

dataframe

data.frame containing at least two columns. Data must be arranged column-wise, with one variable per column. Contingency tables can be transformed to this column-wise structure with the helper function counts_to_cases(as.data.frame()).

varsample

column name of dependent variable in dataframe, datatype character.

varfactor

column name of independent variable in dataframe, datatype character.

conf.level

confidence level of the interval.

numbers

a logical indicating whether to show numbers in mosaic count plots.

minpercent

number between 0 and 1 indicating minimal fraction of total count data of a category to be displayed in mosaic count plots.

graphicsoutput

saves plot(s) of type "png", "jpg", "tiff" or "bmp" in directory specified in plotDirectory. If graphicsoutput=NULL, no plots are saved.

plotName

graphical output is stored following the naming convention "plotName.graphicsoutput" in plotDirectory. Without specifying this parameter, plotName is automatically generated following the convention "statisticalTestName_varsample_varfactor".

plotDirectory

directory in which generated plots are stored. Default is the current working directory.
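
The column-wise layout expected for dataframe and the default file naming can be illustrated with a short base R sketch. Both helpers below, expand_counts() and default_plot_path(), are hypothetical and are not part of visStatistics; they only mimic what the argument descriptions state. Replacing dots with underscores in the default file name is an inference from the file names used in the Examples, not documented behaviour.

```r
# Hypothetical helper (not the package's counts_to_cases()):
# turn a long-format frequency table into one row per observed case.
expand_counts <- function(x, countcol = "Freq") {
  idx <- rep.int(seq_len(nrow(x)), x[[countcol]])   # repeat row i Freq[i] times
  out <- x[idx, setdiff(names(x), countcol), drop = FALSE]
  rownames(out) <- NULL
  out
}

# Hypothetical helper: build the output path from the documented convention
# "plotName.graphicsoutput" inside plotDirectory.
default_plot_path <- function(testname, varsample, varfactor, graphicsoutput,
                              plotName = NULL, plotDirectory = getwd()) {
  if (is.null(plotName)) {
    plotName <- paste(testname, varsample, varfactor, sep = "_")
    # Assumption inferred from the Examples: "." in names becomes "_"
    plotName <- gsub(".", "_", plotName, fixed = TRUE)
  }
  file.path(plotDirectory, paste0(plotName, ".", graphicsoutput))
}

# Contingency table -> long format with a Freq column -> one row per case
male <- as.data.frame(as.table(HairEyeColor[, , 1]))
cases <- expand_counts(male)
head(cases)

default_plot_path("kruskal", "Petal.Width", "Species", "pdf",
                  plotDirectory = tempdir())
```

The resulting data frame has one factor column per table dimension, which is the form in which varsample and varfactor can be addressed by column name.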

Details

For the comparison of averages, the following algorithm is implemented: If the p-value of shapiro.test() or ks.test() applied to the standardized residuals is smaller than 1-conf.level, kruskal.test() (more than two groups) or wilcox.test() (two groups) is performed; otherwise oneway.test() and aov() (more than two groups) or t.test() (two groups) are performed and displayed. Exception: If the sample size is bigger than 100, wilcox.test() is never executed; instead, t.test() is always performed (Lumley et al. (2002) <doi:10.1146/annurev.publheath.23.100901.140546>).

For the test of independence of count data, Cochran's rule (Cochran (1954) <doi:10.2307/3001666>) is implemented: If more than 20 percent of all cells have a count smaller than 5, fisher.test() is performed and displayed, otherwise chisq.test(). In both cases, an additional mosaic plot showing Pearson's residuals is generated.
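
The decision rules described above can be sketched in base R. This is a simplified illustration, not the package's implementation: choose_mean_test() and choose_count_test() are hypothetical names, only shapiro.test() is used as the normality check, "sample size" is assumed to mean the total number of observations, and Cochran's rule is applied to the expected cell counts under independence (an assumption about how "count" is to be read).

```r
# Hypothetical helper: name the test the documented rule would select
# for a numeric response y and a grouping variable g.
choose_mean_test <- function(y, g, conf.level = 0.95) {
  g <- factor(g)
  alpha <- 1 - conf.level
  two_groups <- nlevels(g) == 2
  # Exception from the text: large two-group samples always use t.test().
  # Assumption: "sample size" is the total number of observations.
  if (two_groups && length(y) > 100) return("t.test")
  # Normality check on the standardized residuals (shapiro.test() only)
  res <- rstandard(aov(y ~ g))
  normal <- shapiro.test(res)$p.value >= alpha
  if (two_groups) {
    if (normal) "t.test" else "wilcox.test"
  } else {
    if (normal) "aov" else "kruskal.test"
  }
}

# Hypothetical helper: Cochran's rule on a two-way table of counts.
choose_count_test <- function(counts) {
  counts <- as.matrix(counts)
  # Assumption: the rule is evaluated on expected counts under independence
  expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
  if (mean(expected < 5) > 0.20) "fisher.test" else "chisq.test"
}

choose_mean_test(iris$Petal.Width, iris$Species)
choose_count_test(HairEyeColor[, , 1])
choose_count_test(HairEyeColor[1:2, 3:4, 1])
```

Returning the name of the test rather than running it keeps the sketch focused on the branching logic; the actual function additionally performs the test, the post-hoc analysis and the plotting.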

Value

list containing the statistics of the test with the highest statistical power that meets the assumptions. All values are returned invisibly. They can be accessed by assigning the return value of visstat() to a variable.
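
What an invisible return value means in practice can be shown with a small stand-alone sketch. The function summarise_quietly() is hypothetical and unrelated to the package; it only demonstrates the base R mechanism invisible().

```r
# Hypothetical function returning a list invisibly, as visstat() is
# documented to do with its test statistics.
summarise_quietly <- function(x) {
  stats <- list(mean = mean(x), sd = sd(x))
  invisible(stats)
}

summarise_quietly(c(1, 2, 3))         # nothing is printed at top level
res <- summarise_quietly(c(1, 2, 3))  # assign to keep the result
res$mean                              # access a single component
```

The same pattern applies to the list returned by visstat(): assign it to a variable, then print it or extract components by name.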

Examples


## Kruskal-Wallis rank sum test (calling kruskal.test())
visstat(iris,"Petal.Width", "Species")
visstat(InsectSprays,"count","spray")

## ANOVA (calling aov()) and One-way analysis of means (oneway.test())
anova_npk=visstat(npk,"yield","block")
anova_npk #prints summary of tests

## Welch Two Sample t-test (calling t.test())
visstat(mtcars,"mpg","am") 

## Wilcoxon rank sum test (calling wilcox.test())
grades_gender <- data.frame(
 Sex = as.factor(c(rep("Girl", 20), rep("Boy", 20))),
 Grade = c(19.25, 18.1, 15.2, 18.34, 7.99, 6.23, 19.44, 
           20.33, 9.33, 11.3, 18.2,17.5,10.22,20.33,13.3,17.2,15.1,16.2,17.3,
           16.5, 5.1, 15.25, 17.41, 14.5, 15, 14.3, 7.53, 15.23, 6,17.33, 
           7.25, 14,13.5,8,19.5,13.4,17.5,17.4,16.5,15.6))
visstat(grades_gender,"Grade", "Sex")

## Pearson's Chi-squared test and mosaic plot with Pearson residuals
visstat(counts_to_cases(as.data.frame(HairEyeColor[,,1])),"Hair","Eye")
## 2x2 contingency tables with Fisher's exact test and mosaic plot with Pearson residuals
HairEyeColorMaleFisher = HairEyeColor[,,1]
## slicing out a 2x2 contingency table
blackBrownHazelGreen = HairEyeColorMaleFisher[1:2,3:4]
blackBrownHazelGreen = counts_to_cases(as.data.frame(blackBrownHazelGreen))
fisher_stats=visstat(blackBrownHazelGreen,"Hair","Eye")
fisher_stats #print out summary statistics

## Linear regression
visstat(trees,"Girth","Height")

## Saving the graphical output in directory plotDirectory
## A) saving graphical output of type "png" in temporary directory tempdir() 
##    with default naming convention:
visstat(blackBrownHazelGreen,"Hair","Eye",graphicsoutput = "png",plotDirectory=tempdir()) 
##remove graphical output from plotDirectory
file.remove(file.path(tempdir(),"chi_squared_or_fisher_Hair_Eye.png"))
file.remove(file.path(tempdir(),"mosaic_complete_Hair_Eye.png"))
## B) Specifying pdf as output type: 
visstat(iris,"Petal.Width", "Species",graphicsoutput = "pdf",plotDirectory=tempdir())
##remove graphical output from plotDirectory
file.remove(file.path(tempdir(),"kruskal_Petal_Width_Species.pdf"))
## C) Specifying plotName overrides the default naming convention
visstat(iris,"Petal.Width","Species",graphicsoutput = "pdf",
        plotName="kruskal_iris",plotDirectory=tempdir())
##remove graphical output from plotDirectory
file.remove(file.path(tempdir(),"kruskal_iris.pdf"))

[Package visStatistics version 0.1.1 Index]