Rita {Rita}    R Documentation

Rita

Description

R Exploratory Data Analysis (REDA; pronounced "rita") summarizes an input dataset by its mean, standard deviation, five-number summary, and third and fourth moments, and visualizes the data either according to a built-in algorithm or as specified by the user. In addition, Rita reports the results of one or several normality tests. Lastly, Rita normalizes the dataset with several methods and provides visualizations of the best performing method.
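The summary statistics named above can be reproduced in base R. The following is a minimal sketch, not Rita's internal code: the helper name describe_sketch is hypothetical, and the third and fourth moments are computed here as standardized central moments with a divide-by-n convention, which may differ from the estimator Rita uses.

```r
# Hypothetical helper: mean, SD, five-number summary, and standardized
# third and fourth moments of a numeric vector.
describe_sketch <- function(x) {
  x <- x[!is.na(x)]            # drop missing values
  d <- x - mean(x)             # deviations from the mean
  m2 <- mean(d^2)              # second central moment (divide by n)
  five <- fivenum(x)           # Tukey's five-number summary
  names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
  c(
    mean = mean(x),
    sd = sd(x),                # sample SD (divide by n - 1)
    five,
    skewness = mean(d^3) / m2^1.5,
    kurtosis = mean(d^4) / m2^2
  )
}

res <- describe_sketch(c(1, 2, 3, 4, 5))
print(res)
```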

Usage

Rita(
  data,
  test = 1,
  xform = 1,
  alpha = 0.05,
  j = 1,
  autoPlot = T,
  histPlot = F,
  densPlot = F,
  stripPlot = F,
  violinPlot = F,
  xformPlot = F,
  return = T,
  seed = 10
)

Arguments

data

Input dataset (matrix, data frame, or vector). For a univariate distribution, submit a vector or a single column subset from a matrix or data frame. To obtain results for many univariate distributions at once, submit a matrix or data frame with one variable per column, provided all variables have the same sample size. Otherwise, call Rita separately for each variable.

test

Desired normality test (scalar). By default (test = 1), Rita presents the results of the Shapiro-Wilk test.

test = 1: Shapiro-Wilk (SW)

test = 2: Kolmogorov-Smirnov/Lilliefors (KSL)

test = 3: Anderson-Darling (AD)

test = 4: Jarque-Bera (JB)

test = 5: D'Agostino-Pearson omnibus (DP)

test = 6: Chi-square test (chiSq)

test = 7: Results of all tests for the best performing transformation

The order of the tests printed corresponds to the order of the variables stored within the input dataset.
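As an illustration of per-variable testing in column order, the sketch below applies base R's shapiro.test to each column of a data frame. It is not Rita's implementation; the data frame df and its column names are made up for the example.

```r
set.seed(10)

# Illustrative data: one roughly normal and one skewed variable.
df <- data.frame(a = rnorm(50), b = rexp(50))
alpha <- 0.05

# One Shapiro-Wilk p-value per column, in the order the columns are stored.
p_values <- vapply(df, function(col) shapiro.test(col)$p.value, numeric(1))

# TRUE where normality is rejected at the chosen threshold.
reject_normality <- p_values < alpha
print(p_values)
```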

xform

Desired normalization method (scalar). By default (xform = 1), Rita will assess which method performs best and (a.) return the transformed data to the user, and (b.) visualize the data according to the settings of the plotting arguments (autoPlot, histPlot, densPlot, stripPlot, violinPlot, xformPlot).

Please note that, per the recommendations of Osborne (2002), a constant is added prior to logarithmic and inverse transformations to ensure that the minimum value is anchored at 1, and prior to the square-root transformation to ensure a left anchor of 0.
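The anchoring described above amounts to shifting the data before transforming it. A minimal sketch follows, assuming the constant is simply chosen so that the minimum lands on the target value; the helper anchor_at is hypothetical and not part of Rita.

```r
# Hypothetical helper: shift x so that its minimum equals `target`.
anchor_at <- function(x, target) x - min(x) + target

x <- c(-3, 0, 2, 10)

log_x  <- log(anchor_at(x, 1))    # minimum anchored at 1, so the smallest log is 0
inv_x  <- 1 / anchor_at(x, 1)     # minimum anchored at 1, so the largest reciprocal is 1
sqrt_x <- sqrt(anchor_at(x, 0))   # minimum anchored at 0
```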

Similarly, the arc-sine and logit transformations are applied after converting the units, if needed, to ensure that variables are bounded between 0 and 1.
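For illustration, the sketch below applies both transformations to proportions. How Rita converts units is not specified on this page; dividing percentages by 100 is an assumption made for the example. The logit is undefined at exactly 0 and 1, so the example uses interior values only.

```r
pct <- c(5, 25, 50, 75, 95)   # illustrative percentages
p <- pct / 100                # assumed unit conversion to proportions in (0, 1)

arcsine_p <- asin(sqrt(p))    # arc-sine square-root transform
logit_p   <- qlogis(p)        # logit, i.e. log(p / (1 - p))
```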

The "best performing" method is identified by comparing, for each method, how closely the quantiles of the normalized data follow the straight line of a QQ plot against the quantiles of the standard normal distribution. If several transformations tie for a variable, one of them is selected arbitrarily.
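One way to quantify closeness to the QQ line is the squared correlation between the sorted data and the matching standard normal quantiles. The sketch below uses that measure as an assumption; the exact goodness-of-fit statistic used by Rita is not stated on this page, and the helper qq_fit is hypothetical.

```r
# Hypothetical helper: squared correlation between sample quantiles and
# standard normal quantiles (1 means the QQ plot is a perfect straight line).
qq_fit <- function(x) {
  x <- sort(x[!is.na(x)])
  theoretical <- qnorm(ppoints(length(x)))
  cor(theoretical, x)^2
}

set.seed(10)
raw <- rexp(200)   # illustrative right-skewed data

candidates <- list(
  raw  = raw,
  log  = log(raw - min(raw) + 1),
  sqrt = sqrt(raw - min(raw))
)

fits <- vapply(candidates, qq_fit, numeric(1))

# which.max returns the first maximum, so ties are broken by list order here.
best <- names(fits)[which.max(fits)]
print(fits)
```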

xform = 1: Best performing method is presented (excluding the Rankit)

xform = 2: Logarithmic transform

xform = 3: Inverse/reciprocal transform

xform = 4: Square-root transform

xform = 5: Arc-sine transform

xform = 6: Logit transform

xform = 7: Rankit transform
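For reference, the Rankit transform replaces each value with the standard normal quantile of its adjusted rank, (r - 0.5) / n. The following is a minimal sketch of that formula; the helper name rankit is hypothetical, and tie handling (average ranks here) may differ from Rita's.

```r
# Rankit: standard normal quantile of (rank - 0.5) / n.
# rank() assigns average ranks to ties by default.
rankit <- function(x) qnorm((rank(x) - 0.5) / length(x))

x <- c(10, 200, 3, 45, 7)
z <- rankit(x)
```

Because the result depends only on ranks, it is approximately normal for any continuous input, which is presumably why the Rankit is excluded from the automatic comparison under xform = 1.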

alpha

The two-sided decision threshold (significance level) used for normality hypothesis testing (scalar).

j

The number of hypotheses tested; used to compute a Bonferroni correction, if applicable. It should remain at its default if multiple testing is not an issue (scalar).
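A Bonferroni correction divides the threshold by the number of hypotheses, so each test is evaluated at alpha / j. A short sketch with made-up p-values:

```r
alpha <- 0.05
j <- 4                                     # number of hypotheses tested
alpha_adj <- alpha / j                     # Bonferroni-adjusted threshold

p_values <- c(0.030, 0.004, 0.200, 0.012)  # illustrative p-values
reject <- p_values < alpha_adj

# Equivalent formulation: inflate the p-values instead of shrinking alpha.
reject_alt <- p.adjust(p_values, method = "bonferroni") < alpha
```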

autoPlot

Desired plotting method (boolean). By default (autoPlot = T), the visualization is chosen automatically based on features extracted from the dataset.

When autoPlot = F, values of additional plotting arguments are used to determine the visualizations provided to the user.

When autoPlot = T:

Histograms are always generated for discrete data.

Density plots are always generated for continuous data.

Strip plots are generated when the number of distinct values is at most 20 AND the number of data points n satisfies 15 <= n <= 150.

Violin plots are generated in place of strip plots when the above conditions are not met.

Lastly, density plots for each (transformed*) variable are generated.

*Transformed variables correspond to the choice made by the user for the xform argument or to the best-performing transformation for each variable when xform = 1.

All plots are drawn in the R console and saved as plotting objects.
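The selection rules for autoPlot = T can be sketched as a small decision function. This is an illustration only: the helper choose_plots is hypothetical, and treating a variable as discrete when all of its values are whole numbers is an assumption, since this page does not say how Rita distinguishes discrete from continuous data.

```r
# Hypothetical helper: names of the plots implied by the rules above.
choose_plots <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  n_distinct <- length(unique(x))
  is_discrete <- all(x == round(x))   # assumption: whole numbers mean discrete

  c(
    if (is_discrete) "histogram" else "density",
    if (n_distinct <= 20 && n >= 15 && n <= 150) "strip" else "violin",
    "transformed density"
  )
}

discrete_choice   <- choose_plots(rep(1:5, times = 10))        # 50 points, 5 distinct values
continuous_choice <- choose_plots(seq(0.5, 100, by = 0.5))     # 200 points, all distinct
```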

histPlot

Whether to generate histograms for each variable (boolean).

densPlot

Whether to generate density plots for each variable (boolean).

stripPlot

Whether to draw strip plots for each variable (boolean).

violinPlot

Whether to draw violin plots for each variable (boolean).

xformPlot

Whether to draw density plots for each transformed variable (boolean).

return

Whether to return the transformed variables of the best performing method (return = T; default), or the cleaned, untransformed variables eligible for transformation (return = F) (boolean).

seed

Seed for the random number generator, used to make results reproducible (scalar).

Details

Any rows with missing values (NAs) are removed for calculation purposes; if desired, incomplete records should be imputed or removed with subsetting prior to calling Rita. In addition, note that any columns that are neither numeric nor coercible to numeric are excluded from analysis, as are any numeric columns with two or fewer distinct values.
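To preview which rows and columns would remain eligible, the cleaning rules can be approximated in base R. The sketch below is not Rita's code: the helper clean_sketch is hypothetical, the order of the steps is an assumption, and coercion of non-numeric columns is not attempted.

```r
# Hypothetical helper approximating the cleaning rules described above.
clean_sketch <- function(df) {
  df <- as.data.frame(df)
  # Keep numeric columns only (coercible columns are not handled here).
  df <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]
  # Remove rows with any missing value.
  df <- na.omit(df)
  # Drop columns with two or fewer distinct values.
  keep <- vapply(df, function(col) length(unique(col)) > 2, logical(1))
  df[, keep, drop = FALSE]
}

raw <- data.frame(
  score = c(1.2, 3.4, NA, 5.6, 7.8),
  flag  = c(0, 1, 0, 1, 0),                 # only two distinct values
  label = c("a", "b", "c", "d", "e"),       # not numeric
  stringsAsFactors = FALSE
)

cleaned <- clean_sketch(raw)
```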

Value

A list containing the transformed dataset from the best performing transformation for each variable, together with the specified plots.

Examples

values <- rnorm(100)
x <- Rita(data = values)
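
A further example, using only the arguments documented above, passes a data frame with two variables of equal sample size and requests histograms instead of automatic plotting. The data frame df and its column names are made up for illustration, and the call is guarded so that the snippet runs even where the Rita package is not installed.

```r
set.seed(10)

# Two variables with the same sample size, one per column.
df <- data.frame(
  normal_var = rnorm(100),
  skewed_var = rexp(100)
)

if (requireNamespace("Rita", quietly = TRUE)) {
  # Shapiro-Wilk test, best performing transformation, Bonferroni over 2 tests,
  # histograms only.
  out <- Rita::Rita(
    data = df,
    test = 1,
    xform = 1,
    alpha = 0.05,
    j = 2,
    autoPlot = FALSE,
    histPlot = TRUE
  )
}
```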

[Package Rita version 1.2.0 Index]