igate {igate} | R Documentation |
igate function for continuous target variables
Description
This function performs an initial Guided Analysis for parameter testing and controlband extraction (iGATE) on a dataset and returns those parameters found to be influential.
Usage
igate(df, versus = 8, target, test = "w", ssv = NULL,
outlier_removal_target = TRUE, outlier_removal_ssv = TRUE,
good_end = "low", savePlots = FALSE, image_directory = tempdir())
Arguments
df |
Data frame to be analysed. |
versus |
How many Best of the Best and Worst of the Worst do we collect? By default, we will collect 8 of each. |
target |
Target variable to be analysed. Must be continuous. Use |
test |
Statistical hypothesis test to be used to determine influential
process parameters. Choose between Wilcoxon Rank test ( |
ssv |
A vector of suspected sources of variation. These are the variables
in |
outlier_removal_target |
Logical. Should outliers (with respect to the target variable)
be removed from df (default: |
outlier_removal_ssv |
Logical. Should outlier removal be performed for each ssv (default: |
good_end |
Are low (default) or high values of target variable good? This is needed to determine the control bands. |
savePlots |
Logical, only relevant if |
image_directory |
Directory to which plots should be saved. This is only used if |
Details
We collect the Best of the Best and the Worst of the Worst dynamically, depending on the current ssv. That means that for each ssv we first remove from df all observations with missing values for that ssv. Then, based on the remaining observations, we select versus observations with the best values for the target variable (“Best of the Best”, short BOB) and versus observations with the worst values for the target variable (“Worst of the Worst”, short WOW). By default, we select 8 of each. Next, we compare BOB and WOW using the counting method and the specified hypothesis test. If the distributions of the ssv in BOB and WOW are significantly different, the current ssv has been identified as influential to the target variable. An ssv is considered influential if the counting method returns a count of at least 6, the hypothesis test returns a p-value of less than 0.05, or both.
For the next ssv we again start with the entire dataset df, remove all observations with missing values for that new ssv and then select our new BOB and WOW. In particular, we might select different observations for each ssv. This dynamic selection is necessary because, for an incomplete data set, selecting the same BOB and WOW for all ssv could leave many missing values for particular ssv. In that case the hypothesis test loses statistical power because it is applied to a smaller sample or, worse, might fail altogether if the sample size gets too small.
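The per-ssv selection described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: select_bob_wow is a hypothetical helper, ties in the target are broken by row order rather than randomly, and only the Wilcoxon test is shown (the counting method is left out).

```r
# Hypothetical helper (not part of igate): pick BOB and WOW for one ssv.
select_bob_wow <- function(df, target, ssv, versus = 8, good_end = "low") {
  # Drop observations with a missing value for the current ssv only.
  complete <- df[!is.na(df[[ssv]]), , drop = FALSE]
  # Sort by target; ties are broken by row order here (an assumption).
  ordered <- complete[order(complete[[target]]), , drop = FALSE]
  low <- head(ordered, versus)
  high <- tail(ordered, versus)
  if (good_end == "low") {
    list(BOB = low, WOW = high)
  } else {
    list(BOB = high, WOW = low)
  }
}

groups <- select_bob_wow(iris, target = "Sepal.Length", ssv = "Petal.Length")

# Compare the ssv distribution in BOB and WOW with a Wilcoxon rank test.
p <- wilcox.test(groups$BOB$Petal.Length, groups$WOW$Petal.Length,
                 exact = FALSE)$p.value
influential <- p < 0.05
```

Repeating the call with a different ssv restarts from the full data frame, so each ssv may end up with a different BOB and WOW.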
For those ssv determined to be significant, control bands are extracted. The rationale is:
If the value for an ssv is in the interval [good_lower_bound, good_upper_bound], the target is likely to be good. If it is in the interval [bad_lower_bound, bad_upper_bound], the target is likely to be bad.
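As a rough illustration of how the control bands might be read, the sketch below labels new ssv values against one row of bounds. The helper classify_by_band and the numeric bounds are hypothetical and chosen only for the example; they are not output of igate.

```r
# Hypothetical helper: label ssv values using one row of control bands.
classify_by_band <- function(x, band) {
  in_good <- x >= band$good_lower_bound & x <= band$good_upper_bound
  in_bad <- x >= band$bad_lower_bound & x <= band$bad_upper_bound
  ifelse(in_good & !in_bad, "likely good",
         ifelse(in_bad & !in_good, "likely bad", "undetermined"))
}

# Made-up bounds for illustration only.
band <- data.frame(good_lower_bound = 1.0, good_upper_bound = 1.6,
                   bad_lower_bound = 5.8, bad_upper_bound = 6.9)

labels <- classify_by_band(c(1.4, 6.1, 3.0), band)
```

Values that fall in neither band, or in both if the bands overlap, are labelled "undetermined" in this sketch.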
Furthermore, some summary statistics are provided. When selecting the versus BOB/WOW, tied values for target can mean that the versus BOB/WOW are not uniquely determined. In that case we randomly select from the tied observations to give us exactly versus observations per group. ties_lower_end, competition_lower_end, ties_upper_end and competition_upper_end quantify this randomness. To interpret these values: lower end refers to the group whose target values are low and upper end to the one whose target values are high. For example, if a low value for target is good, lower end refers to the BOB and upper end to the WOW. We determine the versus BOB/WOW via
lower_end <- df[min_rank(df$target)<=versus,]
If there are tied observations, nrow(lower_end) can be larger than versus. In ties_lower_end we record how many observations in lower_end$target share the highest value, and in competition_lower_end we record for how many places they are competing, i.e.
competing_for_lower <- versus - (nrow(lower_end) - ties_lower_end)
The values for ties_upper_end and competition_upper_end are determined analogously.
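A small worked example may make the tie statistics concrete. It is a sketch on a toy data frame, using base R's rank(ties.method = "min") as a stand-in for min_rank; the random draw at the end is one possible way to resolve the competition and is an assumption, not necessarily what the package does internally.

```r
# Toy data: three observations tie at the value 3.
df <- data.frame(target = c(1, 2, 2, 3, 3, 3, 4, 5))
versus <- 4

# Base R stand-in for min_rank(): tied values share the smallest rank.
lower_end <- df[rank(df$target, ties.method = "min") <= versus, , drop = FALSE]

# Observations sharing the highest target value inside lower_end.
ties_lower_end <- sum(lower_end$target == max(lower_end$target))

# Places left for the tied observations once the others are seated.
competition_lower_end <- versus - (nrow(lower_end) - ties_lower_end)

# Assumed tie-breaking: draw the winners at random from the tied rows.
tied_idx <- which(lower_end$target == max(lower_end$target))
keep_tied <- tied_idx[sample.int(length(tied_idx), competition_lower_end)]
untied_idx <- setdiff(seq_len(nrow(lower_end)), tied_idx)
selected <- lower_end[c(untied_idx, keep_tied), , drop = FALSE]
```

Here six observations have a rank of at most 4, three of them are tied at the top of that group, and they compete for a single remaining place.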
Value
A data frame with the following columns
Causes | Those ssv that have been found to be influential to the target variable. |
Count | The value returned by the counting method. |
p.value | The p-value of the hypothesis test performed, i.e. either of the Wilcoxon rank test (in case test = "w") or the t-test (if test = "t"). |
good_lower_bound | The lower bound for this Cause for good quality. |
good_upper_bound | The upper bound for this Cause for good quality. |
bad_lower_bound | The lower bound for this Cause for bad quality. |
bad_upper_bound | The upper bound for this Cause for bad quality. |
na_removed | How many missing values were in the data set for this Cause? |
ties_lower_end | Number of tied observations at lower end of target when selecting the versus BOB/WOW. |
competition_lower_end | For how many positions are the observations counted in ties_lower_end competing? |
ties_upper_end | Number of tied observations at upper end of target when selecting the versus BOB/WOW. |
competition_upper_end | For how many positions are the observations counted in ties_upper_end competing? |
adjusted.p.values | The p-values adjusted via Bonferroni correction. |
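For reference, a Bonferroni correction multiplies each p-value by the number of tests and caps the result at 1. The sketch below uses p.adjust from base R's stats package on made-up p-values; which set of p-values igate adjusts over is not stated here, so treat the input as an assumption.

```r
# Made-up p-values for three hypothetical ssv.
p_values <- c(0.001, 0.02, 0.04)

# Bonferroni: multiply by the number of tests, capped at 1.
adjusted <- p.adjust(p_values, method = "bonferroni")
```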
Examples
igate(iris, target = "Sepal.Length")