catdap2 {catdap}  R Documentation 
Search for the best single explanatory variable and detect the best subset of explanatory variables.
catdap2(data, pool = NULL, response.name, accuracy = NULL, nvar = NULL, additional.output = NULL, missingmark = NULL, pa1 = 1, pa2 = 4, pa3 = 10, print.level = 0, plot = 1)
data 
data matrix with variable names on the first row. 
pool 
the ways of pooling to categorize each variable must be specified by integer parameters:

response.name 
variable name of the response variable. 
accuracy 
minimum width for the discretization for each variable. 
nvar 
number of variables to be retained for the analysis of
multidimensional tables. Default is the number of variables in 
additional.output 
list of sets of explanatory variable names for additional output. 
missingmark 
positive number for handling missing values. See 'Details'. 
pa1, pa2, pa3 
control parameters for the size of the working area. If an error message is output, change the value of the indicated parameter as the message suggests. 
print.level 
this argument determines the level of output printing. The
default value of ' 
plot 
split directions for each level of the mosaic:

This function is an R-function style clone of Sakamoto's CATDAP-02 program for categorical data analysis. CATDAP-02 can be used to search for the best subset of explanatory variables, that is, the subset carrying the most effective information on a specified response variable. Continuous variables can also be used as explanatory variables. In that case CATDAP-02 searches for the optimal categorization of the continuous values.
The basic statistic adopted is obtained by the application of the statistic AIC to the models.
Let E denote the response variable and F a candidate explanatory variable, and denote their cell frequencies by n_E(i) (i in (E)) and n_F(j) (j in (F)). The cross frequency is denoted by n_E,F(i,j) (i,j in (E,F)). To measure the strength of dependence of a specific set of response variables E on the explanatory variable F, we use the following statistic:
AIC(E;F) = -2 ∑_{i,j in (E,F)} n_E,F(i,j) ln{n_E,F(i,j) / n_F(j)} + 2(C_E - 1)C_F, (1)
where C_E and C_F denote the total number of categories of the corresponding sets of variables, respectively.
The selection of the best subset of explanatory variables is realized by the search for F which gives the minimum AIC(E;F).
In case of F=φ, the formula (1) reduces to
AIC(E;φ) = -2 ∑_{i in (E)} n_E(i) ln{n_E(i) / n} + 2(C_E - 1).
Here it is assumed that C_φ=1 and n_φ(1)=n.
Sakamoto's original CATDAP outputs AIC(E;F) - AIC(E;φ) as the AIC value instead of AIC(E;F). With this convention, a positive AIC value indicates that the variable F is judged to be useless as an explanatory variable for E.
On the other hand, this convention makes it impossible to compare the goodness of the CATDAP model with that of other models, logit models for example.
For the convenience of users, the present "R version CATDAP" provides not only AIC = AIC(E;F) - AIC(E;φ) but also AIC(E;φ). The latter value is given as base_AIC in the output.
Users can recover AIC(E;F) by adding AIC and base_AIC.
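As a rough illustration of formula (1), the sketch below computes AIC(E;F), AIC(E;φ) and their difference from a two-way frequency table using base R only. The helper names aic_ef and aic_e0 and the small table tab are hypothetical and are not part of the catdap package. Empty cells are assumed to contribute zero to the sum.

```r
# Hypothetical helper (not part of catdap): AIC(E;F) from a frequency table
# whose rows are the categories of E and whose columns are the categories of F.
aic_ef <- function(tab) {
  tab <- as.matrix(tab)
  n_f <- colSums(tab)                              # n_F(j)
  ratio <- sweep(tab, 2, n_f, "/")                 # n_E,F(i,j) / n_F(j)
  terms <- ifelse(tab > 0, tab * log(ratio), 0)    # assumption: empty cells add 0
  -2 * sum(terms) + 2 * (nrow(tab) - 1) * ncol(tab)
}

# AIC(E;phi): collapse F into a single category (C_phi = 1, n_phi(1) = n).
aic_e0 <- function(tab) aic_ef(matrix(rowSums(as.matrix(tab)), ncol = 1))

# Made-up 2 x 2 table: rows = response E, columns = explanatory variable F.
tab <- matrix(c(30, 10, 10, 30), nrow = 2)

base_aic <- aic_e0(tab)               # AIC(E;phi), reported as base_AIC
aic_diff <- aic_ef(tab) - base_aic    # the difference CATDAP reports as AIC
aic_diff + base_aic                   # recovers AIC(E;F)
```

For this made-up table the difference is negative, so F would be judged informative about E under the convention described above.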
The argument missingmark enables missing value handling.
When a positive value, say 1000, is set here, any value x
greater than or equal to 1000 is treated as a missing value. If
1000 <= x < 2000, x is treated as a missing
value of the 1st type. If 2000 <= x < 3000, x
is treated as a missing value of the 2nd type, and so on. In general,
any x such that 1000*k <= x < 1000*(k+1) is
treated as a missing value of the k-th type. Users are referred to the
references for the technical details of the missing value handling procedure.
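The classification rule can be sketched in a few lines of base R. The helper missing_type is hypothetical and not part of the catdap package. It also assumes that the rule stated for 1000 carries over to any positive missingmark value, which the text above implies but does not state explicitly.

```r
# Hypothetical helper (not part of catdap): type of missing value for each x.
# Assumption: missingmark * k <= x < missingmark * (k + 1) gives type k.
missing_type <- function(x, missingmark) {
  k <- floor(x / missingmark)
  ifelse(x >= missingmark, k, 0)   # 0 marks an ordinary (non-missing) value
}

missing_type(c(12.5, 999, 1000, 1500, 2000, 3999), missingmark = 1000)
# 0 0 1 1 2 3
```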
For continuous variables, we assume that b(1), b(2), …, b(m+1) are boundary values of m bins. Output value ranges r(i) (1 <= i <= m) are defined as follows:
r(i) = [ b(i), b(i+1) ) for 1 <= i < m,
r(m) = [ b(m), b(m+1) ] .
Specifically, for a continuous response variable V,
r(i) = [ x_min + (i-1)*s, x_min + i*s ) for 1 <= i < m,
r(m) = [ x_min + (m-1)*s, x_max ] ,
where x_min and x_max are the minimum and the maximum of variable V respectively and s = (x_max - x_min) / m.
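The equal-width ranges above can be reproduced approximately with base R's cut(). This is only a sketch of the interval definition, not the routine used inside catdap2; the helper name equal_width_bins and the choice m = 4 are hypothetical.

```r
# Hypothetical helper (not part of catdap): m equal-width bins for x.
equal_width_bins <- function(x, m) {
  x_min <- min(x)
  x_max <- max(x)
  s <- (x_max - x_min) / m
  b <- x_min + (0:m) * s     # boundaries b(1), ..., b(m+1)
  b[m + 1] <- x_max          # guard against rounding in the last boundary
  # right = FALSE gives [b(i), b(i+1)); include.lowest = TRUE closes the
  # last bin on the right, so r(m) = [b(m), b(m+1)].
  cut(x, breaks = b, right = FALSE, include.lowest = TRUE)
}

table(equal_width_bins(iris$Petal.Width, m = 4))
```

Values that fall exactly on an interior boundary may be assigned to either neighbouring bin because of floating-point rounding, so results near boundaries should be checked.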
tway.table 
two-way tables. 
total 
total number of data with corresponding code of variables. 
interval 
class interval for continuous and discrete explanatory variables. 
base.aic 
base_AIC. 
aic 
AICs of the single explanatory variables. 
aic.order 
list of explanatory variable numbers arranged in ascending order of AIC. 
nsub 
number of subsets of explanatory variables. 
subset 
list of subsets of explanatory variables in ascending order of AIC with the following components:

ctable 
contingency table constructed by the best subset and additional
subsets if any variables is specified by 
ctable.interval 
class interval for continuous and discrete explanatory variables in contingency table. 
caic 
AIC of subset of explanatory variables in contingency table. 
missing 
number of types of the missing values for each variable. 
K. Katsura and Y. Sakamoto (1980) Computer Science Monograph, No.14, CATDAP, A Categorical Data Analysis Program Package. The Institute of Statistical Mathematics.
Y. Sakamoto (1985) Model Analysis of Categorical Data. Kyoritsu Shuppan Co., Ltd., Tokyo. (in Japanese)
Y. Sakamoto (1985) Categorical Data Analysis by AIC. Kluwer Academic Publishers.
An AIC-based Tool for Data Visualization (2015), NTT DATA Mathematical Systems Inc. (in Japanese)
library(catdap)

# Example 1 (medical data "HealthData")
# as additional output, contingency tables for explanatory variable sets
# c("aortic.wav","min.press") and c("ecg","age") are obtained.
data(HealthData)
catdap2(HealthData, c(2, 2, 2, 0, 0, 0, 0, 2), "symptoms",
        c(0., 0., 0., 1., 1., 1., 0.1, 0.), ,
        list(c("aortic.wav", "min.press"), c("ecg", "age")))

# Example 2 (Edgar Anderson's Iris Data)
# continuous response variable handling and the usage of Barplot2WayTable
# function to visualize the result in shape of stacked histogram.
data(iris)
resvar <- "Petal.Width"
z <- catdap2(iris, c(0, 0, 0, 7, 2), resvar, c(0.1, 0.1, 0.1, 0.1, 0))
z

vname <- names(iris)
exvar <- c("Sepal.Length", "Petal.Length")
Barplot2WayTable(vname, resvar, exvar, z$tway.table, z$interval)

# Example 3 (in the case of a large number of variables)
data(HelloGoodbye)
pool <- rep(2, 56)

## using the default values of parameters pa1, pa2, pa3
## catdap2(HelloGoodbye, pool, "Isay", nvar = 10, print.level = 1, plot = 0)
## Error : Working area for contingency table is too short, try pa1 = 12.

### According to the error message, set the parameter pa1 to 12, then ..
catdap2(HelloGoodbye, pool, "Isay", nvar = 10, pa1 = 12, print.level = 1,
        plot = 0)

# Example 4 (HealthData with missing values)
data(MissingHealthData)
catdap2(MissingHealthData, c(2, 2, 2, 0, 0, 0, 0, 2), "symptoms",
        c(0., 0., 0., 1., 1., 1., 0.1, 0.), missingmark = 300)