suggest_levels {regclass} | R Documentation |
Combining levels of a categorical variable
Description
This function determines levels that are similar to each other either in terms of their average value of some quantitative variable or the percentages of each level of a two-level categorical variable. Use it to get a rough idea of what levels are "about the same" with regard to some variable.
Usage
suggest_levels(formula,data,maxlevels=NA,target=NA,recode=FALSE,plot=TRUE,...)
Arguments
formula |
A standard R formula written as y~x. Here, x is the variable whose levels you wish to combine, and y is the quantitative or two-level categorical variable. |
data |
An optional argument giving the name of the data frame that contains x and y. If not specified, the function will use existing definitions in the parent environment. |
maxlevels |
The maximum number of combined levels to consider (cannot exceed 26). |
target |
The number of resulting levels into which the levels of x will be combined. Defaults to the suggested value: the fewest levels whose BIC is no more than 4 above the lowest BIC of any combination. |
recode |
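If FALSE, a table of the candidate combination schemes and their BICs is produced. If TRUE, conversion tables and a factor of the recoded levels are returned (see Details). |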
plot |
|
... |
Additional arguments used to make the plot. Typically this will be |
Details
This function calculates the average value (or percentage of each level) of y for each level of x. It then builds a partition model taking y to be this average value (or percentage) with x being the predictor variable. The first split yields the "best" scheme for combining levels of x into 2 values. The second split yields the "best" scheme for combining levels of x into 3 values, etc.
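The two steps described above (averaging y within each level, then splitting the levels on those averages) can be sketched in base R. This is a minimal illustration on hypothetical toy data, assuming an unweighted single split on the sorted level means; it is not the code suggest_levels itself runs, and the names x, y, mu, sse, and two_groups are made up for the example.

```r
# Toy data (hypothetical): x has five levels, y is quantitative.
x <- factor(rep(c("a", "b", "c", "d", "e"), each = 20))
mu <- c(a = 10, b = 10.2, c = 15, d = 15.1, e = 20)
# Deterministic offsets that sum to zero within each level
y <- unname(mu[as.character(x)]) + rep(c(-1, 0, 1, 0), times = 25)

# Step 1: average y within each level of x, sorted from low to high
level_means <- sort(tapply(y, x, mean))

# Step 2: try every cut point in the sorted means and keep the one that
# minimizes the within-group sum of squares (a single regression-tree split)
sse <- function(v) sum((v - mean(v))^2)
n <- length(level_means)
cost <- sapply(seq_len(n - 1), function(k) {
  sse(level_means[1:k]) + sse(level_means[(k + 1):n])
})
best_cut <- which.min(cost)

# Levels at or below the cut form group "A"; the rest form group "B"
two_groups <- ifelse(seq_len(n) <= best_cut, "A", "B")
names(two_groups) <- names(level_means)
two_groups
```

Repeating the split inside the resulting groups would give candidate schemes with 3, 4, and more combined levels.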
The argument maxlevels specifies the maximum number of levels in the combination scheme. By default, it equals the number of levels of x (i.e., no combination). Setting it to a lower number saves time, since a small number of combined levels is usually what is wanted. Use it to see how different combination schemes compare.
The argument target forces the algorithm to produce exactly this number of combined levels. This is useful once you have determined how many levels of x you want.
If recode is FALSE, the function produces a table showing the combined levels along with the "BIC" of each combination scheme (lower is better, but a difference of around 4 or less is negligible). The suggested combination is the one with the fewest levels whose BIC is no more than 4 above the lowest BIC of any scheme.
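The "within 4 of the lowest BIC" rule can be written out in a few lines. The sketch below uses hypothetical BIC values for schemes with 1 to 6 combined levels; the numbers and the variable names are assumptions for illustration, not output from suggest_levels.

```r
# Hypothetical BIC values, named by the number of combined levels
bic <- c(`1` = 512.3, `2` = 498.1, `3` = 495.0,
         `4` = 494.2, `5` = 497.8, `6` = 501.5)
n_levels <- as.integer(names(bic))

# Schemes whose BIC is no more than 4 above the lowest BIC
close_enough <- bic <= min(bic) + 4

# The suggestion is the scheme with the fewest levels among those
suggested <- min(n_levels[close_enough])
suggested
```

Here the lowest BIC belongs to the 4-level scheme, but the 2-level scheme is within 4 of it, so the simpler scheme is the one suggested.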
If recode is TRUE, a list of three elements is produced. $Conversion1 gives a table of the Old and New levels alphabetized by Old, while $Conversion2 gives the same table alphabetized by New. $newlevels gives a factor containing each case's level under the new combination scheme. If target is not set, the suggested number of levels is used.
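A conversion table of this kind can also be applied by hand, for example to recode a second data set with the same scheme. The sketch below assumes the table is a data frame with character columns named Old and New; that layout, the level labels, and the mapping itself are hypothetical and should be checked against the actual $Conversion1 output.

```r
# Hypothetical conversion table with one row per original level
conversion <- data.frame(
  Old = c("C", "R", "S", "T", "U"),
  New = c("A", "B", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# A factor of original levels to be recoded
old <- factor(c("C", "R", "S", "T", "U", "C", "S"))

# Look up each case's original level in the table and take its new level
new <- factor(conversion$New[match(as.character(old), conversion$Old)])

# Cross-tabulate to confirm that each old level maps to one new level
table(Old = old, New = new)
```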
Author(s)
Adam Petrie
References
Introduction to Regression and Modeling
Examples
data(DONOR)
#Can levels of URBANICITY be treated the same with regard to probability of donation?
#Analysis suggests yes (all levels in one)
suggest_levels(Donate~URBANICITY,data=DONOR)
#Can levels of URBANICITY be treated the same with regard to donation amount?
#Analysis suggests yes, but perhaps there are four "effective levels"
suggest_levels(Donation.Amount~URBANICITY,data=DONOR)
SL <- suggest_levels(Donation.Amount~URBANICITY,data=DONOR,target=4,recode=TRUE)
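#Old and New levels, alphabetized by Old ($Conversion is ambiguous between Conversion1 and Conversion2)
SL$Conversion1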
#Add a column to the DONOR data frame that contains the combined URBANICITY levels
DONOR$newURBANICITY <- SL$newlevels