R: Combining levels of a categorical variable

suggest_levels {regclass}

R Documentation

Combining levels of a categorical variable

Description

This function identifies levels of a categorical variable that are similar to each other, either in terms of the average value of some quantitative variable or in terms of the percentage of each level of a two-level categorical variable. Use it to get a rough idea of which levels are "about the same" with regard to that variable.

Usage

suggest_levels(formula,data,maxlevels=NA,target=NA,recode=FALSE,plot=TRUE,...)

Arguments

`formula`	A standard R formula written as y~x. Here, x is the variable whose levels you wish to combine, and y is the quantitative or two-level categorical variable.
`data`	An optional argument giving the name of the data frame that contains x and y. If not specified, the function will use existing definitions in the parent environment.
`maxlevels`	The maximum number of combined levels to consider (cannot exceed 26).
`target`	The number of resulting levels into which the levels of x will be combined. Defaults to the suggested value: the fewest number of levels whose BIC is no more than 4 above the lowest BIC of any combination.
`recode`	`TRUE` or `FALSE`. If `TRUE`, the function outputs a conversion table as well as the new level identities.
`plot`	`TRUE` or `FALSE`. If `TRUE`, a plot is provided which shows the distribution of `y` for each level of `x` and lines showing which levels are grouped together.
`...`	Additional arguments used to make the plot. Typically this will be `equal=TRUE` and `inside=TRUE` to be passed to `mosaic`.

Details

This function calculates the average value (or percentage of each level) of y for each level of x. It then builds a partition model taking y to be this average value (or percentage) with x being the predictor variable. The first split yields the "best" scheme for combining levels of x into 2 values. The second split yields the "best" scheme for combining levels of x into 3 values, etc.
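The first split described above can be illustrated with a small base R sketch. This is not the package's implementation (the partition model itself is not reproduced here). It only shows the idea for a quantitative y: compute the per-level averages, order them, and find the cut that best separates them into two groups. The data and every object name below are hypothetical.

```r
# Hypothetical data: a five-level factor x and a quantitative y whose
# per-level averages are exactly 10, 10.2, 15, 15.1 and 20.
x <- factor(rep(c("a", "b", "c", "d", "e"), each = 20))
y <- c(10, 10.2, 15, 15.1, 20)[as.integer(x)] +
  rep(seq(-1, 1, length.out = 20), times = 5)

# Step 1: average of y within each level of x, sorted from lowest to highest.
level_means <- sort(sapply(split(y, x), mean))

# Step 2: try every cut point along the sorted averages and keep the one that
# minimizes the within-group sum of squared deviations of the averages.
sse <- function(v) sum((v - mean(v))^2)
cuts <- seq_len(length(level_means) - 1)
total_sse <- sapply(cuts, function(k) {
  sse(level_means[1:k]) + sse(level_means[(k + 1):length(level_means)])
})
best_cut <- cuts[which.min(total_sse)]

# Levels on each side of the best cut form the two combined groups.
two_groups <- ifelse(seq_along(level_means) <= best_cut, "A", "B")
names(two_groups) <- names(level_means)
two_groups
```

In this sketch, levels a and b land in one group and c, d and e in the other. Schemes with more groups would come from splitting one of the existing groups again in the same way.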

The argument maxlevels specifies the maximum number of levels in the combination scheme. By default, it is the number of levels of x (i.e., no combination). Setting it to a lower number saves time, since a small number of combined levels is most likely what is desired. The argument is useful for seeing how different combination schemes compare.

The argument target forces the algorithm to produce exactly this number of combined levels. This is useful once you have determined how many levels of x you want.

If recode is FALSE, the function produces a table showing the combined levels along with the "BIC" of each combination scheme (lower is better, but a difference of around 4 or less is negligible). The suggested combination is the one with the fewest levels whose BIC is no more than 4 above that of the scheme with the lowest BIC.
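The selection rule for the suggested number of levels can be written in a few lines of base R. The BIC values below are made up for illustration and do not come from any real call to the function.

```r
# Hypothetical BIC values for schemes with 1, 2, ..., 6 combined levels.
bic <- c(412.3, 380.1, 371.6, 369.0, 370.2, 373.8)
names(bic) <- seq_along(bic)

# Schemes whose BIC is within 4 of the lowest BIC are treated as equivalent.
close_enough <- bic <= min(bic) + 4

# The suggestion is the fewest number of levels among those schemes.
suggested <- min(which(close_enough))
suggested
```

Here the lowest BIC belongs to the four-level scheme, but the three-level scheme is within 4 of it, so three levels would be suggested.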

If recode is TRUE, a list of three elements is produced. $Conversion1 gives a table of the Old and New levels alphabetized by Old, while $Conversion2 gives a table of the Old and New levels alphabetized by New. $newlevels gives a factor containing each case's level under the new combination scheme. If target is not set, the suggested number of levels is used.
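A conversion table in the Old/New layout can also be applied by hand, for example to recode the same variable in a different data frame. The sketch below uses base R only. The table contents, the level names, and the labels "A" and "B" are hypothetical stand-ins and may not match the labels the function actually assigns.

```r
# Hypothetical conversion table in the Old/New layout described above.
conversion <- data.frame(
  Old = c("City", "Rural", "Suburb", "Town"),
  New = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical vector of original levels, e.g. a column in new data.
x_new <- factor(c("Town", "City", "City", "Rural", "Suburb"))

# Look up each case's old level in the table and take the matching new level.
recoded <- factor(conversion$New[match(as.character(x_new), conversion$Old)])
recoded

# The same table ordered by New instead of Old.
conversion[order(conversion$New, conversion$Old), ]
```

Any old level that is missing from the table comes back as NA, which makes unmatched levels easy to spot.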

Author(s)

Adam Petrie

References

Introduction to Regression and Modeling

Examples

 
 
  data(DONOR)
  
  #Can levels of URBANICITY be treated the same with regard to probability of donation?
  #Analysis suggests yes (all levels in one)
  suggest_levels(Donate~URBANICITY,data=DONOR)

  #Can levels of URBANICITY be treated the same with regard to donation amount?
  #Analysis suggests yes, but perhaps there are four "effective levels"
  
  suggest_levels(Donation.Amount~URBANICITY,data=DONOR)
  SL <- suggest_levels(Donation.Amount~URBANICITY,data=DONOR,target=4,recode=TRUE)
  #Conversion table of Old and New levels, alphabetized by Old
  SL$Conversion1

  #Add a column to the DONOR dataframe that contains these new level identities
  DONOR$newCLUSTER_CODE <- SL$newlevels

[Package regclass version 1.6 Index]