R: Empirical Classification Analysis (CA) and Inference

ca {SortedEffects}

R Documentation

Empirical Classification Analysis (CA) and Inference

Description

ca conducts CA estimation and inference on user-specified objects of interest: the first (weighted) moment or the (weighted) distribution. Users can use t to specify the variables of interest. When the object of interest is a moment, use cl to specify whether to report the averages of the two groups or the difference between them.

Usage

ca(
  fm,
  data,
  method = c("ols", "logit", "probit", "QR"),
  var_type = c("binary", "continuous", "categorical"),
  var,
  compare,
  subgroup = NULL,
  samp_weight = NULL,
  taus = c(5:95)/100,
  u = 0.1,
  interest = c("moment", "dist"),
  t = c(1, 1, rep(0, dim(data)[2] - 2)),
  cl = c("both", "diff"),
  cat = NULL,
  alpha = 0.1,
  b = 500,
  parallel = FALSE,
  ncores = detectCores(),
  seed = 1,
  bc = TRUE,
  range_cb = c(1:99)/100,
  boot_type = c("nonpar", "weighted")
)

Arguments

`fm`	Regression formula
`data`	The data in use: full sample or subpopulation of interest
`method`	Models to be used for estimating partial effects. Four options: `"logit"` (binary response), `"probit"` (binary response), `"ols"` (interactive linear with additive errors), `"QR"` (linear model with non-additive errors). Default is `"ols"`.
`var_type`	The type of the variable of interest. Three options: `"binary"`, `"categorical"`, `"continuous"`. Default is `"binary"`.
`var`	Variable T of interest. Should be a character.
`compare`	If the variable of interest is categorical, the user needs to specify which two categories to compare. Should be a 1 by 2 character vector. For example, if the two levels to compare are 1 and 3, then `compare = c("1", "3")`, which will calculate the partial effect of moving from 1 to 3. To use this option, users first need to specify `var` as a factor variable.
`subgroup`	Subgroup of interest. Default is `NULL`. The specification should be a logical vector. For example, suppose the data contain an indicator variable for women (female if 1, male if 0). If users are interested in the SPE for women, they should specify `subgroup = data[, "female"] == 1`.
`samp_weight`	Sampling weight of data. Input should be an n by 1 vector, where n denotes sample size. Default is `NULL`.
`taus`	Indexes for quantile regression. Default is `c(5:95)/100`.
`u`	Percentile of most and least affected. Default is set to be 0.1.
`interest`	Generic objects in the least and most affected subpopulations. Two options: (1) `"moment"`: weighted mean of Z in the u-least/most affected subpopulation. (2) `"dist"`: distribution of Z in the u-least/most affected subpopulation. Default is `interest = "moment"`.
`t`	Index of the variables of interest for classification analysis. Users can either (1) specify the names of the variables of interest directly, or (2) supply a 1 by ncol(data) indicator vector in which 1 marks a variable of interest. For example, if the total number of variables is 5 and the 1st and 3rd variables are of interest, then specify `t = c(1, 0, 1, 0, 0)`.
`cl`	If `interest = "moment"`, `cl` controls how the variables of interest (specified in the `t` option) are reported for the most and least affected groups. The default is `"both"`, which shows the variables for each of the two groups; the alternative is `"diff"`, which shows the difference between the two groups. The user can use `summary.ca` to tabulate the results, which also contain the standard errors and p-values. If `interest = "dist"`, this option has no effect and the user can leave it at the default value.
`cat`	P-values in classification analysis are adjusted for multiplicity to account for joint testing of zero coefficients for all variables within a category. Suppose we have specified 3 variables of interest: `t = c("a", "b", "c")`. Without loss of generality, assume `"a"` is not a factor, while `"b"` and `"c"` are two factors. Then users need to specify `cat = c("b", "c")`. Default is `NULL`.
`alpha`	Size for confidence interval. Should be between 0 and 1. Default is 0.1.
`b`	Number of bootstrap draws. Default is 500.
`parallel`	Whether the user wants to use parallel computation. The default is `FALSE` and only 1 CPU will be used. The other option is `TRUE`, and user can specify the number of CPUs in the `ncores` option.
`ncores`	Number of cores for computation. Default is set to be `detectCores()`, which is a function from package `parallel` that detects the number of CPUs on the current host. For large datasets, parallel computing is highly recommended since the bootstrap is time-consuming.
`seed`	Seed for pseudo-random number generation, used for reproducibility. Default is 1.
`bc`	Whether the estimate should be bias-corrected. Default is `TRUE`. If `FALSE`, the uncorrected estimate and the corresponding confidence bands will be reported.
`range_cb`	When `interest = "dist"`, the package sorts the variables of interest and keeps their unique values in order to estimate the weighted CDF. For large datasets, storing a very large number of observations can cause memory problems, so users can provide quantile indices in `range_cb` and the package will sort and deduplicate based on the corresponding weighted quantiles. If users don't want this feature, set `range_cb = NULL`. Default is `c(1:99)/100`.
`boot_type`	Type of bootstrap. Default is `"nonpar"`, and the package implements nonparametric bootstrap. The alternative is `"weighted"`, and the package implements weighted bootstrap.
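
The argument formats described above for `t`, `subgroup` and `compare` can be sketched in base R as follows. The data frame `toy` and its column names are hypothetical and serve only to show the expected shapes; nothing here calls the package itself.

```r
# Toy data; the column names are hypothetical and only illustrate argument formats.
toy <- data.frame(
  y      = c(0, 1, 0, 1, 1),
  female = c(1, 0, 1, 1, 0),
  age    = c(25, 40, 31, 58, 47),
  region = factor(c("1", "2", "3", "1", "3"))
)

# `t` as a 0/1 indicator vector of length ncol(data): mark `y` and `age`.
t_index <- as.numeric(names(toy) %in% c("y", "age"))

# `subgroup` as a logical vector with one entry per row of the data.
sub <- toy[, "female"] == 1

# `compare` as a character vector naming two levels of a factor variable.
cmp <- c("1", "3")
stopifnot(all(cmp %in% levels(toy$region)))
```

Building the indicator from column names avoids counting positions by hand, which is error-prone when the data have many columns.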

Details

All estimates are bias-corrected and all confidence bands are monotonized. The bootstrap procedures follow algorithm 2.2 as in Chernozhukov, Fernandez-Val and Luo (2018).
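
To make the idea of classifying observations into least and most affected groups concrete, here is a minimal base R sketch on simulated data. It is not the package's implementation: it omits bias correction, the bootstrap and sampling weights, and the variable names, the simulated model and the sign convention of the difference (most minus least) are all assumptions made for illustration.

```r
set.seed(1)
n <- 500

# Simulated data: binary variable of interest `treat`, one covariate `x`.
d <- data.frame(treat = rbinom(n, 1, 0.5), x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$treat + 1.0 * d$x))

# Logit model for the binary response.
fit <- glm(y ~ treat + x, family = binomial(link = "logit"), data = d)

# Partial effect of switching `treat` from 0 to 1, one value per observation.
d1 <- transform(d, treat = 1)
d0 <- transform(d, treat = 0)
pe <- predict(fit, newdata = d1, type = "response") -
  predict(fit, newdata = d0, type = "response")

# Classify the u-least and u-most affected observations.
u <- 0.1
least <- pe <= quantile(pe, u)
most  <- pe >= quantile(pe, 1 - u)

# Mean of a characteristic in each group, and the difference between groups.
est <- c(least = mean(d$x[least]), most = mean(d$x[most]))
diff_est <- unname(est["most"] - est["least"])
```

Here `est` plays the role of a "both"-style report (one average per group) and `diff_est` the role of a "diff"-style report.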

Value

If subgroup = NULL, all outputs are for the whole sample. Otherwise the outputs are subgroup results. When interest = "moment", the output is a list containing

est Estimates of variables in interest.
bse Bootstrap standard errors.
joint_p P-values that are adjusted for multiplicity to account for joint testing for all variables.
pointwise_p P-values that are not adjusted for joint testing.

If users have further specified cat (i.e., !is.null(cat)), the fourth component will be replaced with p_cat: P-values that are adjusted for multiplicity to account for joint testing for all variables within a category. Users can use summary.ca to tabulate the results.

When interest = "dist", the output is a list of two components:

infresults A list that stores estimates, upper and lower confidence bounds for all variables in interest for least and most affected groups.
sortvar A list that stores the sorted unique values of the variables of interest.

We recommend using plot.ca to visualize the results.
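
As a rough illustration of what a weighted CDF evaluated on sorted unique values looks like, consider the base R sketch below. The vectors `z` and `w` are hypothetical, and the thinning step uses unweighted quantiles as a simplification; the package's own computation may differ.

```r
z <- c(3, 1, 2, 2, 5)  # hypothetical variable of interest
w <- c(1, 2, 1, 1, 1)  # hypothetical sampling weights

# Sorted unique values at which the CDF is evaluated.
sortvar <- sort(unique(z))

# Weighted empirical CDF: share of total weight with z at or below each value.
wcdf <- sapply(sortvar, function(s) sum(w[z <= s]) / sum(w))

# Optional thinning in the spirit of range_cb: evaluate on a quantile grid
# instead of on every unique value (unweighted quantiles, for simplicity).
grid <- sort(unique(quantile(z, probs = (1:99) / 100, names = FALSE)))
```

With many observations, the grid caps the number of evaluation points at the number of quantile indices, which is the memory saving that `range_cb` is described as providing.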

Examples

library(SortedEffects)
data("mortgage")
### Regression Specification
fm <- deny ~ black + p_irat + hse_inc + ccred + mcred + pubrec +
  ltv_med + ltv_high + denpmi + selfemp + single + hischl
### Specify characteristics of interest
t <- c("deny", "p_irat", "black", "hse_inc", "ccred", "mcred", "pubrec",
       "denpmi", "selfemp", "single", "hischl", "ltv_med", "ltv_high")
### issue ca command (b = 50 bootstrap draws keeps the example fast)
CA <- ca(fm = fm, data = mortgage, var = "black", method = "logit",
         cl = "diff", t = t, b = 50, bc = TRUE)

[Package SortedEffects version 1.7.0 Index]