cabootcrs {cabootcrs} | R Documentation |
Calculate category point variances using bootstrapping
Description
cabootcrs
performs simple or multiple correspondence analysis
and uses bootstrap resampling to
construct confidence ellipses for each appropriate category point,
printing and plotting the results;
for help on the package see cabootcrs-package
.
Usage
cabootcrs(
xobject = NULL,
datafile = NULL,
datasetname = NULL,
nboots = 999,
resampledistn = "Poisson",
multinomialtype = "whole",
printdims = 4,
lastaxis = 4,
maxrearrange = 6,
rearrangemethod = "lpassign",
usebootcrits = TRUE,
groupings = NULL,
grouplabels = NULL,
varnames = NULL,
plotsymbolscolours = c(19, "inferno", 18, "inferno"),
othersmonochrome = "grey",
crpercent = 95,
catype = "sca",
scainput = "CT",
mcainput = "nbyp",
mcatype = "Burt",
mcavariant = "mca",
mcasupplementary = "offdiag",
mcaadjustinertias = TRUE,
mcauseadjustinertiasum = FALSE,
mcaadjustcoords = TRUE,
mcaadjustmassctr = FALSE,
mcaoneploteach = TRUE,
mcashowindividuals = FALSE,
mcavariablecolours = FALSE,
mcacategorycolours = FALSE,
Jk = NULL,
varandcat = TRUE,
likertarrows = FALSE,
mcastoreindicator = TRUE,
mcaindividualboot = FALSE,
mcalikertnoise = 0.1,
poissonzeronewmean = 0,
newzeroreset = 0,
bootstdcoords = FALSE,
reflectonly = FALSE,
showresults = TRUE,
eps = 1e-15
)
Arguments
xobject |
Name of data object (data frame or similar class that can be coerced to data frame). |
datafile |
Name of a text file (in " ") containing the data, same defaults as xobject, ignored if xobject is non-null |
datasetname |
A string to use as the name of the data set in the plots, defaults to name of xobject or datafile |
nboots |
Number of boostrap replicate matrices used, default and recommended minimum is 999,
but 9999 is recommended if machine and data set size allows;
the calculated variances will sometimes differ
around the third decimal place, but the pictures should look the same. |
resampledistn |
Poisson resampling is the default for SCA,
|
multinomialtype |
Only relevant for multinomial sampling in SCA, otherwise ignored:
|
printdims |
Print full correspondence analysis coordinates, contributions, correlations etc for all output dimensions up to and including this one |
lastaxis |
Calculate variances and covariances for all output axes (dimensions) up to this one
(or the number of dimensions in the solution if smaller). |
maxrearrange |
The maximum number of axes to consider when rearranging |
rearrangemethod |
The method used to rearrange the axes:
Option is only included in case something weird goes wrong with lpSolve. |
usebootcrits |
To be passed to the plot routine, see |
groupings |
To be passed to the plot routine, see |
grouplabels |
To be passed to the plot routine, see |
varnames |
Character p-vector naming the variables, defaults to c("Rows","Columns") in sca |
plotsymbolscolours |
To be passed to the plot routine, see |
othersmonochrome |
To be passed to the plot routine, see |
crpercent |
To be passed to the plot routine, see |
catype |
Type of correspondence analysis:
|
scainput |
Format of input data, only applies for SCA:
|
mcainput |
Format of input data, only applies for MCA:
|
mcatype |
Format of data matrix analysed, only applies for MCA:
|
mcavariant |
Currently must be "mca", placeholder for future updates |
mcasupplementary |
How the sample points are projected as supplementary points onto the bootstrap axes when calculating the variances in MCA of a Burt matrix, see Details section for full explanation
If "offdiag" then when p=2 the variances will be very similar to those from SCA. |
mcaadjustinertias |
Whether to adjust inertias to allow for the meaningless inertia terms induced by the diagonal of the Burt matrix in MCA:
If TRUE then when p=2 the inertias will agree with those from SCA. |
mcauseadjustinertiasum |
How to define the total inertia in MCA, whether to just use the sum of the adjusted inertias:
If TRUE then when p=2 the inertias will agree with those from SCA |
mcaadjustcoords |
Whether to adjust the principal coordinates in MCA using the adjusted inertias above, as in Greenacre and Blasius, p68:
If TRUE then when p=2 the coordinates will agree with those from SCA. |
mcaadjustmassctr |
Whether to adjust the point masses and column contributions in MCA so that the masses and contributions are with respect to each variable (as in SCA) rather than with respect to all variables together:
If TRUE then when p=2 the CTR will agree with those from SCA, though when p>2 the contributions can be >1 |
mcaoneploteach |
Parameter passed to
|
mcashowindividuals |
Parameter passed to
|
mcavariablecolours |
Parameter passed to
|
mcacategorycolours |
Parameter passed to
|
Jk |
The number of classes for each variable in MCA, as a list or vector, which only needs specifying when inputting an indicator matrix, as in other cases it can be derived from the input matrix |
varandcat |
Flag for how to construct variable category names:
|
likertarrows |
Parameter passed to |
mcastoreindicator |
If TRUE then store the indicator matrix created for MCA |
mcaindividualboot |
If TRUE then use the experimental method to bootstrap an indicator or doubled matrix, see Details section part (2) for full explanation |
mcalikertnoise |
The "noise" value to use in the experimental method (above) to bootstrap an indicator or doubled matrix, see Details section part (2) for full explanation |
poissonzeronewmean |
Experimental method for SCA to deal with contingency tables where zero cells
could have been non-zero, i.e. they are not structural zeros. |
newzeroreset |
Experimental method for SCA to deal with sparse contingency tables. |
bootstdcoords |
If TRUE then produce bootstrap variances for points in standard coordinates
instead of principal coordinates |
reflectonly |
If TRUE then just allow for axis reflections and not axis reorderings |
showresults |
If TRUE then output the results using |
eps |
Any value less than this is treated as zero in some calculations and comparisons |
Details
This routine performs all of the usual Correspondence Analysis calculations while also using bootstrapping to estimate the variance of the difference between the sample and population point when both are projected onto the sample axes in principal coordinates. This is done for each row and column category on each dimension of the solution, allowing for sampling variation in both the points and the axes.
It hence constructs confidence ellipses for each category point, plots the results
by a call to plotca
and prints the usual Correspondence Analysis summary
output and the calculated standard deviations through a call to summaryca
.
Use printca
for more detailed numerical results.
For further examples and help on the package as a whole see cabootcrs-package
.
(1) Corrections for Burt diagonal
It is well-known that in multiple CA (MCA) the results are distorted by the diagonal elements of the Burt matrix. As well as the standard methods to correct for this, here we propose and implement a new method to correct for this when bootstrapping. If bootstrapping is applied in a naive way then, even when the standard corrections are used, the estimated variances will be much too small because diagonal elements of the standardised Burt matrix are the same in every bootstrap replicate, thus underestimating the true variation in the data.
All bootstrapping is performed on the indicator matrix (or equivalently the n by p matrix) and the resampled Burt matrix is then constructed from the resampled indicator matrix in the usual way.
Included here are the usual corrections to the inertias (mcaadjustinertias=TRUE, the default) and the coordinates (mcaadjustcoords=TRUE, the default). In addition you can choose to use, as the total inertia, either the sum of these adjusted inertias (mcauseadjustinertiasum=TRUE) as proposed by Benzecri or the average of the off-diagonal inertias (mcauseadjustinertiasum=FALSE, the default) as proposed by Greenacre. You can adjust (multiply by p) the Contribution figures in MCA so that they sum to p over all variables, i.e. an average of 1 for each variable as in SCA (mcaadjustmassctr=TRUE), rather than a total of 1 over all variables, as usually in MCA (mcaadjustmassctr=FALSE, the default). Note that when p=2 this will be the same as in SCA, but when p>3 you can get contributions greater than 1, so use with caution. This also adjusts (multiplies by p) the point masses so that they sum to 1 for each variable, rather than over all variables, same caveat applies.
The fundamental problem with MCA is that in a Burt matrix each diagonal element is the value of a variable category cross-classified with itself, so it is always equal to the number of times that category appears. Hence if a category appears k times then the row (or column) in the Burt matrix consists of p blocks each of which sum to k, so its row (and column) sum is kp.
Therefore when the row (or column) profile matrix is calculated the diagonal elements of the Burt matrix are always all 1/p while the elements for the categories in the offdiagonal blocks sum to 1/p in each block. Hence when the projected difference between the sample and resample row (or column) profile matrices is calculated this is artificially small because the diagonals of the two matrices are always the same, no matter how different the off-diagonal elements are.
The new method to correct for the diagonal elements of the Burt matrix when calculating variances therefore works by re-expressing the coordinates purely in terms of the interesting and variable off-diagonal elements, excluding the uninteresting and constant diagonal elements.
First calculate the Burt principal coordinates (PC), but with the diagonal elements of the profile matrix ignored (or set to zero). It is easy to verify that the usual Burt principal coordinates can be re-expressed, using the singular values (SV), as
Burt PC = ( Burt SV / (Burt SV - (1/p)) ) x Burt PC without diagonal element
Hence in the bootstrapping the sample and resample points are re-expressed in this way and their differences when projected onto the bootstrap axes are calculated as usual.
Similarly the adjusted principal coordinates are calculated as
adj Burt PC = (p/(p-1)) x ( (Burt SV - (1/p)) / Burt SV ) x Burt PC
So, when using the usual adjusted coordinates (and adjusted inertias), both of the above will be used, hence ending up with just a correction of (p/(p-1)). The unadjusted coordinates can be used, but this is not recommended.
One consequence of this correction is that when mcaadjustinertias=TRUE, mcaadjustcoords=TRUE, mcauseadjustinertiasum=TRUE, mcaadjustmassctr=TRUE and mcasupplementary="offdiag" then when p=2 the bootstrap variances for MCA are almost the same as those for SCA, while all other results for MCA are the same as those for SCA. The package author regards this as a good thing, making MCA more of a proper generalisation of MCA, but recognises that some people regard MCA as a fundamentally different method to SCA, linked only by the common algebra.
Note that if this adjustment is not made then in the p=2 case it is easy to see that the projected differences are half those in the SCA case and the standard deviations are a quarter the size.
The new method will be written up for publication once this update to the package is finished.
(2) Experimental method for bootstrapping indicator matrices
A highly experimental method is included for bootstrapping with an indicator or doubled matrix
in the case of ordered categorical (e.g. Likert scale) data.
This has not been studied or optimised extensively, and is currently very slow, so use is very
much at the user's discretion and at user's risk. To try it, choose:
mcatype="indicator" or mcatype="doubled" with nboots>0
If CRs are required for the individual points then also choose:
mcaindividualboot=TRUE
The bootstrap methodology used here relies on the comparison of bootstrap to sample points when both are projected onto bootstrap axes. In SCA this is fine because when looking at column points the axes are given by the rows and vice versa, and row i and column j each represent the same category in all bootstrap replicates. Similarly in MCA with a Burt matrix when looking at column points the axes are given by the rows, which again each represent the same category in all bootstrap replicates.
However, with an indicator or doubled matrix, row i of the matrix does not represent the same individual in each replicate, it just represents the i-th individual drawn in the resampling for that particular replicate. In order for this type of bootstrapping to work with an indicator or doubled matrix we would need a resampling method whereby the i-th row represents the i-th individual in all bootstrap replicates. The resampled row would need to represent the answers of the same individual "on another day". This might make sense with questionnaire data where an individual's answers have uncertainty attached, in that if asked the same questions on multiple occasions they would give different answers due to random variation rather than temporal change. If a believable model for the sampling, and hence the resampling, could be derived (CUB models perhaps) then this could lead to CRs for an individual point, representing the uncertainty in what that person actually thinks.
The method here uses the same idea of treating the sample row as representing an individual, and bootstrap replicates of that row as representing the variability in how that individual might have answered the questionnaire (or similar). However, instead of an explicit model to represent this variability, it is generated by the data themselves. The bootstrap replicate indicator (or doubled) matrix is generated as usual, by either non-parametric (multinomial) or balanced resampling, and then its rows are "matched" to the rows of the sample indicator (or doubled) matrix. The rows of the resampled matrix are reordered so that the rows of the resampled matrix are, overall, as similar as possible to the same rows of the sample matrix. Hence the resampled rows can reasonably be viewed as representing the same individual in each bootstrap replicate, and hence variances for the column points (categories) and row points (individuals) can be produced.
The matching again uses the Hungarian algorithm from lpSolve. This only makes sense if all variables are ordered categorical. Applying this to (ordered) categorical data results in a very large number of ties, so the mcalikertnoise parameter defines the standard deviation of white noise added to each of the sample and resample category numbers for the purpose of the matching (only).
Note that this is very slow for all but small data sets, and if all of the CRs for individuals are shown then the plot is impossibly busy. Hence it is recommended that this experimental method only be used for fairly small data sets where CRs are wanted for only a few example individuals. A better approach might be to average the CRs of all of the individuals who give the same results in the sample, but this is not implemented yet.
Note that supplementary row points in indicator matrix MCA are usually regarded as different people answering the same questions, whereas in this case for CRs to make any sense we need to regard them as the same people answering the same questions on different days.
(3) Critical values
Bootstrap critical values are calculated by re-using the bootstrap replicates used to calculate
the variances, with a critical value calculated for each ellipse.
The projected differences between bootstrap and sample points are ordered and the appropriate
percentile value picked. These are usually slightly larger than the \chi^2
critical values.
Only 90%, 95% (default) and 99% critical values are calculated.
Alternatively use \chi^2
critical values, usually with df=2, but with df=1 if only 2 non-zero cells in row/col.
The experimental method to construct ellipses for individuals in MCA always uses bootstrap critical values
Value
An object of class cabootcrsresults
See Also
cabootcrs-package
, cabootcrsresults
,
plotca
, printca
, summaryca
,
covmat
, allvarscovs
Examples
# Simple CA (SCA) of a 5 by 4 contingency table, using all SCA defaults:
# 999 bootstraps, Poisson resampling, variances for up to first four axes,
# usual output for up to the first 4 axes,
# one biplot with CRs for rows in principal coordinates and another with
# CRs for columns in principal coordinates
bd <- cabootcrs(DreamData)
## Not run:
# Same data set with a completely random three-category third variable added,
# analysed with MCA but with standardisations which mimic SCA as much as possible
bd3 <- cabootcrs(DreamData223by3, catype="mca")
Explicitly stating what the rows and columns represent, often needed for a contingency table
bd <- cabootcrs(DreamData, datasetname="Maxwell's dream data",
varnames=c("What the rows are","What the columns are"))
# Multiple CA (MCA) of 3 categorical variables with all defaults:
# non-parametric resampling, Burt matrix analysed,
# each variable has one plot with it in colour with CRs shown, other variables in monochrome.
# Same data set but now as 223 by 3 matrix, with random 3rd column (with 3 categories) added.
bd3 <- cabootcrs(DreamData223by3, catype="mca")
# Comparison of SCA to MCA with p=2, by converting contingency table to 223 by 2 matrix.
# Note that the coordinates and inertias etc are the same while the standard deviations
# and hence the ellipses are very similar but not identical.
bd <- cabootcrs(DreamData)
DreamData223by2 <- convert(DreamData,input="CT",output="nbyp")$result
bdmca <- cabootcrs(DreamData223by2, catype="mca", varandcat=FALSE)
# Not adjusting inertias, which means that coordinates will also not be adjusted and
# the bootstrapping will use the Burt diagonal.
# Note how the coordinates are larger but the inertias and ellipses are smaller.
bdmcaunadj <- cabootcrs(DreamData223by2, catype="mca", varandcat=FALSE, mcaadjustinertias=FALSE)
# Applying the standard adjustments to inertias and coordinates, but with
# the bootstrapping still using the Burt diagonal.
# Note how inertias and coordinates are now the same as SCA, but ellipses are smaller.
bdmcaadjbutall <- cabootcrs(DreamData223by2, catype="mca", varandcat=FALSE, mcasupplementary="all")
# Effect of sample size in SCA:
bdx4 <- cabootcrs(4*DreamData)
bdx9 <- cabootcrs(9*DreamData)
ba <- cabootcrs(AttachmentData)
bs <- cabootcrs(SuicideData)
bas <- cabootcrs(AsbestosData)
# Options for SCA:
# SCA with multinomial resampling, with the matrix treated as a single multinomial distribution
bdm <- cabootcrs(DreamData, resampledistn="multinomial")
# Fix the row sums, i.e. keep sum of age group constant
bdmrf <- cabootcrs(DreamData, resampledistn="multinomial", multinomialtype="rowsfixed")
# Use chi-squared critical values for the CRs
bdchisq <- cabootcrs(DreamData, usebootcrits=FALSE)
# Just perform correspondence analysis, without bootstrapping
bdnb0 <- cabootcrs(DreamData, nboots=0)
# Effect of sample size in MCA:
bn <- cabootcrs(NishData, catype="mca")
# Options for MCA
# Using default settings the SCA and MCA standard results are the same when p=2,
# bootstrap standard deviations (multinomial/nonparametric) are similar but not identical
bdsca <- cabootcrs(DreamData,resampledistn="multinomial")
bdmca <- cabootcrs(convert(DreamData,input="CT",output="nbyp")$result, catype="mca")
# Row A can be labelled A rather than R:A
# because the three variables have all different category names
bd3l <- cabootcrs(DreamData223by3, catype="mca", varandcat=FALSE)
# Balanced resampling, each of the 223 rows occurs 999 times in the 999 resamples
bd3b <- cabootcrs(DreamData223by3, catype="mca", resampledistn="balanced")
# Do not adjust inertias, coordinates or contributions
# (if inertias are not adjusted then coordinates are also not adjusted)
bd3unadj <- cabootcrs(DreamData223by3, catype="mca",mcaadjustinertias=FALSE)
## Comparisons to ellipses from FactoMineR
# Generate some completely random uniform categorical data, construct ellipses.
# The cabootcrs ellipses are very large and overlap extensively, as you would expect
# from completely random data.
# The FactoMineR ellipses are much smaller, often with minimal overlaps,
# giving a completely false impression of genuine differences between categories.
library(FactoMineR)
p <- 4
maxcat <- 5
n <- 100
Xnpr <- apply( as.data.frame( matrix( round(runif(n*p,0.5,maxcat+0.5)), n, p)), 2, factor )
fr <- MCA(Xnpr, method="Burt")
plotellipses(fr)
br <- cabootcrs(Xnpr, catype="mca", showresults=FALSE)
plotca(br, mcacategorycolours = TRUE, showcolumnlabels=FALSE)
## Comparisons to results in ca and FactoMineR
Summary: If using unadjusted inertias, coordinates the packages produce identical results,
apart from differences in presentation (rounding off, the naming of rep/cor/cos2).
Summary: When using adjusted inertias and coordinates (not an option in FactoMineR::MCA)
the correlations in ca::mjca no longer sum to 1 over all dimensions, in cabootcrs they do.
Ratios are the same for each dimension, but not each point, they are standardised differently.
# Example comparisons with random data
library(FactoMineR)
library(ca)
p <- 4
maxcat <- 5
n <- 100
Xnpdf <- as.data.frame( matrix( round(runif(n*p,0.5,maxcat+0.5)), n, p))
Xnpr <- apply( Xnpdf, 2, factor )
# Note that ca::mjca only accepts the data as numerical,
# FactoMineR::MCA only acccepts the data as characters
rbun <- cabootcrs(Xnpr, catype="mca", nboots=0, mcaadjustinertias = FALSE)
rcun <- mjca(Xnpdf,lambda="Burt")
summary(rcun)
rfm <- MCA(Xnpr,method="Burt", graph=FALSE)
summary(rfm)
rb <- cabootcrs(Xnpr, catype="mca", nboots=0)
rc <- mjca(Xnpdf)
summary(rc)
realr <- rb@br@realr
rb@ColREP[,1:realr]
rc$colcor[,1:realr]
apply(rb@ColREP[,1:realr],1,"sum")
apply(rc$colcor[,1:realr],1,"sum")
## End(Not run)