R: Bivariate Correlation

cor.sdf {EdSurvey}

R Documentation

Bivariate Correlation

Description

Computes the correlation of two variables on an edsurvey.data.frame, a light.edsurvey.data.frame, or an edsurvey.data.frame.list. The correlation accounts for plausible values and the survey design.

Usage

cor.sdf(
  x,
  y,
  data,
  method = c("Pearson", "Spearman", "Polychoric", "Polyserial"),
  weightVar = "default",
  reorder = NULL,
  dropOmittedLevels = TRUE,
  defaultConditions = TRUE,
  recode = NULL,
  condenseLevels = TRUE,
  fisherZ = if (match.arg(method) %in% "Pearson") {
    TRUE
  } else {
    FALSE
  },
  jrrIMax = Inf,
  verbose = TRUE,
  omittedLevels = deprecated()
)

Arguments

`x`	a character variable name from the `data` to be correlated with `y`
`y`	a character variable name from the `data` to be correlated with `x`
`data`	an `edsurvey.data.frame`, a `light.edsurvey.data.frame`, or an `edsurvey.data.frame.list`
`method`	a character string indicating which correlation coefficient is to be computed. One of `Pearson` (default), `Spearman`, `Polychoric`, or `Polyserial`. For `Polyserial`, the continuous argument must be `x`.
`weightVar`	character indicating the weight variable to use. See Details section in `lm.sdf`.
`reorder`	a list of variables to reorder. Defaults to `NULL` (no variables are reordered). Can be set as `reorder = list(var1 = c("a","b","c"), var2 = c("4", "3", "2", "1"))`. See Examples.
`dropOmittedLevels`	a logical value. When set to the default value of `TRUE`, drops those levels of all factor variables that are specified in an `edsurvey.data.frame`. Use `print` on an `edsurvey.data.frame` to see the omitted levels.
`defaultConditions`	a logical value. When set to the default value of `TRUE`, uses the default conditions stored in an `edsurvey.data.frame` to subset the data. Use `print` on an `edsurvey.data.frame` to see the default conditions.
`recode`	a list of lists to recode variables. Defaults to `NULL`. Can be set as `recode = list(var1 = list(from = c("a","b","c"), to = "d"))`. See Examples.
`condenseLevels`	a logical value. When set to the default value of `TRUE` and either `x` or `y` is a categorical variable, the function will drop all unused levels and rank the levels of the variable before calculating the correlation. When set to `FALSE`, the numeric levels of the variable remain the same as in the codebook. See Examples.
`fisherZ`	for standard error and mean calculations, set to `TRUE` to use the Fisher Z-transformation (see Details), or `FALSE` to use no transformation of the data. The `fisherZ` argument defaults to the Fisher Z-transformation for Pearson and to no transformation for other correlation types.
`jrrIMax`	a numeric value; when using the jackknife variance estimation method, the default estimation option, `jrrIMax=Inf`, uses the sampling variance from all plausible values as the component for sampling variance estimation. The `Vjrr` term (see Statistical Methods Used in EdSurvey) can be estimated with any number of plausible values, and values larger than the number of plausible values on the survey (including `Inf`) will result in all plausible values being used. Higher values of `jrrIMax` lead to longer computing times and more accurate variance estimates.
`verbose`	a logical value. Set to `FALSE` to avoid messages about variable conversion.
`omittedLevels`	this argument is deprecated. Use `dropOmittedLevels`.
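
As a rough illustration of what `condenseLevels` changes, the base R sketch below uses a made-up categorical variable whose codebook values have a gap. The vectors and the `match`-based ranking are assumptions chosen for illustration; they are not taken from EdSurvey's internals.

```r
# Hypothetical codebook values for a categorical variable: the value 2 is
# never observed, and the top category is coded 8
codes <- c(1, 3, 3, 8, 1, 8, 3, 1)
score <- c(10, 14, 15, 16, 9, 18, 13, 11)   # hypothetical continuous variable

# Condensed version: drop unused values and rank the remaining ones (1, 2, 3)
condensed <- match(codes, sort(unique(codes)))

cor_original  <- cor(codes, score)      # uses the codebook spacing 1, 3, 8
cor_condensed <- cor(condensed, score)  # uses equally spaced ranks 1, 2, 3

print(c(original = cor_original, condensed = cor_condensed))
```

The Pearson correlation generally differs between the two codings because the spacing between categories changes, whereas a rank-based method such as Spearman is unaffected by this kind of order-preserving recoding.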

Details

The getData arguments and recode.sdf may be useful (see Examples). The correlation methods are calculated as described in the documentation for the wCorr package; see browseVignettes(package="wCorr").
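
For orientation, a weighted Pearson correlation can be sketched in base R as follows. This is a minimal illustration of the standard weighted formula on made-up data, not the wCorr or EdSurvey implementation, and it ignores plausible values and the survey design; `weighted_pearson` is a hypothetical helper name.

```r
# Hypothetical helper: weighted Pearson correlation of x and y with weights w
weighted_pearson <- function(x, y, w) {
  xbar <- sum(w * x) / sum(w)   # weighted means
  ybar <- sum(w * y) / sum(w)
  num <- sum(w * (x - xbar) * (y - ybar))
  den <- sqrt(sum(w * (x - xbar)^2) * sum(w * (y - ybar)^2))
  num / den
}

# Made-up data and weights
x <- c(2, 4, 5, 7, 9, 10)
y <- c(1, 3, 2, 6, 8, 7)
w <- c(1.5, 0.8, 1.2, 2.0, 0.5, 1.0)

r_w <- weighted_pearson(x, y, w)

# Cross-check against base R's weighted covariance routine
r_check <- cov.wt(cbind(x, y), wt = w, cor = TRUE)$cor[1, 2]
print(c(sketch = r_w, cov.wt = unname(r_check)))
```

With all weights equal, the helper reduces to the ordinary `cor(x, y)`.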

When method is set to Polyserial, x is assumed to be continuous and y is assumed to be discrete. Choose the variables accordingly, because a mismatch may result in calculations taking a very long time to complete.

The Fisher Z-transformation is both a variance stabilizing and normalizing transformation for the Pearson correlation coefficient (Fisher, 1915). The transformation takes the inverse hyperbolic tangent of the correlation coefficients and then calculates all variances and confidence intervals. These are then transformed back to the correlation space (values between -1 and 1, inclusive) using the hyperbolic tangent function. The Taylor series approximation (or delta method) is applied for the standard errors.
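
The round trip described above can be sketched in base R. The values of `r` and `se_z` are made up, and the normal critical value is an assumption made for simplicity; `cor.sdf` may use a different reference distribution.

```r
# Made-up inputs, not output from cor.sdf
r    <- 0.45   # Pearson correlation estimate
se_z <- 0.03   # standard error in the transformed (atanh) space

# Fisher Z-transformation: inverse hyperbolic tangent of the correlation
z <- atanh(r)

# Confidence interval in the transformed space (normal critical value assumed)
z_ci <- z + c(-1, 1) * qnorm(0.975) * se_z

# Back-transform to the correlation space with the hyperbolic tangent
r_ci <- tanh(z_ci)

# Delta method: d tanh(z) / dz = 1 - tanh(z)^2 = 1 - r^2
se_r <- se_z * (1 - r^2)

print(list(z = z, z_ci = z_ci, r_ci = r_ci, se_r = se_r))
```

Because the interval is built in the transformed space, the back-transformed limits always fall strictly between -1 and 1 and are generally not symmetric around `r`.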

Value

An edsurvey.cor object that has print and summary methods.

The class includes the following elements:

`correlation`	numeric estimated correlation coefficient
`Zse`	standard error of the correlation, based on the total variance `Vimp` + `Vjrr`. In the case of Pearson, this is calculated in the linear atanh space and is not a standard error in the usual sense.
`correlates`	a vector of length two showing the columns for which the correlation coefficient was calculated
`variables`	`correlates` that are discrete
`order`	a list that shows the order of each variable
`method`	the type of correlation estimated
`Vjrr`	the jackknife component of the variance estimate. For Pearson, in the atanh space.
`Vimp`	the imputation component of the variance estimate. For Pearson, in the atanh space.
`weight`	the weight variable used
`npv`	the number of plausible values used
`njk`	the number of jackknife replicates used
`n0`	the original number of observations
`nUsed`	the number of observations used in the analysis, after any conditions and any listwise deletion of missing values are applied
`se`	the standard error of the correlation, in the correlation ([-1,1]) space
`ZconfidenceInterval`	the confidence interval of the correlation in the transformation space
`confidenceInterval`	the confidence interval of the correlation in the correlation ([-1,1]) space
`transformation`	the name of the transformation used when calculating standard errors
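
To make the relationship between `Vjrr`, `Vimp`, and the standard error concrete, here is a hedged sketch of combining estimates across plausible values with the usual multiple-imputation formula. All numbers are invented, the single `vjrr` value stands in for a jackknife result computed elsewhere, and the exact formulas used by `cor.sdf` are those in Statistical Methods Used in EdSurvey.

```r
# Invented per-plausible-value estimates in the atanh space (M = 5)
z_pv <- c(0.48, 0.50, 0.47, 0.51, 0.49)
M <- length(z_pv)

# Invented jackknife (sampling) variance, assumed to be computed elsewhere
vjrr <- 0.0009

# Point estimate: mean across plausible values
z_hat <- mean(z_pv)

# Imputation variance: between-plausible-value variance times (M + 1) / M
vimp <- (M + 1) / M * var(z_pv)

# Total variance and the standard error derived from it
v_total <- vjrr + vimp
zse <- sqrt(v_total)

print(c(estimate = z_hat, Vjrr = vjrr, Vimp = vimp, Zse = zse))
```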

Author(s)

Paul Bailey; relies heavily on the wCorr package, written by Ahmad Emad and Paul Bailey

References

Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4), 507–521.

Examples

## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# for two categorical variables any of the following work
c1_pears <- cor.sdf(x="b017451", y="b003501", data=sdf, method="Pearson",
                    weightVar="origwt")
c1_spear <- cor.sdf(x="b017451", y="b003501", data=sdf, method="Spearman",
                    weightVar="origwt")
c1_polyc <- cor.sdf(x="b017451", y="b003501", data=sdf, method="Polychoric",
                    weightVar="origwt")

c1_pears
c1_spear
c1_polyc

# for categorical variables, users can either keep the original numeric levels of the variables
# or condense the levels (default)
# the following call condenses the levels of the variable 'c046501'
cor.sdf(x="c046501", y="c044006", data=sdf)

# the following call keeps the original levels of the variable 'c046501'
cor.sdf(x="c046501", y="c044006", data=sdf, condenseLevels = FALSE)

# these take a while to calculate for large datasets, so limit to a subset
sdf_dnf <- subset(sdf, b003601 == 1)

# for a categorical variable and a scale score any of the following work
c2_pears <- cor.sdf(x="composite", y="b017451", data=sdf_dnf, method="Pearson",
                    weightVar="origwt")
c2_spear <- cor.sdf(x="composite", y="b017451", data=sdf_dnf, method="Spearman",
                    weightVar="origwt")
c2_polys <- cor.sdf(x="composite", y="b017451", data=sdf_dnf, method="Polyserial",
                    weightVar="origwt")

c2_pears
c2_spear
c2_polys

# recode two variables
cor.sdf(x="c046501", y="c044006", data=sdf, method="Spearman", weightVar="origwt",
        recode=list(c046501=list(from="0%",to="None"),
                    c046501=list(from=c("1-5%", "6-10%", "11-25%", "26-50%",
                                        "51-75%", "76-90%", "Over 90%"),
                                 to="Between 0% and 100%"),
                    c044006=list(from=c("1-5%", "6-10%", "11-25%", "26-50%",
                                        "51-75%", "76-90%", "Over 90%"),
                                 to="Between 0% and 100%")))

# reorder two variables
cor.sdf(x="b017451", y="sdracem", data=sdf, method="Spearman", weightVar="origwt", 
        reorder=list(sdracem=c("White", "Hispanic", "Black", "Asian/Pacific Island",
                               "Amer Ind/Alaska Natv", "Other"),
                     b017451=c("Every day", "2 or 3 times a week", "About once a week",
                               "Once every few weeks", "Never or hardly ever")))

# recode two variables and reorder
cor.sdf(x="pared", y="b013801", data=subset(sdf, !pared %in% "I Don't Know"),
        method="Spearman", weightVar = "origwt",
        recode=list(pared=list(from="Some ed after H.S.", to="Graduated H.S."), 
                    pared=list(from="Graduated college", to="Graduated H.S."),
                    b013801=list(from="0-10", to="Less than 100"), 
                    b013801=list(from="11-25", to="Less than 100"),
                    b013801=list(from="26-100", to="Less than 100")),
        reorder=list(b013801=c("Less than 100", ">100")))

## End(Not run)

[Package EdSurvey version 4.0.7 Index]