scale_by {standardize}    R Documentation

Center and scale a continuous variable conditioning on factors.

Description

scale_by centers and scales a numeric variable within each level of a factor (or the interaction of several factors).

Usage

scale_by(object = NULL, data = NULL, scale = 1)

Arguments

object

A formula whose left hand side indicates a numeric variable to be scaled and whose right hand side indicates factors to condition this scaling on; or the result of a previous call to scale_by or the pred attribute of a previous call. See 'Details'.

data

A data.frame containing the numeric variable to be scaled and the factors to condition on.

scale

Numeric (default 1). The desired standard deviation of the numeric variable within each factor level. If the numeric variable is a matrix, then scale must have either one element (used for all columns) or as many elements as there are columns in the numeric variable. To center the numeric variable without scaling, set scale to 0. See 'Details'.

Details

First, the behavior when object is a formula and scale = 1 is described. The left hand side of the formula must indicate a numeric variable to be scaled. The full interaction of the variables on the right hand side of the formula is taken as the factor to condition scaling on (i.e. it doesn't matter whether they are separated with +, :, or *). For the remainder of this section, the numeric variable will be referred to as x and the full factor interaction term will be referred to as facs.

If facs has more than one element, a new factor is created as their full interaction term. When a factor has NA values, NA is treated as a level. For each level of the factor which has at least two unique non-NA x values, the mean of x is recorded as the level's center and the standard deviation of x is recorded as the level's scale. The mean of these centers is recorded as new_center and the mean of these scales is recorded as new_scale; new_center and new_scale are used as the center and scale for factor levels with fewer than two unique non-NA x values. Then, for each level of the factor, the level's center is subtracted from its x values, and the result is divided by the level's scale.

The result is that any level with at least two unique non-NA x values now has mean 0 and standard deviation 1, and levels with fewer than two are placed on a similar scale (though their standard deviation is undefined). Note that the overall standard deviation of the resulting variable (or standard deviations if x is a matrix) will not be exactly 1 (but will be close). The variable is interpreted as how far an observation is from its level's average value of x, measured in within-level standard deviations.
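
The procedure described above can be sketched in base R for a single numeric vector and a single factor. This is only an illustration of the description, not the package's source code; scale_by_sketch is a hypothetical helper name, and matrix input and multi-factor interactions are left out.

```r
# Base-R sketch of the conditional scaling described above.
# scale_by_sketch is a hypothetical name, not part of the standardize package.
scale_by_sketch <- function(x, fac) {
  # Treat NA as its own level, as described in 'Details'
  fac <- addNA(factor(fac), ifany = TRUE)
  idx <- as.integer(fac)
  groups <- split(x, idx)  # one element per level, in level order

  # Per-level mean, standard deviation, and count of unique non-NA values
  centers <- vapply(groups, mean, numeric(1), na.rm = TRUE)
  scales <- vapply(groups, sd, numeric(1), na.rm = TRUE)
  n_unique <- vapply(groups, function(v) length(unique(v[!is.na(v)])),
    integer(1))

  # Levels with fewer than two unique values fall back to the mean of
  # the other levels' centers and scales
  ok <- n_unique >= 2
  centers[!ok] <- mean(centers[ok])
  scales[!ok] <- mean(scales[ok])

  unname((x - centers[idx]) / scales[idx])
}

x <- c(1, 2, 3, 10, 20, 30, 40, 7)
f <- c("a", "a", "a", "b", "b", "b", "b", "c")
z <- scale_by_sketch(x, f)

tapply(z, f, mean)  # levels a and b are centered at 0
tapply(z, f, sd)    # levels a and b have sd 1; sd is NA for level c
```

Level c has a single observation, so in this sketch it is centered and scaled with the averaged values rather than its own.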

If scale = 0, then only centering (but not scaling) is performed. If scale is neither 0 nor 1, then x is scaled such that the within-level standard deviation is scale. Note that this differs from the scale argument of the base R function scale, which specifies the number the centered variable is divided by (the inverse of its use here). If x is a matrix with more than one column, then scale must be either a vector with one element for each column of x or a single number which is used for all columns. If any element of scale is 0, then all elements are treated as 0. No element of scale can be negative.
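
The contrast with the base R function scale can be illustrated with a single group of values. The sketch below uses only base R; target_sd is a hypothetical name standing in for the value that would be passed as scale to scale_by.

```r
x <- c(2, 4, 6, 8, 10)
target_sd <- 0.5  # hypothetical: the desired standard deviation

# In base scale(), the `scale` argument is the divisor, so reaching a
# standard deviation of target_sd requires dividing by sd(x) / target_sd
z <- as.vector(scale(x, center = TRUE, scale = sd(x) / target_sd))
sd(z)  # equal to target_sd up to floating point error

# Centering without scaling, analogous to scale = 0 as described above
z0 <- as.vector(scale(x, center = TRUE, scale = FALSE))
mean(z0)
```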

If object is not a formula, it must be a numeric variable which resulted from a previous scale_by call, or the pred attribute of such a numeric variable. In this case, scale is ignored, and x in data is scaled using the formula, centers and scales in object (with new levels treated using new_center and new_scale).
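
How stored centers and scales might be applied to new data, including a level not seen before, can be sketched in base R. The list pred_sketch and the helper apply_pred_sketch are hypothetical and do not reflect the actual structure of the pred attribute; the numbers are made up for illustration.

```r
# Hypothetical stand-in for stored scaling information
pred_sketch <- list(
  centers = c(a = 1, b = 2, c = 3),
  scales = c(a = 0.5, b = 1.5, c = 3),
  new_center = 2,     # mean of the centers
  new_scale = 5 / 3   # mean of the scales
)

apply_pred_sketch <- function(x, fac, pred) {
  fac <- as.character(fac)
  known <- fac %in% names(pred$centers)
  # Known levels use their own center and scale; new levels use the fallbacks
  center <- ifelse(known, pred$centers[fac], pred$new_center)
  scl <- ifelse(known, pred$scales[fac], pred$new_scale)
  unname((x - center) / scl)
}

# Level "d" is not in pred_sketch, so new_center and new_scale are used
z_new <- apply_pred_sketch(rep(1, 4), c("a", "b", "c", "d"), pred_sketch)
z_new
```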

Value

A numeric variable which is conditionally scaled within each level of the conditioning factor(s), with within-level standard deviation equal to scale. It has an additional class scaledby, as well as an attribute pred with class scaledby_pred, which is a list containing the formula, the centers and scales for known factor levels, and the center and scale to be applied to new factor levels. The variable returned can be used as the object argument in future calls to scale_by, as can its pred attribute.

Author(s)

Christopher D. Eager <eager.stats@gmail.com>

See Also

scale.

Examples

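# Three groups of different sizes, with different means and
# standard deviations
dat <- data.frame(
  f1 = rep(c("a", "b", "c"), c(5, 10, 20)),
  x1 = rnorm(35, rep(c(1, 2, 3), c(5, 10, 20)),
    rep(c(.5, 1.5, 3), c(5, 10, 20))))

# Overall scaling versus scaling within each level of f1
dat$x1_scaled <- scale(dat$x1)
dat$x1_scaled_by_f1 <- scale_by(x1 ~ f1, dat)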

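# Raw variable: overall and within-level mean and sd
mean(dat$x1)
sd(dat$x1)
with(dat, tapply(x1, f1, mean))
with(dat, tapply(x1, f1, sd))

# Overall scaling: mean 0 and sd 1 overall, but not within levels
mean(dat$x1_scaled)
sd(dat$x1_scaled)
with(dat, tapply(x1_scaled, f1, mean))
with(dat, tapply(x1_scaled, f1, sd))

# Conditional scaling: mean 0 and sd 1 within each level
mean(dat$x1_scaled_by_f1)
sd(dat$x1_scaled_by_f1)
with(dat, tapply(x1_scaled_by_f1, f1, mean))
with(dat, tapply(x1_scaled_by_f1, f1, sd))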

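# New data containing a level ("d") not present in dat
newdata <- data.frame(
  f1 = c("a", "b", "c", "d"),
  x1 = rep(1, 4))

# Reuse the centers and scales from the previous call; the new level
# is scaled using new_center and new_scale
newdata$x1_pred_scaledby <- scale_by(dat$x1_scaled_by_f1, newdata)

newdata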

[Package standardize version 0.2.2 Index]