scale_by {standardize} | R Documentation |
Center and scale a continuous variable conditioning on factors.
Description
scale_by centers and scales a numeric variable within each level
of a factor (or the interaction of several factors).
Usage
scale_by(object = NULL, data = NULL, scale = 1)
Arguments
object |
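A formula whose left hand side is the numeric variable to be scaled and whose right hand side gives the factors to condition on; or a numeric variable returned by a previous scale_by call, or the pred attribute of such a variable (see Details). |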
data |
A data.frame containing the numeric variable to be scaled and the factors to condition on. |
scale |
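Numeric (default 1). The desired standard deviation for the
numeric variable within each factor level. If the numeric variable is a
matrix, then scale must be either a single number or a vector with one
element for each column (see Details). |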
Details
First, the behavior when object is a formula and scale = 1 is described.
The left hand side of the formula must indicate a numeric variable
to be scaled. The full interaction of the variables on the right hand side
of the formula is taken as the factor to condition scaling on (i.e.
it doesn't matter whether they are separated with +, :, or *). For the
remainder of this section, the numeric variable will be referred to as x
and the full factor interaction term will be referred to as facs.
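As a rough illustration of why the separator does not matter, the base R sketch below extracts the right hand side variables from three differently written formulas and builds their full interaction. The helper rhs_vars and the toy data frame d are hypothetical names introduced here; this is not the package's internal code.

```r
# Three formulas that differ only in how the right hand side is written.
f_plus  <- x ~ f1 + f2
f_colon <- x ~ f1:f2
f_star  <- x ~ f1 * f2

# all.vars() lists variable names in order of appearance; dropping the
# first one (the response) leaves the conditioning factors.
# rhs_vars is a hypothetical helper for this sketch.
rhs_vars <- function(f) all.vars(f)[-1]

rhs_vars(f_plus)
rhs_vars(f_colon)
rhs_vars(f_star)

# The full interaction of those factors defines the conditioning levels.
d <- data.frame(f1 = c("a", "a", "b"), f2 = c("u", "v", "u"))
facs <- interaction(d[rhs_vars(f_star)], drop = TRUE)
levels(facs)
```

All three formulas yield the same pair of factor names, so the same interaction factor would be built from each.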
First, if facs has more than one element, then a new factor is
created as their full interaction term. When a factor has NA values,
NA is treated as a level. For each level of the factor which has
at least two unique non-NA x values, the mean of x is recorded as the
level's center and the standard deviation of x is recorded as the level's
scale. The mean of these centers is recorded as new_center and the mean of
these scales is recorded as new_scale, and new_center and new_scale are
used as the center and scale for factor levels with fewer than two unique
non-NA x values. Then for each level of the factor, the level's center is
subtracted from its x values, and the result is divided by the level's
scale.
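The procedure above can be sketched in base R as follows. This is a minimal reading of the description, not the package's implementation; scale_within is a hypothetical helper and the data are made up for illustration.

```r
# Hypothetical helper: a base R sketch of the procedure described above.
scale_within <- function(x, fac) {
  # Treat NA as its own level when present.
  fac <- addNA(factor(fac), ifany = TRUE)
  groups <- split(x, fac)

  # A level is usable if it has at least two unique non-NA x values.
  usable <- vapply(groups, function(g) length(unique(g[!is.na(g)])) >= 2,
                   logical(1))

  centers <- vapply(groups, mean, numeric(1), na.rm = TRUE)
  scales  <- vapply(groups, sd, numeric(1), na.rm = TRUE)

  # Fallback values: the mean of the usable levels' centers and scales.
  new_center <- mean(centers[usable])
  new_scale  <- mean(scales[usable])
  centers[!usable] <- new_center
  scales[!usable]  <- new_scale

  # Subtract each observation's level center, divide by its level scale.
  idx <- as.integer(fac)
  (x - centers[idx]) / scales[idx]
}

# Levels "a" and "b" have several values; level "c" has only one.
x   <- c(1, 2, 3, 10, 20, 30, 40, 5)
fac <- c("a", "a", "a", "b", "b", "b", "b", "c")
z   <- scale_within(x, fac)
z
```

Here the single observation in level "c" is centered and scaled with the averages of the centers and scales from levels "a" and "b".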
The result is that any level with at least two unique non-NA x values
now has mean 0 and standard deviation 1, and levels with fewer than two
are placed on a similar scale (though their standard deviation is
undefined). Note that the overall standard deviation of the resulting
variable (or standard deviations if x is a matrix) will not be exactly 1
(but will be close). The interpretation of the variable is how far an
observation is from its level's average value for x in terms of
within-level standard deviations.
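To see why the overall standard deviation is close to, but not exactly, 1, the base R sketch below standardizes simulated data within each level using ave(). It assumes every level has at least two distinct values; the variable names are illustrative and the package itself is not called.

```r
set.seed(1)
f <- rep(c("a", "b", "c"), c(5, 10, 20))
x <- rnorm(35, mean = rep(c(1, 2, 3), c(5, 10, 20)))

# Standardize within each level; ave() applies FUN to each group.
z <- ave(x, f, FUN = function(g) (g - mean(g)) / sd(g))

# Each level now has standard deviation 1.
tapply(z, f, sd)

# A level with n observations contributes (n - 1) to the total sum of
# squares, so with N observations in G levels the overall variance is
# (N - G) / (N - 1).
overall_sd  <- sd(z)
expected_sd <- sqrt((35 - 3) / (35 - 1))
c(overall = overall_sd, expected = expected_sd)
```

With many observations per level the ratio (N - G) / (N - 1) approaches 1, which is why the overall value stays close to 1.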
If scale = 0, then only centering (but not scaling) is performed.
If scale is neither 0 nor 1, then x is scaled such that the within-level
standard deviation is scale. Note that this differs from the scale
argument of the base R function scale, which specifies the number the
centered variable is divided by (the inverse of its use here). If x is a
matrix with more than one column, then scale must be either a vector with
one element for each column of x or a single number, which is used for all
columns. If any element of scale is 0, then all elements are treated as 0.
No element of scale can be negative.
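The contrast with base R's scale function can be illustrated on a single group of values. The sketch below assumes that reaching a target standard deviation amounts to multiplying the standardized values by that target, which is one reading of the description above rather than the package's code; target_sd is an illustrative name.

```r
x <- c(2, 4, 6, 8, 10)

# base::scale(): scale = 2 divides the centered values by 2,
# so the result has standard deviation sd(x) / 2.
base_scaled <- as.numeric(scale(x, center = TRUE, scale = 2))
sd(base_scaled)

# Convention described above: the result should have standard deviation 2,
# so the standardized values are multiplied by the target.
target_sd <- 2
rescaled <- (x - mean(x)) / sd(x) * target_sd
sd(rescaled)
```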
If object is not a formula, it must be a numeric variable which
resulted from a previous scale_by call, or the pred attribute
of such a numeric variable. In this case, scale is ignored, and x in data
is scaled using the formula, centers and scales in object
(with new levels treated using new_center and new_scale).
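The handling of new levels can be sketched in base R with a hand-written parameter list. The list below only mimics the idea of stored centers and scales; its element names and values are assumptions made for illustration, and apply_pred is a hypothetical helper, not part of the package.

```r
# Hypothetical stored parameters; names and values are illustrative only.
pred <- list(
  centers = c(a = 1, b = 2, c = 3),
  scales = c(a = 0.5, b = 1.5, c = 3),
  new_center = 2,
  new_scale = 5 / 3
)

# Hypothetical helper: apply stored parameters to new observations.
apply_pred <- function(x, fac, pred) {
  fac <- as.character(fac)
  known <- fac %in% names(pred$centers)
  # Start from the fallback values, then overwrite for known levels.
  center <- rep(pred$new_center, length(x))
  scale <- rep(pred$new_scale, length(x))
  center[known] <- pred$centers[fac[known]]
  scale[known] <- pred$scales[fac[known]]
  (x - center) / scale
}

# Level "d" was not seen before, so it uses new_center and new_scale.
newdata <- data.frame(f1 = c("a", "b", "c", "d"), x1 = rep(1, 4))
out <- apply_pred(newdata$x1, newdata$f1, pred)
out
```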
Value
A numeric variable which is conditionally scaled within each level
of the conditioning factor(s), with standard deviation scale. It has
an additional class scaledby, as well as an attribute pred with class
scaledby_pred, which is a list containing the formula, the centers and
scales for known factor levels, and the center and scale to be applied to
new factor levels. The variable returned can be used as the object
argument in future calls to scale_by, as can its pred attribute.
Author(s)
Christopher D. Eager <eager.stats@gmail.com>
See Also
Examples
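# Simulated data: three factor levels with different means and spreads.
dat <- data.frame(
  f1 = rep(c("a", "b", "c"), c(5, 10, 20)),
  x1 = rnorm(35, rep(c(1, 2, 3), c(5, 10, 20)),
    rep(c(.5, 1.5, 3), c(5, 10, 20))))

# Scale over the whole variable, then within each level of f1.
dat$x1_scaled <- scale(dat$x1)
dat$x1_scaled_by_f1 <- scale_by(x1 ~ f1, dat)

# Raw variable: overall and per-level mean and standard deviation.
mean(dat$x1)
sd(dat$x1)
with(dat, tapply(x1, f1, mean))
with(dat, tapply(x1, f1, sd))

# Globally scaled variable: overall mean 0 and sd 1, but not per level.
mean(dat$x1_scaled)
sd(dat$x1_scaled)
with(dat, tapply(x1_scaled, f1, mean))
with(dat, tapply(x1_scaled, f1, sd))

# Variable scaled by f1: mean 0 and sd 1 within each level.
mean(dat$x1_scaled_by_f1)
sd(dat$x1_scaled_by_f1)
with(dat, tapply(x1_scaled_by_f1, f1, mean))
with(dat, tapply(x1_scaled_by_f1, f1, sd))

# Apply the stored centers and scales to new data; level "d" is new.
newdata <- data.frame(
  f1 = c("a", "b", "c", "d"),
  x1 = rep(1, 4))
newdata$x1_pred_scaledby <- scale_by(dat$x1_scaled_by_f1, newdata)
newdata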