categorize {datawizard} | R Documentation
Recode (or "cut" / "bin") data into groups of values.
Description
This function divides the range of variables into intervals and recodes
the values inside these intervals according to their related interval.
It is basically a wrapper around base R's cut(), providing a simplified
and more accessible way to define the interval breaks (cut-off values).
Usage
categorize(x, ...)
## S3 method for class 'numeric'
categorize(
x,
split = "median",
n_groups = NULL,
range = NULL,
lowest = 1,
labels = NULL,
verbose = TRUE,
...
)
## S3 method for class 'data.frame'
categorize(
x,
select = NULL,
exclude = NULL,
split = "median",
n_groups = NULL,
range = NULL,
lowest = 1,
labels = NULL,
append = FALSE,
ignore_case = FALSE,
regex = FALSE,
verbose = TRUE,
...
)
Arguments
x |
A (grouped) data frame, numeric vector or factor. |
... |
not used. |
split |
Character vector, indicating at which breaks to split variables,
or numeric values with values indicating breaks. If character, may be one
of |
n_groups |
If |
range |
If |
lowest |
Minimum value of the recoded variable(s). If |
labels |
Character vector of value labels. If not |
verbose |
Toggle warnings. |
select |
Variables that will be included when performing the required tasks. Can be either
If |
exclude |
See |
append |
Logical or string. If |
ignore_case |
Logical, if |
regex |
Logical, if |
Value
x, recoded into groups. By default, x is numeric, unless labels
is specified. In that case, a factor is returned, where the factor levels
(i.e. the recoded groups) are labelled accordingly.
Splits and breaks (cut-off values)
Breaks are in general exclusive, which means that these values indicate
the lower bound at which the next group or interval begins. Take a simple
example, a numeric variable with values from 1 to 9. The median is 5,
thus the first interval ranges from 1-4 and is recoded into 1, while 5-9
is recoded into 2 (compare cbind(1:9, categorize(1:9))). The same variable,
using split = "quantile" and n_groups = 3, would define breaks at 3.67
and 6.33 (see quantile(1:9, probs = c(1/3, 2/3))), which means that values
from 1 to 3 belong to the first interval and are recoded into 1 (because
the next interval starts at 3.67), 4 to 6 into 2 and 7 to 9 into 3.
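The break logic described above can be reproduced with base R alone. The following is a minimal sketch that uses findInterval() to mimic the "break is the lower bound of the next group" rule. It is not the package's internal implementation, and the variable names are chosen for illustration only.

```r
# Base R sketch of the break logic (assumption: not datawizard's internals).
x <- 1:9

# Median split: the break (5) is the lower bound of the second group.
median_break <- median(x)
median_groups <- findInterval(x, median_break) + 1
cbind(x, median_groups)

# Split into three groups: breaks at about 3.67 and 6.33.
tertile_breaks <- quantile(x, probs = c(1 / 3, 2 / 3))
tertile_groups <- findInterval(x, tertile_breaks) + 1
cbind(x, tertile_groups)
```

Since findInterval() counts how many breaks are less than or equal to each value, a value equal to a break falls into the higher group, which matches the exclusive-break behaviour described in this section.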
Recoding into groups with equal size or range
split = "equal_length" and split = "equal_range" try to divide the
range of x into intervals of similar (or same) length. The difference is
that split = "equal_length" divides the range of x into n_groups
pieces, thereby defining the intervals used as breaks (hence, it is
equivalent to cut(x, breaks = n_groups)), while split = "equal_range"
cuts x into intervals that all have the length of range, where the
first interval by default starts at 1. The lowest (or starting) value
of that interval can be defined using the lowest argument.
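To make the difference between the two approaches concrete, here is a base R sketch. The first part uses cut(x, breaks = n_groups), which the text names as the equivalent of "equal_length". The second part builds fixed-width breaks with seq() as an assumed approximation of "equal_range"; the names interval_width and range_breaks are hypothetical and do not come from the package.

```r
# Base R sketch; the "equal_range" part is an approximation, not package code.
x <- 1:100

# "equal_length": let cut() divide the range of x into n_groups intervals.
n_groups <- 5
by_length <- as.integer(cut(x, breaks = n_groups))

# "equal_range": intervals of fixed width, with the first one starting at `lowest`.
lowest <- 1
interval_width <- 20
range_breaks <- seq(lowest, max(x) + interval_width, by = interval_width)
by_range <- findInterval(x, range_breaks)

table(by_length)
table(by_range)
```

For values from 1 to 100, both approaches yield the same five groups (1-20, 21-40 and so on). They would differ if the range of x were not a multiple of the chosen interval width.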
Selection of variables - the select argument
For most functions that have a select argument (including this function),
the complete input data frame is returned, even when select only selects
a range of variables. That is, the function is only applied to those variables
that have a match in select, while all other variables remain unchanged.
In other words: for this function, select will not omit any non-included
variables, so that the returned data frame will include all variables
from the input data frame.
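The "recode only the selected columns, return everything" behaviour can be sketched in base R. The helper median_split_selected() below is hypothetical and exists only for illustration. It applies a simple median split, not the full set of options of categorize().

```r
# Hypothetical helper, for illustration only; not part of datawizard.
median_split_selected <- function(data, select) {
  for (column in select) {
    values <- data[[column]]
    # Values at or above the median go into group 2, all others into group 1.
    data[[column]] <- findInterval(values, median(values)) + 1
  }
  data
}

d <- data.frame(a = 1:9, b = 11:19, id = letters[1:9])
out <- median_split_selected(d, select = "a")

names(out) # all three columns are still present
out$b      # the unselected column is unchanged
```

With the package loaded, a call such as categorize(d, select = "a") is expected to follow the same pattern: only a is recoded, and b and id are returned as they were.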
See Also
Functions to rename stuff: data_rename(), data_rename_rows(), data_addprefix(), data_addsuffix()
Functions to reorder or remove columns: data_reorder(), data_relocate(), data_remove()
Functions to reshape, pivot or rotate data frames: data_to_long(), data_to_wide(), data_rotate()
Functions to recode data: rescale(), reverse(), categorize(), recode_values(), slide()
Functions to standardize, normalize, rank-transform: center(), standardize(), normalize(), ranktransform(), winsorize()
Split and merge data frames: data_partition(), data_merge()
Functions to find or select columns: data_select(), extract_column_names()
Functions to filter rows: data_match(), data_filter()
Examples
set.seed(123)
x <- sample(1:10, size = 50, replace = TRUE)
table(x)
# by default, at median
table(categorize(x))
# into 3 groups, based on distribution (quantiles)
table(categorize(x, split = "quantile", n_groups = 3))
# into 3 groups, user-defined breaks
table(categorize(x, split = c(3, 5)))
set.seed(123)
x <- sample(1:100, size = 500, replace = TRUE)
# into 5 groups, try to recode into intervals of similar length,
# i.e. the range within groups is the same for all groups
table(categorize(x, split = "equal_length", n_groups = 5))
# into 5 groups, try to return same range within groups,
# i.e. 1-20, 21-40, 41-60, etc. Since the range of "x" is
# 1-100, and we have a range of 20, this results in 5
# groups, and thus is, for this particular case, identical
# to the previous result.
table(categorize(x, split = "equal_range", range = 20))
# return factor with value labels instead of numeric values
set.seed(123)
x <- sample(1:10, size = 30, replace = TRUE)
categorize(x, "equal_length", n_groups = 3)
categorize(x, "equal_length", n_groups = 3, labels = c("low", "mid", "high"))
# cut numeric into groups with the mean or median as a label name
x <- sample(1:10, size = 30, replace = TRUE)
categorize(x, "equal_length", n_groups = 3, labels = "mean")
categorize(x, "equal_length", n_groups = 3, labels = "median")