R: Apply Functions Across Multiple Columns

across {collapse}

R Documentation

Apply Functions Across Multiple Columns

Description

across() can be used inside fmutate and fsummarise to apply one or more functions to a selection of columns. It is very similar to dplyr::across overall, but it does not support some rlang features, adds some features (arguments) of its own, and is optimized to work with collapse's .FAST_FUN, yielding much faster computations.

Usage

across(.cols = NULL, .fns, ..., .names = NULL,
       .apply = "auto", .transpose = "auto")

# acr(...) can be used to abbreviate across(...)

Arguments

`.cols`	select columns using column names and expressions (e.g. `a:b` or `c(a, b, c:f)`), column indices, logical vectors, or functions yielding a logical value e.g. `is.numeric`. `NULL` applies functions to all columns except for grouping columns.
`.fns`	A function, character vector of functions or list of functions. Vectors / lists can be named to yield alternative names in the result (see `.names`). This argument is evaluated inside `substitute()`, and the content (not the names of vectors/lists) is checked against `.FAST_FUN` and `.OPERATOR_FUN`. Matching functions receive vectorized execution, other functions are applied to the data in a standard way.
`...`	further arguments to `.fns`. Arguments are evaluated in the data environment and split by groups as well (for non-vectorized functions, if of the same length as the data).
`.names`	controls the naming of computed columns. `NULL` generates names of the form `coli_funj` if multiple functions are used. `.names = TRUE` enables this for a single function, `.names = FALSE` disables it for multiple functions (sensible for functions such as `.OPERATOR_FUN` that rename columns (if `.apply = FALSE`)). Setting `.names = "flip"` generates names of the form `funj_coli`. It is also possible to supply a function with two arguments for column and function names e.g. `function(c, f) paste0(f, "_", c)`. Finally, you can supply a custom vector of names which must match `length(.cols) * length(.fns)`.
`.apply`	controls whether functions are applied column-by-column (`TRUE`) or to multiple columns at once (`FALSE`). The default, `"auto"`, does the latter for vectorized functions, which have an efficient data frame method. It can also be sensible to use `.apply = FALSE` for non-vectorized functions, especially multivariate functions like `lm` or `pwcor`, or functions renaming the data. See Examples.
`.transpose`	with multiple `.fns`, `.transpose` controls whether the result is ordered first by column, then by function (`TRUE`), or vice-versa (`FALSE`). `"auto"` does the former if all functions yield results of the same dimensions (dimensions may differ if `.apply = FALSE`). See Examples.
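
As a minimal sketch of the `.names` options described above, the following compares the default naming with `.names = "flip"`. It uses the built-in `mtcars` data rather than `wlddev`, and the grouping by `cyl` and the object names `default_out` and `flipped_out` are only illustrative choices.

```r
library(collapse)

# Two functions on two columns, grouped by number of cylinders.
# Default naming: column name first, then the function name (coli_funj).
default_out <- mtcars |> fgroup_by(cyl) |>
    fsummarise(across(c(mpg, hp), list(mu = fmean, sd = fsd)))

# .names = "flip": function name first, then the column name (funj_coli).
flipped_out <- mtcars |> fgroup_by(cyl) |>
    fsummarise(across(c(mpg, hp), list(mu = fmean, sd = fsd), .names = "flip"))

names(default_out)
names(flipped_out)
```

Because the list passed to `.fns` is named, the names `mu` and `sd` are used in the result instead of `fmean` and `fsd`.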

Note

across does not support purrr-style lambdas. It also does not support dplyr-style predicate selection such as across(where(is.numeric), sum); simply use across(is.numeric, sum) instead. In contrast to dplyr, you can also compute on grouping columns.
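
A short sketch of the predicate selection mentioned in this note, using the built-in `iris` data (an illustrative choice; `num_means` and `by_species` are hypothetical object names). The predicate `is.numeric` is passed directly as `.cols`, without a `where()` wrapper.

```r
library(collapse)

# Select every numeric column with a bare predicate function
num_means <- fsummarise(iris, across(is.numeric, fmean))

# The same selection in a grouped summary; Species is a factor,
# so it is not picked up by is.numeric and only serves as the group id
by_species <- iris |> fgroup_by(Species) |>
    fsummarise(across(is.numeric, fmean))

num_means
by_species
```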

Examples

# Basic (Weighted) Summaries
fsummarise(wlddev, across(PCGDP:GINI, fmean, w = POP))

wlddev |> fgroup_by(region, income) |>
    fsummarise(across(PCGDP:GINI, fmean, w = POP))

# Note that for these we don't actually need across...
fselect(wlddev, PCGDP:GINI) |> fmean(w = wlddev$POP, drop = FALSE)
wlddev |> fgroup_by(region, income) |>
    fselect(PCGDP:GINI, POP) |> fmean(POP, keep.w = FALSE)
collap(wlddev, PCGDP + LIFEEX + GINI ~ region + income, w = ~ POP, keep.w = FALSE)

# But if we want to use some base R function that requires argument splitting...
wlddev |> na_omit(cols = "POP") |> fgroup_by(region, income) |>
    fsummarise(across(PCGDP:GINI, weighted.mean, w = POP, na.rm = TRUE))

# Or if we want to apply different functions...
wlddev |> fgroup_by(region, income) |>
    fsummarise(across(PCGDP:GINI, list(mu = fmean, sd = fsd), w = POP),
               POP_sum = fsum(POP), OECD = fmean(OECD))
# Note that the above still detects fmean as a fast function. The names of the list
# are irrelevant, but the function name must be typed or passed as a character vector;
# otherwise functions will be executed by groups, e.g. function(x) fmean(x) won't vectorize

# Same, naming in a different way
wlddev |> fgroup_by(region, income) |>
    fsummarise(across(PCGDP:GINI, list(mu = fmean, sd = fsd), w = POP, .names = "flip"),
               sum_POP = fsum(POP), OECD = fmean(OECD))

# Or we want to do more advanced things,
# such as nesting data frames...
qTBL(wlddev) |> fgroup_by(region, income) |>
    fsummarise(across(c(PCGDP, LIFEEX, ODA),
               function(x) list(Nest = list(x)),
               .apply = FALSE))
# Or fitting linear models...
qTBL(wlddev) |> fgroup_by(region, income) |>
    fsummarise(across(c(PCGDP, LIFEEX, ODA),
               function(x) list(Mods = list(lm(PCGDP ~., x))),
               .apply = FALSE))
# Or computing grouped correlation matrices
qTBL(wlddev) |> fgroup_by(region, income) |>
    fsummarise(across(c(PCGDP, LIFEEX, ODA),
      function(x) qDF(pwcor(x), "Variable"), .apply = FALSE))

# Here calculating 1- and 10-year lags and growth rates of these variables
qTBL(wlddev) |> fgroup_by(country) |>
    fmutate(across(c(PCGDP, LIFEEX, ODA), list(L, G),
                   n = c(1, 10), t = year, .names = FALSE))

# Same but variables in different order
qTBL(wlddev) |> fgroup_by(country) |>
    fmutate(across(c(PCGDP, LIFEEX, ODA), list(L, G), n = c(1, 10),
                   t = year, .names = FALSE, .transpose = FALSE))