fnth-fmedian {collapse}    R Documentation

Fast (Grouped, Weighted) N'th Element/Quantile for Matrix-Like Objects

Description

fnth (column-wise) returns the n'th smallest element from a set of unsorted elements x corresponding to an integer index (n), or to a probability between 0 and 1. If n is passed as a probability, ties can be resolved using the lower, upper, or average of the possible elements, or, since v1.9.0, continuous quantile estimation. The new default is quantile type 7 (as in quantile). For n > 1, the lower element is always returned (as in sort(x, partial = n)[n]). See Details.

fmedian is a simple wrapper around fnth that fixes n = 0.5 and defaults to ties = "mean", i.e. it averages eligible elements. See Details.

Usage

fnth(x, n = 0.5, ...)
fmedian(x, ...)

## Default S3 method:
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, ties = "q7", nthreads = .op[["nthreads"]],
     o = NULL, check.o = is.null(attr(o, "sorted")), ...)
## Default S3 method:
fmedian(x, ..., ties = "mean")

## S3 method for class 'matrix'
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'matrix'
fmedian(x, ..., ties = "mean")

## S3 method for class 'data.frame'
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'data.frame'
fmedian(x, ..., ties = "mean")

## S3 method for class 'grouped_df'
fnth(x, n = 0.5, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
     ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'grouped_df'
fmedian(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
        use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
        ties = "mean", nthreads = .op[["nthreads"]], ...)

Arguments

x

a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').

n

the element to return using a single integer index such that 1 < n < NROW(x), or a probability 0 < n < 1. See Details.

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

w

a numeric vector of (non-negative) weights, may contain missing values only where x is also missing.

TRA

an integer or quoted operator indicating the transformation to perform: 0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.

na.rm

logical. Skip missing values in x. Defaults to TRUE and implemented at very little computational cost. If na.rm = FALSE, NA is returned when a missing value is encountered.

use.g.names

logical. Make group-names and add them to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.tables.

ties

an integer or character string specifying the method to resolve ties between adjacent qualifying elements:

Int.   String   Description
1      "mean"   take the arithmetic mean of all qualifying elements.
2      "min"    take the smallest of the elements.
3      "max"    take the largest of the elements.
5-9    "qn"     continuous quantile types 5-9, see fquantile.

nthreads

integer. The number of threads to utilize. Parallelism is across groups for grouped computations on vectors and data frames, and at the column-level otherwise. See Details.

o

integer. A valid ordering of x, e.g. radixorder(x). With groups, the grouping needs to be accounted e.g. radixorder(g, x).

check.o

logical. TRUE checks that each element of o is within [1, length(x)]. The default uses the fact that orderings from radixorder have a "sorted" attribute, which lets fnth infer that the ordering is valid. The length and data type of o are always checked, regardless of check.o.

drop

matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation.

keep.w

grouped_df method: Logical. Retain sum of weighting variable after computation (if contained in grouped_df).

stub

character. If keep.w = TRUE and stub = TRUE (default), the summed weights column is prefixed by "sum.". Users can specify a different prefix through this argument, or set it to FALSE to avoid prefixing.

...

for fmedian: further arguments passed to fnth (apart from n). If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly.

Details

In v1.9.0, fnth was completely rewritten in C and offers significantly enhanced speed and functionality. It uses a combination of quickselect, quicksort, and radixsort algorithms, combined with several (weighted) quantile estimation methods and, where possible, OpenMP multithreading. This synthesis can be summarised as follows:

If n > 1, the result is equivalent to (column-wise) sort(x, partial = n)[n]. Internally, n is converted to a probability using p = (n-1)/(NROW(x)-1), and that probability is applied to the set of non-missing elements to find the as.integer(p*(fnobs(x)-1))+1L'th element (which corresponds to option ties = "min"). When using grouped computations with n > 1, n is transformed to a probability p = (n-1)/(NROW(x)/ng-1), where ng is the number of unique groups in g.
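The index-to-probability rule above can be sketched in base R. This is only an illustration of the documented formula, not the C implementation: nth_lower is a hypothetical helper, sum(!is.na(x)) stands in for fnobs(x), and the small tolerance is an assumption added here to guard against floating-point truncation.

```r
# Hypothetical helper (not part of collapse): mimics the documented n > 1 rule.
nth_lower <- function(x, n) {
  p <- (n - 1) / (NROW(x) - 1)            # integer index -> probability
  nobs <- sum(!is.na(x))                  # number of non-missing elements
  # Tolerance is an assumption of this sketch, guarding against truncation.
  k <- as.integer(p * (nobs - 1) + sqrt(.Machine$double.eps)) + 1L
  sort(x)[k]                              # sort() drops NA by default
}

x_full <- c(9, 3, 7, 1, 5, 8, 2)
nth_lower(x_full, 3)                      # third smallest element
sort(x_full, partial = 3)[3]              # base R equivalent

x_na <- c(9, 3, NA, 7, 1, 5, 8, 2)        # 8 rows, 7 non-missing
nth_lower(x_na, 3)                        # p = 2/7 applied to the 7 observations
```

With missing values the probability is computed from NROW(x) but applied to the non-missing elements only, so the selected position can shift relative to the complete-data case.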

If weights are used and ties = "q5"-"q9", weighted continuous quantile estimation is done as described in fquantile.
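The weighted estimators are documented in fquantile and are not reproduced here. For orientation only, the unweighted quantile type 7 (the default ties = "q7") can be sketched in base R and compared with stats::quantile. The helper name q7 is hypothetical and the sketch assumes 0 <= p <= 1 and at least one non-missing value.

```r
# Hypothetical helper (not part of collapse): unweighted quantile type 7.
q7 <- function(x, p) {
  x <- sort(x)                            # sort() drops NA by default
  h <- (length(x) - 1) * p + 1            # fractional position in the sorted data
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])      # linear interpolation between neighbours
}

x <- c(2.5, 7, 1, 4, 9, 3)
q7(x, 0.75)
unname(quantile(x, 0.75, type = 7))       # base R reference
```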

For ties %in% c("mean", "min", "max"), a target partial sum of weights p*sum(w) is calculated, and the weighted n'th element is the element k such that all elements smaller than k have a sum of weights <= p*sum(w), and all elements larger than k have a sum of weights <= (1 - p)*sum(w). If the partial sum of weights (p*sum(w)) is reached exactly for some element k, then (summing from the lower end) both k and k+1 qualify as the weighted n'th element. If the weight of element k+1 is zero, then k, k+1 and k+2 qualify, and so on. If n > 1, k is chosen (consistent with the unweighted behavior). If 0 < n < 1, the ties option determines how such conflicts are resolved, yielding the lower (ties = "min": k), upper (ties = "max": k+2) or average (ties = "mean": mean(k, k+1, k+2)) weighted n'th element.

Thus, in the presence of zero weights, the weighted median (default ties = "mean") can be an arithmetic average of more than 2 qualifying elements. Users may prefer a quantile-based weighted median by setting ties = "q5"-"q9", which is a continuous function of p and ignores elements with zero weights.
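The qualifying-element definition above can be applied literally in a few lines of base R. The sketch below is an illustration under stated assumptions, not the package's algorithm: weighted_nth is a hypothetical helper, and it assumes distinct values in x, non-negative weights, no missing values, and a target p*sum(w) that is exactly representable (here p = 0.5 with integer weights).

```r
# Hypothetical helper (not part of collapse): applies the documented definition literally.
weighted_nth <- function(x, w, p, ties = c("mean", "min", "max")) {
  ties <- match.arg(ties)
  o <- order(x)
  x <- x[o]
  w <- w[o]
  total <- sum(w)
  below <- cumsum(w) - w                  # weight of all smaller elements
  above <- total - cumsum(w)              # weight of all larger elements
  ok <- below <= p * total & above <= (1 - p) * total
  switch(ties,
         mean = mean(x[ok]),
         min  = min(x[ok]),
         max  = max(x[ok]))
}

x <- 1:5
w <- c(1, 1, 0, 1, 1)                     # element 3 has zero weight
weighted_nth(x, w, 0.5, "min")            # lower qualifying element (k)
weighted_nth(x, w, 0.5, "max")            # upper qualifying element (k+2)
weighted_nth(x, w, 0.5, "mean")           # average of k, k+1 and k+2
```

Here the target 0.5*sum(w) = 2 is reached exactly at the second element and the third element has zero weight, so elements 2, 3 and 4 all qualify.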

For data frames, column-attributes and overall attributes are preserved if g is used or drop = FALSE.

Value

The (w weighted) n'th element/quantile of x, grouped by g, or (if TRA is used) x transformed by its (grouped, weighted) n'th element/quantile.

See Also

fquantile, fmean, fmode, Fast Statistical Functions, Collapse Overview

Examples

## default vector method
mpg <- mtcars$mpg
fnth(mpg)                         # Simple nth element: Median (same as fmedian(mpg))
fnth(mpg, 5)                      # 5th smallest element
sort(mpg, partial = 5)[5]         # Same using base R, fnth is 2x faster.
fnth(mpg, 0.75)                   # Third quartile
fnth(mpg, 0.75, w = mtcars$hp)    # Weighted third quartile: Weighted by hp
fnth(mpg, 0.75, TRA = "-")        # Simple transformation: Subtract third quartile
fnth(mpg, 0.75, mtcars$cyl)             # Grouped third quartile
fnth(mpg, 0.75, mtcars[c(2,8:9)])       # More groups..
g <- GRP(mtcars, ~ cyl + vs + am)       # Precomputing groups gives more speed !
fnth(mpg, 0.75, g)
fnth(mpg, 0.75, g, mtcars$hp)           # Grouped weighted third quartile
fnth(mpg, 0.75, g, TRA = "-")           # Groupwise subtract third quartile
fnth(mpg, 0.75, g, mtcars$hp, "-")      # Groupwise subtract weighted third quartile

## data.frame method
fnth(mtcars, 0.75)
head(fnth(mtcars, 0.75, TRA = "-"))
fnth(mtcars, 0.75, g)
fnth(fgroup_by(mtcars, cyl, vs, am), 0.75)   # Another way of doing it..
fnth(mtcars, 0.75, g, use.g.names = FALSE)   # No row-names generated

## matrix method
m <- qM(mtcars)
fnth(m, 0.75)
head(fnth(m, 0.75, TRA = "-"))
fnth(m, 0.75, g) # etc..

## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75)
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75, hp)         # Weighted
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75, TRA = "/")  # Divide by third quartile
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg, hp) |>    # Faster selecting
      fnth(0.75, hp, "/")  # Divide mpg by its third weighted group-quartile, using hp as weights

# Efficient grouped estimation of multiple quantiles
mtcars |> fgroup_by(cyl,vs,am) |>
    fmutate(o = radixorder(GRPid(), mpg)) |>
    fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o),
               mpg_median = fmedian(mpg, o = o),
               mpg_Q3 = fnth(mpg, 0.75, o = o))

## fmedian()
fmedian(mpg)                         # Simple median value
fmedian(mpg, w = mtcars$hp)          # Weighted median: Weighted by hp
fmedian(mpg, TRA = "-")              # Simple transformation: Subtract median value
fmedian(mpg, mtcars$cyl)             # Grouped median value
fmedian(mpg, mtcars[c(2,8:9)])       # More groups..
fmedian(mpg, g)
fmedian(mpg, g, mtcars$hp)           # Grouped weighted median
fmedian(mpg, g, TRA = "-")           # Groupwise subtract median value
fmedian(mpg, g, mtcars$hp, "-")      # Groupwise subtract weighted median value

## data.frame method
fmedian(mtcars)
head(fmedian(mtcars, TRA = "-"))
fmedian(mtcars, g)
fmedian(fgroup_by(mtcars, cyl, vs, am))   # Another way of doing it..
fmedian(mtcars, g, use.g.names = FALSE)   # No row-names generated

## matrix method
fmedian(m)
head(fmedian(m, TRA = "-"))
fmedian(m, g) # etc..

## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fmedian()
mtcars |> fgroup_by(cyl,vs,am) |> fmedian(hp)           # Weighted
mtcars |> fgroup_by(cyl,vs,am) |> fmedian(TRA = "-")    # De-median
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg, hp) |>   # Faster selecting
      fmedian(hp, "-")  # Weighted de-median mpg, using hp as weights

[Package collapse version 2.0.13 Index]