R: Fast (Grouped) Distinct Value Count for Matrix-Like Objects

fndistinct {collapse}

R Documentation

Fast (Grouped) Distinct Value Count for Matrix-Like Objects

Description

fndistinct is a generic function that (column-wise) computes the number of distinct values in x, (optionally) grouped by g. It is significantly faster than length(unique(x)). The TRA argument can further be used to transform x using its (grouped) distinct value count.

Usage

fndistinct(x, ...)

## Default S3 method:
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'matrix'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'data.frame'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'grouped_df'
fndistinct(x, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = FALSE, keep.group_vars = TRUE, nthreads = .op[["nthreads"]], ...)

Arguments

`x`	a vector, matrix, data frame or grouped data frame (class 'grouped_df').
`g`	a factor, `GRP` object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a `GRP` object) used to group `x`.
`TRA`	an integer or quoted operator indicating the transformation to perform: 0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See `TRA`.
`na.rm`	logical. `TRUE`: Skip missing values in `x` (faster computation). `FALSE`: Also consider 'NA' as one distinct value.
`use.g.names`	logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.
`nthreads`	integer. The number of threads to utilize. Parallelism is across groups for grouped computations and at the column-level otherwise.
`drop`	matrix and data.frame method: Logical. `TRUE` drops dimensions and returns an atomic vector if `g = NULL` and `TRA = NULL`.
`keep.group_vars`	grouped_df method: Logical. `FALSE` removes grouping variables after computation.
`...`	arguments to be passed to or from other methods. If `TRA` is used, passing `set = TRUE` will transform data by reference and return the result invisibly.
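
As a rough illustration of the `TRA` argument, the sketch below reproduces in base R what a grouped "replace" transformation amounts to: each element of `x` is replaced by the distinct value count of its group. The data, the grouping vector, and the use of `ave()` are illustrative assumptions, not the package's implementation. The `fndistinct` call at the end runs only if collapse is installed.

```r
# Illustrative data (hypothetical): six values in two groups
x <- c(1, 1, 2, 3, 3, 3)
g <- c("a", "a", "a", "b", "b", "b")

# Base R stand-in for a grouped "replace" transformation:
# every element is replaced by the distinct value count of its group
replaced <- ave(x, g, FUN = function(v) length(unique(v)))
print(replaced)  # 2 2 2 1 1 1

# Corresponding collapse call, run only if the package is available
if (requireNamespace("collapse", quietly = TRUE)) {
  print(collapse::fndistinct(x, g, TRA = "replace"))
}
```

The example avoids missing values on purpose, since the different replacement operators ("na", "fill", "replace") treat them differently; see `TRA` for details.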

Details

fndistinct implements a fast C-level hashing algorithm, inspired by the kit package, to count the number of distinct values.

If na.rm = TRUE (the default), missing values are skipped, which yields substantial performance gains in data with many missing values. If na.rm = FALSE, missing values are treated like any other value and read into the hash-map. Thus, with na.rm = TRUE a numeric vector c(1.25,NaN,3.56,NA) has a distinct value count of 2, whereas with na.rm = FALSE it has a distinct value count of 4 (NaN and NA are counted separately).
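
The two counts quoted above can be checked with a base R analogue. This is a minimal sketch of the semantics only, using `unique()` rather than the package's hashing code. The names `n_skip` and `n_keep` are introduced here for illustration, and the `fndistinct` calls run only if collapse is installed.

```r
x <- c(1.25, NaN, 3.56, NA)

# na.rm = TRUE analogue: drop NA and NaN before counting
# (is.na() is TRUE for both NA and NaN)
n_skip <- length(unique(x[!is.na(x)]))

# na.rm = FALSE analogue: NA and NaN each count as a distinct value
n_keep <- length(unique(x))

print(c(skip = n_skip, keep = n_keep))  # 2 and 4

# Corresponding collapse calls, run only if the package is available
if (requireNamespace("collapse", quietly = TRUE)) {
  print(collapse::fndistinct(x, na.rm = TRUE))
  print(collapse::fndistinct(x, na.rm = FALSE))
}
```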

fndistinct preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (e.g. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.

Value

Integer. The number of distinct values in x, grouped by g, or (if TRA is used) x transformed by its distinct value count, grouped by g.
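
To make the shape of the result concrete, the following base R sketch computes the same kind of counts for a small, made-up data frame: one count per column when ungrouped, and one count per group for a single grouped column. The helper `n_distinct`, the data frame `df`, and the grouping vector `g` are hypothetical names used only for this illustration, and the base R functions stand in for, but do not reproduce, the package's internals.

```r
# Illustrative data (hypothetical)
df <- data.frame(a = c(1, 1, 2, NA), b = c("u", "v", "v", "v"))
g  <- c("g1", "g1", "g2", "g2")

# Count distinct non-missing values (mirrors na.rm = TRUE)
n_distinct <- function(v) length(unique(v[!is.na(v)]))

# Ungrouped, column-wise: one integer count per column
col_counts <- vapply(df, n_distinct, integer(1))
print(col_counts)  # a = 2, b = 2

# Grouped, single column: one count per group, named by group
grp_counts <- tapply(df$a, g, n_distinct)
print(grp_counts)  # g1 = 1, g2 = 1

# Corresponding collapse calls, run only if the package is available
if (requireNamespace("collapse", quietly = TRUE)) {
  print(collapse::fndistinct(df))
  print(collapse::fndistinct(df$a, g))
}
```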

Examples

## default vector method
fndistinct(airquality$Solar.R)                   # Simple distinct value count
fndistinct(airquality$Solar.R, airquality$Month) # Grouped distinct value count

## data.frame method
fndistinct(airquality)
fndistinct(airquality, airquality$Month)
fndistinct(wlddev)                               # Works with data of all types!
head(fndistinct(wlddev, wlddev$iso3c))

## matrix method
aqm <- qM(airquality)
fndistinct(aqm)                                  # Also works for character or logical matrices
fndistinct(aqm, airquality$Month)

## method for grouped data frames - created with dplyr::group_by or fgroup_by
airquality |> fgroup_by(Month) |> fndistinct()
wlddev |> fgroup_by(country) |>
             fselect(PCGDP,LIFEEX,GINI,ODA) |> fndistinct()

[Package collapse version 2.0.15 Index]