fndistinct {collapse} | R Documentation |
Fast (Grouped) Distinct Value Count for Matrix-Like Objects
Description
fndistinct is a generic function that (column-wise) computes the number of distinct values in x, (optionally) grouped by g. It is significantly faster than length(unique(x)). The TRA argument can further be used to transform x using its (grouped) distinct value count.
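As a minimal sketch of the relationship to length(unique(x)) mentioned above, the snippet below compares the two on a small vector without missing values and then shows the column-wise behaviour on a data frame. The objects x, df, n_fast, n_base and by_col are made up for illustration and do not appear elsewhere in this documentation.

```r
library(collapse)

# Hypothetical toy vector with no missing values, so both approaches agree
x <- c(3, 1, 3, 2, 1)
n_fast <- fndistinct(x)        # hashing-based count
n_base <- length(unique(x))    # base R reference

# Hypothetical toy data frame: one count per column, columns may differ in type
df <- data.frame(a = c(1, 1, 2), b = c("u", "v", "w"))
by_col <- fndistinct(df)

print(c(fast = n_fast, base = n_base))
print(by_col)
```

Note that the two approaches can differ when x contains missing values, because length(unique(x)) counts NA as a value while fndistinct skips it when na.rm = TRUE.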
Usage
fndistinct(x, ...)
## Default S3 method:
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'matrix'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'data.frame'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)
## S3 method for class 'grouped_df'
fndistinct(x, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = FALSE, keep.group_vars = TRUE, nthreads = .op[["nthreads"]], ...)
Arguments
x |
a vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
TRA |
an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
na.rm |
logical. |
use.g.names |
logical. Make group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables. |
nthreads |
integer. The number of threads to utilize. Parallelism is across groups for grouped computations and at the column-level otherwise. |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
... |
arguments to be passed to or from other methods. If |
Details
fndistinct implements a fast C-level hashing algorithm, inspired by the kit package, to find the number of distinct values.
If na.rm = TRUE (the default), missing values are skipped, which yields substantial performance gains on data with many missing values. If na.rm = FALSE, missing values are treated like any other value and read into the hash map. Thus, with na.rm = TRUE, the numeric vector c(1.25, NaN, 3.56, NA) has a distinct value count of 2, whereas with na.rm = FALSE it has a distinct value count of 4.
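The na.rm behaviour described above can be sketched as follows, using the same vector as in the text. The base R expressions are only a reference for what is being counted and are not how fndistinct is implemented; the variable names are illustrative.

```r
library(collapse)

x <- c(1.25, NaN, 3.56, NA)

# na.rm is passed explicitly so the result does not depend on global options
n_skip <- fndistinct(x, na.rm = TRUE)    # NA and NaN are skipped
n_keep <- fndistinct(x, na.rm = FALSE)   # NA and NaN are counted as values

# Base R reference counts
ref_skip <- length(unique(x[!is.na(x)])) # is.na() is TRUE for both NA and NaN
ref_keep <- length(unique(x))            # unique() keeps NA and NaN apart

print(c(skip = n_skip, keep = n_keep))
```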
fndistinct preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (e.g. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.
Value
Integer. The number of distinct values in x, grouped by g, or (if TRA is used) x transformed by its distinct value count, grouped by g.
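To illustrate the two kinds of return value, here is a small sketch with made-up data: a grouped count, and the same data transformed with TRA = "replace" and TRA = "/". The base R ave() call is only a reference for the expected per-element counts; x, g and the result names are hypothetical.

```r
library(collapse)

# Hypothetical toy data: group "a" has 2 distinct values, group "b" has 1
x <- c(1, 1, 2, 5, 5, 5)
g <- c("a", "a", "a", "b", "b", "b")

counts   <- fndistinct(x, g)                   # one count per group
replaced <- fndistinct(x, g, TRA = "replace")  # each element replaced by its group's count
scaled   <- fndistinct(x, g, TRA = "/")        # each element divided by its group's count

# Base R reference: per-element distinct count within each group
ref <- ave(x, g, FUN = function(v) length(unique(v)))

print(counts)
print(replaced)
print(scaled)
```

Without TRA the result has one entry per group; with TRA it has the same length as x.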
See Also
fnunique, fnobs, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
fndistinct(airquality$Solar.R) # Simple distinct value count
fndistinct(airquality$Solar.R, airquality$Month) # Grouped distinct value count
## data.frame method
fndistinct(airquality)
fndistinct(airquality, airquality$Month)
fndistinct(wlddev) # Works with data of all types!
head(fndistinct(wlddev, wlddev$iso3c))
## matrix method
aqm <- qM(airquality)
fndistinct(aqm) # Also works for character or logical matrices
fndistinct(aqm, airquality$Month)
## method for grouped data frames - created with dplyr::group_by or fgroup_by
airquality |> fgroup_by(Month) |> fndistinct()
wlddev |> fgroup_by(country) |>
  fselect(PCGDP, LIFEEX, GINI, ODA) |> fndistinct()