find_pbm_diff {diffman}    R Documentation

Run the full procedure for detecting observations at risk

Description

Given a table of observations classified according to two different nomenclatures (z1 and z2), determines which observations are at risk of disclosure through the differentiation technique.

Usage

find_pbm_diff(
  t_ind,
  threshold,
  max_agregate_size,
  save_file = NULL,
  simplify = TRUE,
  verbose = TRUE
)

Arguments

t_ind

The table of observations (data.frame or data.table). Each row corresponds to one observation, and for each observation the table must give the category of the z1 nomenclature and the category of the z2 nomenclature to which it belongs.

threshold

Strictly positive integer giving the confidentiality threshold. An observation is considered at risk if information can be deduced about an aggregate of n observations, where n < threshold.

max_agregate_size

Integer giving the maximal size of the aggregates that are tested exhaustively. If this number is too large (greater than 30), the computation may not finish, because the number of combinations to test can become very large, and the RAM may be exhausted.

save_file

Character string giving the suffix of the file name under which the results are saved. If NULL (default), the results are not written to disk. The root path is the working directory (getwd()).

simplify

Boolean. If TRUE (default), the graph is simplified (merging + splitting) before the search. Otherwise, the exhaustive search is applied directly to the original graph.

verbose

Boolean. If TRUE (default), each step of the process is reported and progress bars give an estimate of the time remaining.
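To give a rough sense of why max_agregate_size should stay small, the sketch below counts the subsets of categories that a naive exhaustive search would have to enumerate. It uses only base R's choose(). The helper count_candidate_subsets and the figure of 40 categories are assumptions made for illustration, and the result is an upper bound, not the number of aggregates the package actually tests.

```r
# Illustrative upper bound only: number of non-empty subsets containing at
# most max_size categories, out of n_categories (hypothetical helper).
count_candidate_subsets <- function(n_categories, max_size) {
  sizes <- seq_len(min(max_size, n_categories))
  sum(choose(n_categories, sizes))
}

# Assumed number of categories, chosen arbitrarily for the illustration.
n_categories <- 40

# The count grows very quickly with the maximal aggregate size.
sapply(c(5, 10, 15, 20), function(k) count_candidate_subsets(n_categories, k))
```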

Details

Observations at risk from differentiation are those for which information can be deduced about aggregates smaller than the confidentiality threshold. For example, suppose the confidentiality threshold is 10. If, by taking the difference between some categories of z1 and some categories of z2, one can deduce the value of a variable for a group of 5 observations, then those 5 observations are considered at risk.
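The idea can be illustrated by hand on a toy table, without calling the package. In the sketch below, the table t_toy, its id and value columns, and the category labels are all hypothetical; only the roles of z1, z2 and the threshold come from this page. Every category meets the threshold on its own, yet subtracting the total of z2 category "X" from the total of z1 category "A" reveals information about a group of only 3 observations.

```r
# Toy data (hypothetical): 35 observations classified in two nomenclatures.
t_toy <- data.frame(
  id    = 1:35,
  z1    = c(rep("A", 15), rep("B", 20)),
  z2    = c(rep("X", 12), rep("Y", 23)),
  value = 1:35
)
threshold <- 10

# Each category, taken alone, contains at least `threshold` observations.
table(t_toy$z1)  # A: 15, B: 20
table(t_toy$z2)  # X: 12, Y: 23

# Every observation of X also belongs to A, so total(A) - total(X) is the
# total of the observations that are in A but not in X.
x_in_a        <- all(t_toy$z1[t_toy$z2 == "X"] == "A")
diff_ids      <- t_toy$id[t_toy$z1 == "A" & t_toy$z2 != "X"]
deduced_total <- sum(t_toy$value[t_toy$z1 == "A"]) -
  sum(t_toy$value[t_toy$z2 == "X"])

# The difference covers fewer observations than the threshold allows,
# so these observations are at risk.
at_risk <- x_in_a && length(diff_ids) < threshold
```

Here diff_ids holds the three observations exposed by the difference, and deduced_total is the sum of their values, which the two totals alone are enough to recover.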

Value

The output is a data.table or data.frame with five columns:

  1. $id_obs for the observation at risk

  2. $agregat for the aggregate of categories from the z1 nomenclature on which the differentiation is performed

  3. $agregat_size indicating the number of categories composing the aggregate

  4. $nb_obs the number of observations about which information is deduced when the differentiation is computed (nb_obs is strictly less than threshold)

  5. $type_diff the type of differentiation, either "internal" or "external".
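As a sketch of how such a table might be explored, the example below builds a small result by hand using the five column names listed above. The values, the identifiers, and in particular the way categories are encoded in agregat are invented for illustration and may differ from what find_pbm_diff actually returns.

```r
# Hypothetical result, built by hand to mimic the five documented columns.
# The encoding of `agregat` (categories joined by "-") is an assumption.
res_toy <- data.frame(
  id_obs       = c("o1", "o2", "o3", "o3"),
  agregat      = c("A", "A", "B-C", "B-C-D"),
  agregat_size = c(1L, 1L, 2L, 3L),
  nb_obs       = c(2L, 2L, 1L, 1L),
  type_diff    = c("internal", "internal", "external", "external"),
  stringsAsFactors = FALSE
)
threshold <- 5

# Distinct observations flagged at least once.
ids_at_risk <- unique(res_toy$id_obs)

# Number of reported rows per type of differentiation.
n_by_type <- table(res_toy$type_diff)

# Sanity check: every reported difference is below the threshold.
stopifnot(all(res_toy$nb_obs < threshold))
```

An observation can appear in several rows when it is exposed through more than one aggregate, which is why the identifiers are deduplicated before counting.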

Examples

res_diff <- find_pbm_diff(t_ex, threshold = 5, max_agregate_size = 15)


[Package diffman version 0.1.1 Index]