clsTupleFreqs {cdparcoord}        R Documentation

Compute/display tuple frequency counts, and optionally account for NA values


The functions tupleFreqs and discparcoord are the workhorse functions in the package: tupleFreqs calculates the frequency counts used in the graphs, and discparcoord displays them.


    clsTupleFreqs(cls=NULL, dataset, k=5, NAexp=1, countNAs=FALSE)
    discparcoord(data, k=5, grpcategory=NULL, permute=FALSE,
        interactive = TRUE, save=FALSE, name="Parcoords", labelsOff=TRUE,
        NAexp=1.0,countNAs=FALSE, accentuate=NULL, accval=100, inParallel=FALSE,
        cls=NULL, differentiate=FALSE, saveCounts=FALSE, minFreq=NULL)



data: The data, in data frame or matrix form.


k: The number of tuples to return. These will be the k most frequent tuples, unless k is negative, in which case the least-frequent tuples will be returned. Negative k is useful for finding outliers.


grpcategory: Grouping column/variable.


permute: If TRUE, randomly permute the columns before plotting.


interactive: If TRUE, use interactive plotting, which allows the column order to be readjusted interactively and supports scrubbing/brushing.


save: If TRUE and interactive mode is on, the saved plot will be available from the browser.


name: The name for the plot.


labelsOff: If TRUE, labels are turned off. This takes effect only when interactive=FALSE.


NAexp: Exponent applied to the partial counts contributed by tuples with NA values. Values greater than 1.0 reduce the weight of such tuples.


countNAs: If TRUE, count NA values.


accentuate: Character expression specifying the property to accentuate.


accval: Value to accentuate.


inParallel: If TRUE, calculate tuple frequencies in parallel.


differentiate: If TRUE, randomize coloring to differentiate overlapping lines.


saveCounts: If TRUE, save the tuple counts to the file ‘tupleCounts’.


minFreq: The smallest frequency to be displayed.


dataset: The dataset to process, a data frame or data.table.


cls: Cluster to be used if inParallel is TRUE. If inParallel is TRUE and cls is not supplied, the number of cores detected on the calling machine is used by default.


Tuple tabulation is performed by tupleFreqs or, for large datasets, in parallel by clsTupleFreqs. The display is done by discparcoord.
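The split-and-combine idea behind parallel tabulation can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helpers countTuples and combineCounts and the freq column are hypothetical, the data are made up, and lapply stands in for a cluster call such as parallel::parLapply.

```r
# Hypothetical helper: count identical rows (tuples) of a data frame.
countTuples <- function(d) {
  d$freq <- 1
  aggregate(freq ~ ., data = d, FUN = sum)
}

# Hypothetical helper: merge per-chunk count tables by summing the
# counts of tuples that appear in more than one chunk.
combineCounts <- function(parts) {
  allParts <- do.call(rbind, parts)
  aggregate(freq ~ ., data = allParts, FUN = sum)
}

# Made-up toy data with two discrete columns.
d <- data.frame(a = c("x", "x", "y", "x", "y", "x"),
                b = c("u", "u", "v", "u", "u", "u"),
                stringsAsFactors = FALSE)

# Split the rows into two chunks, count within each chunk, then combine.
chunks <- split(d, rep(1:2, length.out = nrow(d)))
partCounts <- lapply(chunks, countTuples)   # serial stand-in for a cluster
total <- combineCounts(partCounts)
print(total)
```

The combined table should agree with counting all rows at once, which is what makes the work safe to distribute across chunks.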

The k most- or least-frequent tuples will be reported, with the latter specified via negative k. Optionally, tuples containing NA values count for less than complete tuples, contributing partial counts to each complete tuple that agrees with them in every non-NA component.
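As a rough illustration of how the sign of k selects the reported tuples, the base-R sketch below picks the k most-frequent rows of a count table for positive k and the |k| least-frequent rows for negative k. The function pickTuples, the freq column, and the toy counts are all hypothetical and are not taken from the package.

```r
# Hypothetical helper: choose rows of a count table according to k.
pickTuples <- function(counts, k = 5) {
  # decreasing order for k > 0 (most frequent first),
  # increasing order for k < 0 (least frequent first)
  ord <- order(counts$freq, decreasing = (k > 0))
  counts[head(ord, abs(k)), , drop = FALSE]
}

# Made-up count table: one row per tuple.
counts <- data.frame(tuple = c("A", "B", "C", "D"),
                     freq  = c(10, 1, 7, 3),
                     stringsAsFactors = FALSE)

top2    <- pickTuples(counts, k = 2)    # the two most frequent tuples
bottom2 <- pickTuples(counts, k = -2)   # the two least frequent tuples
print(top2)
print(bottom2)
```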

If continuous variables are present, then in most cases one should either convert them to discrete form using discretize or use freqparcoord instead.
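To show what converting a continuous variable to discrete form can look like, here is a sketch using base R's cut with equal-width intervals. It is not the package's discretize function and may not partition values the same way; the vector hours and its values are made up, and the three labels simply mirror those used in the Examples.

```r
# Made-up continuous values, e.g. monthly hours.
hours <- c(96, 150, 157, 201, 262, 310)

# cut() with breaks = 3 splits the observed range into
# three equal-width intervals and labels them.
level <- cut(hours, breaks = 3, labels = c("low", "med", "high"))

# Tabulate how many values fall in each interval.
print(table(level))
```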

The data will be converted into a data.table if it is not already in that form. For this and other reasons, it is advantageous to have the data in that form to begin with, say by using data.table::fread to read the data.

Optionally, tuples that match a full tuple pattern except for NA values will add a partial count to the frequency count for the full pattern. If, for instance, the data consist of 8-tuples and a row in the data matches a given 8-tuple pattern in 7 of 8 components, with the remaining component being NA, this row would add a count of 7/8 to the frequency for that pattern. To reduce this weight, use a value greater than 1.0 for NAexp. If that value is 2, for example, the 7/8 increment becomes (7/8)^2 = 49/64.
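The weighting arithmetic described above can be written out in a few lines of base R. This is a sketch of the formula only, assuming the weight is (matched components / tuple length)^NAexp as in the 7/8 example; the function partialWeight is hypothetical and is not part of the package.

```r
# Hypothetical helper: weight that a row with NAs adds to a full pattern.
partialWeight <- function(row, pattern, NAexp = 1) {
  known <- !is.na(row)
  # a row that disagrees with the pattern on a non-NA component adds nothing
  if (!all(row[known] == pattern[known])) return(0)
  (sum(known) / length(pattern))^NAexp
}

# An 8-tuple pattern and a row matching it in 7 of 8 components.
pattern <- c("a", "b", "c", "d", "e", "f", "g", "h")
row <- pattern
row[3] <- NA

w1 <- partialWeight(row, pattern)             # 7/8
w2 <- partialWeight(row, pattern, NAexp = 2)  # (7/8)^2
print(c(w1, w2))
```

Raising NAexp shrinks the contribution of incomplete rows while leaving complete rows at a weight of 1.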


The functions tupleFreqs and clsTupleFreqs return an object of class c('pna','data.frame'), with each row consisting of a tuple and its count. In addition, the object will have attributes k and minFreq.
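Because the returned object also inherits from data.frame, ordinary data frame tools should apply to it. The sketch below builds a mock object by hand purely to show how the class and attributes could be inspected; the column names, the freq column, and the values are made up and do not come from the package.

```r
# Mock object imitating the documented class and attributes (made-up data).
res <- data.frame(Class = c("3rd", "Crew"),
                  Sex   = c("Male", "Male"),
                  freq  = c(5, 4),
                  stringsAsFactors = FALSE)
class(res) <- c("pna", "data.frame")
attr(res, "k") <- 2

# Inspect it as one would inspect the result of a tabulation call.
print(inherits(res, "data.frame"))  # data frame methods are available
print(attr(res, "k"))               # the k that was requested
print(nrow(res))                    # one row per reported tuple
```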

The function discparcoord returns an object of class c('plotly','htmlwidget'). Printing the object causes display of the graph.


Norm Matloff, Vincent Yang, and Harrison Nguyen


   ## Not run: 
       # Find frequencies in parallel
       discparcoord(Titanic, inParallel=TRUE)
## End(Not run)

    ## Not run: 
       input1 = list("name" = "average_montly_hours",
                     "partitions" = 3, "labels" = c("low", "med", "high"))
       input = list(input1)
       # this will discretize the data by partitioning average monthly 
       # hours into 3 parts called low, med, and high
       hrdata = discretize(hrdata, input)
       print('first few discretized tuples')
       head(hrdata)
       # first line should be 0.38,0.53,2,low,3,0,1,00,sales,low
       print('first few most-frequent tuples')
       # assumption: tupleFreqs takes the data as its first argument
       tupleFreqs(hrdata)
       # first line should be 0.40,0.46,2,...,11
       # account for NA values and plot with parallel coordinates
       discparcoord(hrdata)
       # same as above, but with scrambled columns
       discparcoord(hrdata, permute=TRUE)
       # same as above, but show top k values
       discparcoord(hrdata, k=8)
       # same as above, but group according to profession
       discparcoord(hrdata, grpcategory="sales")
## End(Not run)

[Package cdparcoord version 1.0.1 Index]