czek_matrix {RMaCzek}    R Documentation

Preprocess data to produce Czekanowski's Diagram.

Description

Preprocess the data to generate a matrix of class czek_matrix, from which Czekanowski's Diagram can be plotted. This function also offers exact and fuzzy clustering algorithms for Czekanowski's Diagram.

Usage

czek_matrix(
  x,
  order = "OLO",
  n_classes = 5,
  interval_breaks = NULL,
  monitor = FALSE,
  distfun = dist,
  scale_data = TRUE,
  focal_obj = NULL,
  as_dist = FALSE,
  original_diagram = FALSE,
  column_order_stat_grouping = NULL,
  dist_args = list(),
  cluster = FALSE,
  cluster_type = "exact",
  num_cluster = 3,
  sig.lvl = 0.05,
  scale_bandwidth = 0.05,
  min.size = 30,
  eps = 0.01,
  pts = c(1, 5),
  alpha = 0.2,
  theta = 0.9,
  ...
)

Arguments

x

A numeric matrix, data frame or a 'dist' object.

order

Specifies which seriation method should be applied. The standard setting is the seriation method OLO. If NA or NULL, then no seriation is done and the original ordering is kept. The user may also provide their own ordering as a numeric vector of indices. In this case, too, no seriation is performed and the supplied ordering is used.

n_classes

Specifies how many classes the distances should be divided into. The standard setting is 5 classes.

interval_breaks

Specifies the partition boundaries for the distances. As a standard setting, each class contains an equal share of the distances. If the interval breaks are positive and sum up to 1, then it is assumed that they specify the percentage of the distances in each interval. Otherwise, if provided as a numeric vector not summing up to 1, they specify the exact boundaries for the symbols representing distance groups. The string "equal_width_between_classes" (see Examples) requests classes of equal width.

monitor

Specifies if the distribution of the distances should be visualized. The standard setting is that the distribution will not be visualized. The values TRUE and "cumulativ_plot" are available.

distfun

Specifies which distance function should be used. Standard setting is the dist function which uses the Euclidean distance. The first argument of the function has to be the matrix or data frame containing the data.

scale_data

Specifies if the data set should be scaled. The standard setting is that the data will be scaled.

focal_obj

Numbers or names of objects (rows if x is a dataset and not a 'dist' object) that are not to take part in the reordering procedure. These observations will be placed as the last rows and columns of the output matrix. See Details.

as_dist

If TRUE, then the distance matrix of x is returned, with object ordering, instead of the matrix with the levels assigned in place of the original distances.

original_diagram

If TRUE, then the returned matrix corresponds as closely as possible to the original method proposed by Czekanowski (1909). The levels are column specific and not matrix specific. See Details.

column_order_stat_grouping

If original_diagram is TRUE, then here one can pass the partition boundaries for the ranking in each column.

dist_args

Specifies further parameters that can be passed on to the distance function.

cluster

If TRUE, Czekanowski's clustering is performed.

cluster_type

Specifies the cluster type. It can be "exact" or "fuzzy".

num_cluster

Specifies the number of clusters.

sig.lvl

The significance level used to test whether a change point is statistically significant. This value is passed to ecp::e.divisive().

scale_bandwidth

A ratio to control the width of the reaching range.

min.size

Minimum number of observations between change points.

eps

A vector of epsilon values for FDBScan.

pts

A vector of minimum points for FDBScan.

alpha

The weighting factor for density score adjustments.

theta

The weighting factor for density score adjustments.

...

Further parameters that can be passed on to the seriate function in the seriation package.
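
The two readings of interval_breaks described above can be illustrated with a short base R sketch. This is not the code used inside RMaCzek. The helper breaks_from_spec() is hypothetical and only mirrors the documented rule: positive values summing to 1 are read as shares of the distances, anything else as exact boundaries.

```r
# Hypothetical helper, not part of RMaCzek: turns an interval_breaks-style
# specification into cut points for a vector of distances.
breaks_from_spec <- function(d, interval_breaks = NULL, n_classes = 5) {
  d <- as.numeric(d)
  if (is.null(interval_breaks)) {
    # No breaks given: classes holding (roughly) equal shares of the distances.
    probs <- seq(0, 1, length.out = n_classes + 1)
    return(unname(stats::quantile(d, probs)))
  }
  if (all(interval_breaks > 0) && isTRUE(all.equal(sum(interval_breaks), 1))) {
    # Positive values summing to 1: read as the share of distances per class.
    probs <- c(0, cumsum(interval_breaks))
    probs[length(probs)] <- 1  # guard against floating point drift
    return(unname(stats::quantile(d, probs)))
  }
  # Anything else: read as the exact class boundaries.
  interval_breaks
}

d <- stats::dist(scale(mtcars))

# 10%, 40% and 50% of the distances
pct_breaks <- breaks_from_spec(d, c(0.1, 0.4, 0.5))
classes <- cut(as.numeric(d), breaks = pct_breaks, include.lowest = TRUE)
table(classes)

# Exact boundaries are returned unchanged
exact_breaks <- breaks_from_spec(d, c(0, 1, 4, 6, 8.48))
```

The counts printed by table(classes) should be close to the requested shares. How czek_matrix() itself handles ties and the endpoints of the intervals may differ from this sketch.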

Value

The function returns a matrix with class czek_matrix. The returned object is expected to be passed to the plot function if as_dist is FALSE. If as_dist is passed as TRUE, then a czek_matrix object is returned that is not suitable for plotting. The optimized criterion value is returned as an attribute of the output. However, this is a guess based on the manuals of seriation::seriate() and seriation::criterion(). If something else was optimized, e.g. due to the user's parameters, then this value will be wrong. If the function is unable to guess, then NA is saved in the attribute.
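
A minimal sketch of how the returned object might be inspected, assuming RMaCzek is installed. The attribute names "order" and "criterion_value" are the ones used in the Examples below. The call is guarded with requireNamespace() so that the snippet does nothing when the package is missing.

```r
# Inspect the class and two attributes of a czek_matrix result.
if (requireNamespace("RMaCzek", quietly = TRUE)) {
  res <- RMaCzek::czek_matrix(mtcars)

  # The result is a matrix carrying the class czek_matrix.
  print(class(res))

  # Ordering of the objects found by the seriation method.
  print(attr(res, "order"))

  # Guessed value of the optimized criterion (NA if it could not be guessed).
  print(attr(res, "criterion_value"))
}
```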

Examples

# Set data ####
x<-mtcars

# Different types of input that give the same result ############
czek_matrix(x)
czek_matrix(stats::dist(scale(x)))
## Not run: 
## below a number of other options are shown
## but they take too long to run

# Change seriation method ############
#seriation::show_seriation_methods("dist")
czek_matrix(x,order = "GW")
czek_matrix(x,order = "ga")
czek_matrix(x,order = sample(1:nrow(x)))

# Change number of classes ############
czek_matrix(x,n_classes = 3)

# Change the partition boundaries ############

#10%, 40% and 50%
czek_matrix(x,interval_breaks = c(0.1,0.4,0.5))

#[0,1] (1,4] (4,6] (6,8.48]
czek_matrix(x,interval_breaks = c(0,1,4,6,8.48))

#[0,1.7] (1.7,3.39]  (3.39,5.09] (5.09,6.78] (6.78,8.48]
czek_matrix(x,interval_breaks = "equal_width_between_classes")

# Monitor the distribution of the distances ############
czek_matrix(x,monitor = TRUE)
czek_matrix(x,monitor = "cumulativ_plot")

# Change distance function ############
czek_matrix(x,distfun = function(x) stats::dist(x,method = "manhattan"))

# Do not scale the data ############
czek_matrix(x,scale_data = FALSE)
czek_matrix(stats::dist(x))

# Change additional settings to the seriation method ############
czek_matrix(x,order="ga",control=list(popSize=200, suggestions=c("SPIN_STS","QAP_2SUM")))

# Create matrix as originally described by Czekanowski (1909), with each column
# assigned levels according to how the order statistics of the distances in it
# are grouped. The grouping below is the one used by Czekanowski (1909).
czek_matrix(x,original_diagram=TRUE,column_order_stat_grouping=c(3,4,5,6))

# Create matrix with two focal objects that will not influence seriation
czek_matrix(x,focal_obj=c("Merc 280","Merc 450SL"))
# Same results but with object indices
czek_res<-czek_matrix(x,focal_obj=c(10,13))

# we now place the two objects in a new place
czek_res_neworder<-manual_reorder(czek_res,c(1:10,31,11:20,32,21:30), orig_data=x)

# the same can be alternatively done by hand
attr(czek_res,"order")<-attr(czek_res,"order")[c(1:10,31,11:20,32,21:30)]
# and then correct the values of the different criteria so that they
# are consistent with the new ordering
attr(czek_res,"Path_length")<-seriation::criterion(stats::dist(scale(x)),
order=seriation::ser_permutation(attr(czek_res, "order")),
method="Path_length")

# Here we need to know what criterion was used for the seriation procedure.
# If the seriation package was used, then see the manuals for seriation::seriate()
# and seriation::criterion().
# If the genetic algorithm shipped with RMaCzek was used, then it was the Um factor.
attr(czek_res,"criterion_value")<-seriation::criterion(stats::dist(scale(x)),
order=seriation::ser_permutation(attr(czek_res, "order")),method="Path_length")
attr(czek_res,"Um")<-RMaCzek::Um_factor(stats::dist(scale(x)),
order= attr(czek_res, "order"), inverse_um=FALSE)
# Czekanowski's Clusterings ############
# Exact Clustering
czek_exact = czek_matrix(x, order = "GW", cluster = TRUE, num_cluster = 2, min.size = 2)
plot(czek_exact)
attr(czek_exact, "cluster_type") # To get the clustering type.
attr(czek_exact, "cluster_res") # To get the clustering suggestion.
attr(czek_exact, "membership") # To get the membership matrix

# Fuzzy Clustering
czek_fuzzy = czek_matrix(x, order = "OLO", cluster = TRUE, num_cluster = 2,
cluster_type = "fuzzy", min.size = 2, scale_bandwidth = 0.2)
plot(czek_fuzzy)
attr(czek_fuzzy, "cluster_type") # To get the clustering type.
attr(czek_fuzzy, "cluster_res") # To get the clustering suggestion.
attr(czek_fuzzy, "membership") # To get the membership matrix

## End(Not run)


[Package RMaCzek version 1.6.0 Index]