CHICKN_W1 {chickn} R Documentation

## Chromatogram Hierarchical Compressive K-means with Nystrom approximation

### Description

An implementation of the complete pipeline of the CHICKN algorithm.

### Usage

CHICKN_W1(
  Data,
  K = 2,
  k_total,
  K_W1 = NULL,
  kernel_type = "Gaussian",
  distance_type = "W1",
  Freq = NULL,
  ncores = 2,
  max_neighbors = 32,
  nblocks = 64,
  N0 = 10000,
  max_Nsize = 32,
  DoPreimage = FALSE,
  DIR_output = tempfile(),
  DIR_tmp = tempfile(),
  BIG = FALSE,
  verbose = FALSE,
  ...
)


### Arguments

• Data: A Filebacked Big Matrix n x N.

• K: Number of clusters at each call of the clustering method. Default is 2.

• k_total: An upper bound on the total number of clusters.

• K_W1: A Filebacked Big Matrix. The Nystrom kernel matrix s x N, where N is the number of signals in the training collection and s is the Nystrom sample size. Default is NULL, in which case it is generated by the Nystrom_kernel function.

• kernel_type: Kernel function type, one of c('Gaussian', 'Laplacian').

• distance_type: Distance function type. The available types are Wasserstein-1 ('W1') and Euclidean ('Euclide'). Default is 'W1'.

• Freq: A frequency matrix m x n with frequency vectors in rows. If NULL, the frequency vectors are generated by the GenerateFrequencies function.

• ncores: Number of cores. Default is 2.

• max_neighbors: Number of neighbors used to estimate the kernel parameter gamma. Default is 32.

• nblocks: Number of blocks on which the regression is performed. Default is 64.

• N0: Number of data vectors used for the variance estimation in EstimSigma.

• max_Nsize: Number of neighbors used to compute the consensus chromatograms.

• DoPreimage: Logical that controls whether the consensus chromatograms are computed. Default is FALSE.

• DIR_output: A directory in which to save the results.

• DIR_tmp: A directory for temporary files.

• BIG: Logical that controls whether the resulting consensus chromatograms are stored as a Filebacked Big Matrix ('Centroid_preimage.bk'). Default is FALSE.

• verbose: Logical that indicates whether to display the processing steps.

• ...: Additional arguments passed on to COMPR.
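To make the distance_type and kernel_type options more concrete, the following base-R sketch compares the Wasserstein-1 and Euclidean distances on two toy signals. It is only an illustration under stated assumptions and not the package's implementation: the helper names (w1_distance, euclidean_distance, gaussian_kernel, laplacian_kernel), the normalisation of each signal to unit mass, and the kernel forms exp(-gamma * d^2) and exp(-gamma * d) are all assumptions.

```r
# Wasserstein-1 distance between two non-negative signals sampled on the same
# unit-spaced grid: normalise each to unit mass, then sum the absolute
# differences of their cumulative sums. (The normalisation is an assumption.)
w1_distance <- function(x, y) {
  x <- x / sum(x)
  y <- y / sum(y)
  sum(abs(cumsum(x) - cumsum(y)))
}

euclidean_distance <- function(x, y) sqrt(sum((x - y)^2))

# Assumed kernel forms, not taken from the package source.
gaussian_kernel  <- function(d, gamma) exp(-gamma * d^2)
laplacian_kernel <- function(d, gamma) exp(-gamma * d)

# Two identical toy peaks; the second is shifted right by 3 grid points.
x <- c(0, 1, 2, 1, 0, 0, 0, 0)
y <- c(0, 0, 0, 0, 1, 2, 1, 0)

d_w1  <- w1_distance(x, y)         # grows with the size of the shift
d_euc <- euclidean_distance(x, y)  # same for any shift that separates the peaks

k_gauss <- gaussian_kernel(d_w1, gamma = 0.1)
k_lap   <- laplacian_kernel(d_w1, gamma = 0.1)
```

In this toy case the Wasserstein-1 distance equals the shift between the peaks, whereas the Euclidean distance stops changing once the peaks no longer overlap. This is one reason a transport-based distance can suit elution profiles that differ mainly in retention time.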

### Details

CHICKN_W1 compresses the data by computing a Nystrom kernel approximation and applying the sketching operator of Keriven et al. (2018); see the Nystrom_kernel and Sketch functions. Clusters are then recovered by operating on the compressed data. The kernel function can be based on either the Wasserstein-1 or the Euclidean distance. The following files are written to the DIR_output directory:

• 'Cluster_assign_out.bk' is a Filebacked Big Matrix N x (maxLevel + 1), which stores the cluster assignment at each hierarchical level.

• 'Centroids_out.bk' is a Filebacked Big Matrix with the resulting cluster centroids in columns.
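The compression step can be pictured with a textbook Nystrom approximation. The base-R sketch below is an illustration only, under stated assumptions: it uses a Gaussian kernel on the Euclidean distance for brevity, random toy data, and hypothetical names (sq_dist, K_sN, K_ss, Z). It does not reproduce the internals of Nystrom_kernel, but it yields an s x N matrix of the same shape as K_W1.

```r
set.seed(1)
n <- 20    # signal length
N <- 200   # number of signals
s <- 15    # Nystrom sample size
X <- matrix(abs(rnorm(n * N)), nrow = n, ncol = N)  # toy signals in columns

gamma <- 0.05  # kernel parameter, fixed by hand for this toy example

# Squared Euclidean distances between the columns of A and the columns of B.
sq_dist <- function(A, B) {
  outer(colSums(A^2), colSums(B^2), "+") - 2 * crossprod(A, B)
}

idx  <- sample(N, s)                        # landmark signals
K_sN <- exp(-gamma * sq_dist(X[, idx], X))  # s x N kernel block
K_ss <- K_sN[, idx]                         # s x s kernel among the landmarks

# Symmetric inverse square root of K_ss via an eigendecomposition,
# dropping numerically zero eigenvalues.
eig  <- eigen((K_ss + t(K_ss)) / 2, symmetric = TRUE)
keep <- eig$values > 1e-10
V    <- eig$vectors[, keep, drop = FALSE]
W    <- V %*% (t(V) / sqrt(eig$values[keep]))

# Compressed representation: crossprod(Z) approximates the full N x N kernel.
Z <- W %*% K_sN
```

Here crossprod(Z) reproduces the kernel exactly on the landmark signals and approximates it elsewhere, so later steps can work with an s x N matrix instead of an N x N one.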

### Value

A list with the following elements:

• gamma is the estimated kernel parameter.

• CompressedData is the Nystrom kernel matrix.

• sigma is the estimated variance.

• Frequency is the frequency matrix m x n.

• Clusters is the cluster assignment.
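A returned cluster assignment can be summarised with a few lines of base R. The sketch below is hypothetical: the helper cluster_sizes and the toy assignment matrix (one column per hierarchical level, integer labels) are assumptions made for illustration, and the actual label encoding used by the package may differ.

```r
# Count the members of each cluster, level by level.
# Accepts a plain vector (one level) or a matrix with one column per level.
cluster_sizes <- function(clusters) {
  clusters <- as.matrix(clusters)  # a vector becomes a one-column matrix
  lapply(seq_len(ncol(clusters)), function(j) table(clusters[, j]))
}

# Toy assignment for 6 signals over 3 hierarchical levels (made-up labels).
toy <- cbind(
  level0 = c(1, 1, 1, 1, 1, 1),
  level1 = c(1, 1, 1, 2, 2, 2),
  level2 = c(1, 1, 2, 3, 3, 4)
)

sizes <- cluster_sizes(toy)
```

Each element of sizes is a table of cluster sizes for one level, which gives a quick check of how evenly the signals are split as the hierarchy deepens.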

### References

• Permiakova O, Guibert R, Kraut A, Fortin T, Hesse AM, Burger T (2020) "CHICKN: Extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis." BMC Bioinformatics (under revision).

### See Also

Nystrom_kernel, GenerateFrequencies, hcc_parallel, Preimage, bigstatsr

### Examples


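# Load the example data set shipped with the package
data("UPS2")
N <- ncol(UPS2)  # number of signals (columns)
n <- nrow(UPS2)  # signal length (rows)
# Copy the data into a Filebacked Big Matrix and save it to disk
X_FBM <- bigstatsr::FBM(init = UPS2, ncol = N, nrow = n)$save()
output <- CHICKN_W1(Data = X_FBM, K = 2, k_total = 8, max_neighbors = 10, ncores = 2,
                    N0 = N, DoPreimage = FALSE)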



[Package chickn version 1.2.3 Index]