CHICKN_W1 {chickn} | R Documentation |
An implementation of the complete pipeline of the CHICKN algorithm.
CHICKN_W1(Data, K = 2, k_total, K_W1 = NULL, kernel_type = "Gaussian",
  distance_type = "W1", Freq = NULL, ncores = 2, max_neighbors = 32,
  nblocks = 64, N0 = 10000, max_Nsize = 32, DoPreimage = FALSE,
  DIR_output = tempfile(), DIR_tmp = tempfile(), BIG = FALSE,
  verbose = FALSE, ...)
Data |
A Filebacked Big Matrix n x N. |
K |
Number of clusters produced at each call of the clustering method. Default is 2. |
k_total |
An upper bound of the total number of clusters. |
K_W1 |
A Filebacked Big Matrix holding the s x N Nystrom kernel matrix,
where N is the number of signals in the training collection and s is the Nystrom sample size.
The default is NULL, in which case it is generated using |
kernel_type |
Kernel function type c('Gaussian', 'Laplacian'). |
distance_type |
Distance function type. The available types are Wasserstein-1 ('W1') and Euclidean ('Euclide'). The default value is 'W1'. |
Freq |
A frequency matrix m x n with frequency vectors in rows.
If NULL, the frequency vectors are generated by |
ncores |
Number of cores. Default is 2. |
max_neighbors |
Number of neighbors used to estimate the kernel parameter. Default is 32. |
nblocks |
Number of blocks on which the regression is performed. Default is 64. |
N0 |
Number of data vectors used for the variance estimation in |
max_Nsize |
Number of neighbors used to compute consensus chromatograms. |
DoPreimage |
logical that controls whether to compute the consensus chromatograms. Default is FALSE. |
DIR_output |
A directory to save the results. |
DIR_tmp |
A directory for temporary files. |
BIG |
logical parameter that controls whether the resulting consensus chromatograms are stored as a Filebacked Big Matrix ('Centroid_preimage.bk'). Default is FALSE. |
verbose |
logical that indicates whether to display the processing steps. Default is FALSE. |
... |
Additional arguments passed on to |
CHICKN_W1 compresses the data by computing a Nystrom kernel approximation and applying the sketching operator from (Keriven et al. 2018). See the Nystrom_kernel and Sketch functions.
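The two compression steps can be illustrated with a minimal base R sketch. It is not the package's implementation of Nystrom_kernel or Sketch: the toy data, the value of gamma, the Gaussian frequency draw, and the helper names w1_dist and gaussian_kernel are all assumptions made for illustration. It also assumes 1-D signals normalised to unit mass on a unit-spaced grid, for which the Wasserstein-1 distance equals the L1 distance between cumulative sums, and it applies the sketch to the s x N kernel block, which may differ from what the package actually sketches.

```r
set.seed(42)
n <- 20      # signal length
N <- 50      # number of signals
s <- 5       # Nystrom sample size
gamma <- 1   # kernel parameter (hypothetical value, not an estimate)

# Toy signals in columns, normalised to unit mass.
X <- matrix(runif(n * N), nrow = n, ncol = N)
X <- sweep(X, 2, colSums(X), "/")

# W1 between two 1-D unit-mass signals on a unit-spaced grid:
# the L1 distance between their cumulative sums.
w1_dist <- function(p, q) sum(abs(cumsum(p) - cumsum(q)))

# Gaussian kernel applied to a distance value (assumed parameterisation).
gaussian_kernel <- function(d, gamma) exp(-gamma * d^2)

# s x N kernel block between s sampled signals and all N signals.
landmarks <- sample.int(N, s)
K_sN <- matrix(0, nrow = s, ncol = N)
for (j in seq_len(s)) {
  for (i in seq_len(N)) {
    K_sN[j, i] <- gaussian_kernel(w1_dist(X[, landmarks[j]], X[, i]), gamma)
  }
}

# Sketch: empirical average of complex exponentials over the N columns,
# one entry per frequency vector (rows of Freq, drawn here from a
# standard Gaussian as an assumption).
m <- 8
Freq <- matrix(rnorm(m * s), nrow = m, ncol = s)
z <- rowMeans(exp(-1i * (Freq %*% K_sN)))
```

Here K_sN has one row per sampled signal and one column per signal in the collection, and z is a complex vector of length m that summarises the whole collection regardless of N.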
The clusters are then recovered from the compressed version of the data.
The kernel function can be based on either the Wasserstein-1 or the Euclidean distance.
The following files are generated in the DIR_output directory:
'Cluster_assign_out.bk' is a Filebacked Big Matrix N x (maxLevel + 1), which stores the cluster assignment at each hierarchical level.
'Centroids_out.bk' is a Filebacked Big Matrix with the resulting cluster centroids in columns.
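The roles of K, k_total, and the per-level assignment matrix can be illustrated with a toy divisive clustering in base R. This is only a sketch under stated assumptions: stats::kmeans stands in for the compressive clustering step, the toy data and the cluster numbering scheme are hypothetical, and nothing here reads or reproduces the actual contents of 'Cluster_assign_out.bk'.

```r
set.seed(1)
# Toy data with points in columns (d x N): three well-separated groups.
X <- cbind(matrix(rnorm(40, mean = 0), nrow = 2),
           matrix(rnorm(40, mean = 10), nrow = 2),
           matrix(rnorm(40, mean = 20), nrow = 2))
N <- ncol(X)
K <- 2         # clusters produced by each clustering call
k_total <- 4   # upper bound on the total number of clusters

# Column 1 is level 0: every point belongs to a single cluster.
assign <- matrix(1L, nrow = N, ncol = 1)
repeat {
  current <- assign[, ncol(assign)]
  nxt <- current
  n_clusters <- length(unique(current))
  for (id in unique(current)) {
    members <- which(current == id)
    # Skip the split if it would exceed k_total or the cluster is too small.
    if (n_clusters + K - 1 > k_total || length(members) <= K) next
    km <- kmeans(t(X[, members, drop = FALSE]), centers = K)
    # The first sub-cluster keeps the parent id, the others get fresh ids.
    new_ids <- c(id, max(nxt) + seq_len(K - 1))
    nxt[members] <- new_ids[km$cluster]
    n_clusters <- n_clusters + K - 1
  }
  if (identical(nxt, current)) break   # no cluster was split: stop
  assign <- cbind(assign, nxt, deparse.level = 0)
}

# One row per point, one column per hierarchical level (level 0 included).
dim(assign)
table(assign[, ncol(assign)])
```

Each pass splits every eligible cluster into K parts, so the number of clusters grows level by level until another split would exceed k_total. The final count can therefore be smaller than k_total.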
A list with the following attributes:
gamma is the estimated kernel parameter.
CompressedData is the Nystrom kernel matrix.
sigma is the estimated variance.
Frequency is the frequency matrix m x n.
Clusters is the cluster assignment.
Permiakova O, Guibert R, Kraut A, Fortin T, Hesse AM, Burger T (2020) "CHICKN: Extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis." BMC Bioinformatics (under revision).
Nystrom_kernel, GenerateFrequencies, hcc_parallel, Preimage, bigstatsr
data("UPS2")
N = ncol(UPS2)   # number of signals (columns)
n = nrow(UPS2)   # signal length (rows)
# Store the data as a Filebacked Big Matrix.
X_FBM = bigstatsr::FBM(init = UPS2, ncol = N, nrow = n)$save()
output <- CHICKN_W1(Data = X_FBM, K = 2, k_total = 8, max_neighbors = 10,
                    ncores = 2, N0 = N, DoPreimage = FALSE)