CalcAndWriteDissimilarityMatrix {parallelpam} R Documentation

CalcAndWriteDissimilarityMatrix

Description

Writes a binary symmetric matrix containing the dissimilarities between the ROWS of the data stored in a binary matrix in the jmatrix/parallelpam package format.
The input matrix of vectors can be full or sparse; the algorithm has been adapted to run faster on sparse matrices.
The output matrix can be of type float or double (but see the comments in 'Details').

Usage

CalcAndWriteDissimilarityMatrix(
  ifname,
  ofname,
  distype = "L2",
  restype = "float",
  comment = "",
  nthreads = 0L
)

Arguments

ifname

A string with the name of the file containing the input data as a binary matrix.

ofname

A string with the name of the binary output file to contain the symmetric dissimilarity matrix.

distype

The dissimilarity to be calculated. It must be one of these strings: 'L1', 'L2', 'Pearson', 'Cos' or 'WEuc'.
Respectively: L1 (Manhattan), L2 (Euclidean), Pearson (Pearson dissimilarity), Cos (cosine distance), WEuc (weighted Euclidean, with inverse-stdevs as weights).
Default: 'L2'.
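
As a rough illustration of the first two options, the sketch below computes the L1 and L2 dissimilarities between two short vectors in base R and checks them against stats::dist(). The vectors x and y are made-up stand-ins for two rows of the input matrix. This is not the package's internal code, and the exact conventions behind 'Pearson', 'Cos' and 'WEuc' are not reproduced here.

```r
# Two hypothetical vectors standing in for two rows of the input matrix
x <- c(1, 2, 3)
y <- c(2, 4, 7)

# L1 (Manhattan): sum of absolute coordinate differences
l1 <- sum(abs(x - y))

# L2 (Euclidean): square root of the sum of squared differences
l2 <- sqrt(sum((x - y)^2))

# The same quantities obtained from base R's dist(), which also works on rows
m <- rbind(x, y)
d1 <- as.numeric(dist(m, method = "manhattan"))
d2 <- as.numeric(dist(m, method = "euclidean"))

print(c(L1 = l1, L2 = l2))
```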

restype

The data type of the result. It must be one of the strings 'float' or 'double'. Default: 'float' (do not change it unless you really need to).

comment

Comment to be added to the dissimilarity matrix. Default: "" (no comment).

nthreads

Number of threads to be used for the parallel calculations, with this meaning:
-1: don't use threads.
0: let the function choose according to the number of rows and the number of available cores.
Any positive number > 1: use that number of threads. You can request more threads than available cores, but this is discouraged and raises a warning.
Default: 0.
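
If you prefer to set nthreads explicitly rather than pass 0, one possible approach is to cap the requested number at the number of available cores, so the warning mentioned above is avoided. The helper pick_nthreads below is hypothetical (it is not part of parallelpam), and falling back to -1 when only one core is available is an assumption of this sketch.

```r
# Hypothetical helper: cap a requested thread count at the available cores.
# 'cores' defaults to what parallel::detectCores() reports (may be NA).
pick_nthreads <- function(requested, cores = parallel::detectCores()) {
  if (is.na(cores)) cores <- 1L
  n <- min(as.integer(requested), as.integer(cores))
  # Assumption of this sketch: with fewer than 2 usable threads,
  # fall back to -1 ("don't use threads")
  if (n > 1L) n else -1L
}

pick_nthreads(8)
```

The returned value could then be passed as the nthreads argument.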

Details

The parameter restype forces the output to be a matrix of either floats or doubles. Float precision is normally good enough; if you need double precision (perhaps because you expect your results to span a large range, two to three orders of magnitude), change it.
Note, however, that this comes at the expense of doubling the memory usage, which is QUADRATIC in the number of individuals (rows) of your input matrix.
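
To get a feeling for the quadratic growth, the following sketch estimates the size of the result for a given number of rows. It assumes that a symmetric matrix is stored as its lower triangle including the diagonal, that is n(n+1)/2 values, with 4 bytes per float and 8 bytes per double, and it ignores any file header or comment. The function dissim_bytes is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: approximate size in bytes of an n x n symmetric
# dissimilarity matrix, assuming only n*(n+1)/2 values are stored
dissim_bytes <- function(n, restype = c("float", "double")) {
  restype <- match.arg(restype)
  bytes_per_value <- if (restype == "float") 4 else 8
  n * (n + 1) / 2 * bytes_per_value
}

# Approximate sizes in gigabytes for 100000 rows
dissim_bytes(1e5, "float") / 1e9
dissim_bytes(1e5, "double") / 1e9
```

Under these assumptions, doubling the number of rows roughly quadruples the size, and switching from float to double doubles it again.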

Value

No return value, called for side effects (creates a file)

Examples

# Full matrix of uniform random values: 100 rows, 500 columns
Rf <- matrix(runif(50000),nrow=100)
tmpfile1=paste0(tempdir(),"/Rfullfloat.bin")
JWriteBin(Rf,tmpfile1,dtype="float",dmtype="full",
          comment="Full matrix of floats, 100 rows, 500 columns")
JMatInfo(tmpfile1)
tmpdisfile1=paste0(tempdir(),"/RfullfloatDis.bin")
# Distance file calculated from the matrix stored as full
CalcAndWriteDissimilarityMatrix(tmpfile1,tmpdisfile1,distype="L2",
                         restype="float",comment="L2 distance matrix from full",nthreads=0)
JMatInfo(tmpdisfile1)
tmpfile2=paste0(tempdir(),"/Rsparsefloat.bin")
JWriteBin(Rf,tmpfile2,dtype="float",dmtype="sparse",
                         comment="Sparse matrix of floats, 100 rows, 500 columns")
JMatInfo(tmpfile2)
# Distance file calculated from the matrix stored as sparse
tmpdisfile2=paste0(tempdir(),"/RsparsefloatDis.bin")
CalcAndWriteDissimilarityMatrix(tmpfile2,tmpdisfile2,distype="L2",
                         restype="float",comment="L2 distance matrix from sparse",nthreads=0)
JMatInfo(tmpdisfile2)
# Read both versions
Dfu<-GetJManyRows(tmpdisfile1,seq_len(nrow(Rf)))
Dsp<-GetJManyRows(tmpdisfile2,seq_len(nrow(Rf)))
# and compare them: the largest absolute difference should be zero or negligible
max(abs(Dfu-Dsp))

[Package parallelpam version 1.4.3 Index]