feemsplithalf {albatross} | R Documentation |
Split-half analysis of PARAFAC models
Description
This function validates PARAFAC with different numbers of components by means of splitting the data cube in halves, fitting PARAFAC to them and comparing the results (DeSarbo 1984).
Usage
feemsplithalf(
  cube, nfac, splits, random, groups, fixed, ..., progress = TRUE
)
## S3 method for class 'feemsplithalf'
plot(
  x, kind = c('tcc', 'factors', 'aggtcc', 'bandfactors'), ...
)
## S3 method for class 'feemsplithalf'
print(x, ...)
## S3 method for class 'feemsplithalf'
coef(
  object, kind = c('tcc', 'factors', 'aggtcc', 'bandfactors'), ...
)
Arguments
cube |
A feemcube object. |
nfac |
An integer vector of numbers of factors to check. |
splits |
A scalar or a two-element vector consisting of whole numbers. The first element is the number of parts to split the data cube into, which must be even. After splitting, the parts are recombined into non-intersecting halves (Murphy, Stedmon, Graeber, and Bro 2013), which are subjected to PARAFAC decomposition and compared against each other. The second element, if specified, limits the total number of comparisons between the pairs, since the number of potential ways to recombine the parts of the data cube into halves grows very quickly. The number of PARAFAC models fitted is
Mutually incompatible with the parameters |
random |
Number of times to shuffle the dataset, split into non-intersecting halves, fit a PARAFAC model to each of the halves and compare halves against each other (Krylov, Drozdova, and Labutin 2020). The number of PARAFAC models fitted is
Mutually incompatible with the parameters |
groups |
Use this argument to preserve the ratios between the groups present
in the original dataset when constructing the halves. If specified,
must be a factor or an integer vector of length For the split-combine method ( Mutually incompatible with the |
fixed |
Use this argument to manually specify the contents of the halves to test. The argument must be a list containing two-element lists specifying the halves to compare. Each half must be a vector consisting of whole numbers specifying sample indices in the cube (see the example). It is considered an error to specify a sample in both halves. Mutually incompatible with the parameters |
progress |
Set to FALSE to disable the progress bar. |
x, object |
An object returned by feemsplithalf. |
kind |
Chooses what type of data to return or plot:
|
... |
|
Details
As the models (loadings $\mathbf A$, $\mathbf B$ and scores $\mathbf C$) are fitted, they are compared to the first model of the same number of factors (Tucker's congruence coefficient is calculated using congru for emission and excitation mode factors, then the smallest value of the two is chosen for the purposes of matching). The models are first reordered according to the best match by TCC value, then rescaled (Riu and Bro 2003) by minimising $\| \mathbf A \, \mathrm{diag}(\mathbf s_\mathrm A) - \mathbf A^\mathrm{orig} \|^2$ and $\| \mathbf B \, \mathrm{diag}(\mathbf s_\mathrm B) - \mathbf B^\mathrm{orig} \|^2$ over $\mathbf s_\mathrm A$ and $\mathbf s_\mathrm B$, subject to $\mathrm{diag}(\mathbf s_\mathrm A) \times \mathrm{diag}(\mathbf s_\mathrm B) \times \mathrm{diag}(\mathbf s_\mathrm C) = \mathbf I$, to make them comparable.
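The matching and rescaling step can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper tcc(), the synthetic loadings and the greedy row-wise matching are assumptions made for illustration, and the two least-squares problems are assumed to be solved column by column, with the score scaling absorbing the inverse so that the product of the three scalings stays the identity.

```r
# Tucker's congruence coefficient between all column pairs of X and Y
# (hypothetical helper; the package itself relies on congru)
tcc <- function(X, Y) {
  crossprod(X, Y) / sqrt(outer(colSums(X^2), colSums(Y^2)))
}

# made-up reference loadings with two components
A.orig <- cbind(dnorm(1:30, 10, 3), dnorm(1:30, 20, 4))  # "emission"
B.orig <- cbind(dnorm(1:20, 5, 2), dnorm(1:20, 14, 3))   # "excitation"

# a second model: same components, but swapped and arbitrarily scaled
A <- A.orig[, 2:1] %*% diag(c(2, 0.5))
B <- B.orig[, 2:1] %*% diag(c(3, 0.1))

# for matching, use the smaller of the two mode-wise TCC values
match.tcc <- pmin(tcc(A.orig, A), tcc(B.orig, B))

# greedy matching: best candidate for every reference component
# (a simplification; it does not guard against duplicate assignments)
perm <- apply(match.tcc, 1, which.max)
A <- A[, perm]
B <- B[, perm]

# column-wise least-squares scale factors for A and B
s.A <- colSums(A * A.orig) / colSums(A^2)
s.B <- colSums(B * B.orig) / colSums(B^2)
# the scores take the inverse scaling, so the product is the identity
s.C <- 1 / (s.A * s.B)

A.scaled <- A %*% diag(s.A)
B.scaled <- B %*% diag(s.B)
```

After this step A.scaled and B.scaled are directly comparable to the reference loadings, which is what makes overlaying the halves in a plot meaningful.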
To perform stratified sampling on a real-valued variable (e.g. salinity, depth), consider binning samples into groups using cut, perhaps after histogram flattening using ecdf(x)(x). To determine the number of breaks, consider nclass.Sturges.
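A minimal sketch of such binning, using only base R. The salinity vector is made-up data standing in for one measured value per sample, and handing the resulting factor to the groups argument is shown only as a comment.

```r
set.seed(42)
# hypothetical real-valued covariate, one value per sample
salinity <- c(rexp(40, 1/5), runif(20, 25, 35))

# histogram flattening: replace each value by its empirical CDF value
flat <- ecdf(salinity)(salinity)

# number of bins by Sturges' rule, then cut into equal-width bins
nbins <- nclass.Sturges(flat)
groups <- cut(flat, breaks = nbins)

# how many samples ended up in each group
table(groups)

# the factor could then be supplied as feemsplithalf(..., groups = groups)
```

Flattening first makes equal-width bins hold roughly equal numbers of samples, which avoids nearly empty groups when the variable is strongly skewed.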
To conserve memory, feemsplithalf puts the user-provided cube in an environment and passes it via envir and subset options of feemparafac. This means that, unlike in feemparafac, the cube argument has to be a feemcube object and passing envir and subset options to feemsplithalf is not supported.
Instead of forwarding the arguments parallel, cl to multiway::parafac, feemsplithalf schedules the calls to feemparafac on the cluster by itself. This makes it possible to fit more than nstart models at the same time if enough nodes are present in the parallel cluster cl.
plot.feemsplithalf plots results of the split-half procedure (TCC or loading values depending on the kind argument) using lattice graphics. Sane defaults are provided for xyplot parameters xlab, ylab, as.table, but they can be overridden.
print.feemsplithalf displays a very short summary of the analysis, currently the minimum TCC value for each number of components.
coef.feemsplithalf returns the results of the split-half analysis as a data.frame: Tucker's congruence coefficients or loading values, depending on the kind argument.
Value
- feemsplithalf, print.feemsplithalf
  An object of class feemsplithalf, containing named fields:
  - factors
    A list of feemparafac objects containing the factors of the halves. The list has dimensions, the first one corresponding to the halves (always 2), the second to different numbers of factors (as many as in nfac) and the third to different groupings of the samples (depends on splits or random).
  - tcc
    A named list containing arrays of Tucker's congruence coefficients between the halves. Each entry in the list corresponds to an element in the nfac argument. The dimensions of each array in the list correspond to, in order: the factors (1 to nfac[i]), the modes (emission or excitation) and the groupings of the samples (depending on splits or random).
  - nfac
    A copy of the nfac argument.
- plot.feemsplithalf
  A lattice plot object. Its print or plot method will draw the plot on an appropriate plotting device.
- coef.feemsplithalf
  A data.frame containing various columns, depending on the value of the kind argument:
  - tcc
    - factor
      The factor (out of nfac) under consideration.
    - tcc
      Tucker's congruence coefficient between a pair of matching components. Out of two possible values (TCC between excitation loadings or emission loadings), the minimal one is chosen, because the same rule is used to find which components match when reordering them in a pair of models.
    - test
      The sequence number for each pair of models in the split-half test, related to the third dimension of object$factors or object$tcc. May be used to group values for plotting or aggregation.
    - subset
      Consists of two-element lists containing indices of the samples in each half of the original cube.
    - nfac
      The number of factors in the pair of models under consideration.
  - factors
    - wavelength
      Emission and excitation wavelengths.
    - value
      The values of the loadings.
    - factor
      Number of the factor, 1 to nfac.
    - mode
      The mode the loading value belongs to, “Emission” or “Excitation”.
    - nfac
      Total number of factors.
    - test
      Sequence number of a split-half test, indicating a given way to split the dataset in a group of splits with the same numbers of factors.
    - half
      Number of the half, 1 or 2.
    - subset
      For every row, this is an integer vector indicating the subset of the original data cube that the loadings have been obtained from.
  - aggtcc
    The columns tcc, nfac, test after aggregation of coef(kind = 'tcc').
  - bandfactors
    Columns wavelength, factor, mode, nfac from coef(kind = 'factors'), plus columns lower, estimate, upper signifying the outputs from the aggregation function.
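As an illustration of how the tcc, test and nfac columns can be used for grouping, here is a minimal base R sketch. The data frame tcc.df is a made-up stand-in for the result of coef(kind = 'tcc'): it mocks only the columns factor, tcc, test and nfac (subset is omitted), and the numbers are invented.

```r
# hypothetical stand-in for coef(sh, 'tcc'); values are invented
tcc.df <- data.frame(
  nfac   = c(2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  test   = c(1, 1, 2, 2, 1, 1, 1, 2, 2, 2),
  factor = c(1, 2, 1, 2, 1, 2, 3, 1, 2, 3),
  tcc    = c(.99, .98, .97, .99, .99, .95, .60, .98, .96, .72)
)

# worst-matching component within each pair of models
per.test <- aggregate(tcc ~ nfac + test, data = tcc.df, FUN = min)

# minimum TCC for each number of components,
# the kind of summary that print.feemsplithalf is described as showing
per.nfac <- aggregate(tcc ~ nfac, data = tcc.df, FUN = min)
per.nfac
```

In this invented example the three-component models contain a component that matches poorly between halves, which would argue against accepting three components.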
References
DeSarbo WS (1984). “An Application of PARAFAC to a Small Sample Problem, Demonstrating Preprocessing, Orthogonality Constraints, and Split-Half Diagnostic Techniques (Appendix).” Research Methods for Multimode Data Analysis, 602-642. https://papers.ssrn.com/abstract=2783446.
Krylov I, Drozdova A, Labutin T (2020). “Albatross R package to study PARAFAC components of DOM fluorescence from mixing zones of arctic shelf seas.” Chemometrics and Intelligent Laboratory Systems, 207(104176). doi:10.1016/j.chemolab.2020.104176.
Murphy KR, Stedmon CA, Graeber D, Bro R (2013). “Fluorescence spectroscopy and multi-way techniques. PARAFAC.” Analytical Methods, 5, 6557-6566. doi:10.1039/c3ay41160e.
Riu J, Bro R (2003). “Jack-knife technique for outlier detection and estimation of standard errors in PARAFAC models.” Chemometrics and Intelligent Laboratory Systems, 65(1), 35-49. doi:10.1016/S0169-7439(02)00090-4.
See Also
feemparafac, parafac, congru, feemcube.
Examples
library(albatross)

# load the example dataset and preprocess the cube
data(feems)
cube <- feemscale(feemscatter(cube, rep(14, 4)), na.rm = TRUE)

(sh <- feemsplithalf(
  cube, 1:4, splits = 4, # => S4C6T3
  # splits = c(4, 2) would be S4C4T2, and so on
  # the rest is passed to multiway::parafac;
  ctol = 1e-4
  # here we set a mild stopping criterion for speed;
  # be sure to use a stricter one for real tasks
))

# specifying fixed halves to compare as list of 2-element lists
fixed <- list(
  list(1:6, 7:12),
  list(seq(1, 11, 2), seq(2, 12, 2))
)
sh.f <- feemsplithalf(cube, 2:3, fixed = fixed, ctol = 1e-4)

# aggregated TCC plot and the first rows of the loadings table
plot(sh, 'aggtcc')
head(coef(sh, 'factors'))