bootSVD {bootSVD} R Documentation

## Calculates bootstrap distribution of PCA (i.e. SVD) results

### Description

Applies fast bootstrap PCA, using the method from (Fisher et al., 2014). Dimension of the sample is denoted by p, and sample size is denoted by n, with p>n.

### Usage

```bootSVD(Y = NULL, K, V = NULL, d = NULL, U = NULL, B = 50,
output = "HD_moments", verbose = getOption("verbose"), bInds = NULL,
percentiles = c(0.025, 0.975), centerSamples = TRUE, pattern_V = "V_",
pattern_Vb = "Vb_")
```

### Arguments

 `Y` initial data sample, which can be either a matrix or a `ff` matrix. `Y` can be either tall (p by n) or wide (n by p). If `Y` is entered and `V`, `d` and `U` (see definitions below) are not entered, then `bootSVD` will also compute the SVD of `Y`. In this case where the SVD is computed, `bootSVD` will assume that the larger dimension of `Y` is p, and the smaller dimension of code Y is n (i.e. `bootSVD` assumes that (p>n). This assumption can be overriden by manually entering `V`, `U` and `d`. For cases where the entire data matrix can be easily stored in memory (e.g. p<50000), it is generally appropriate to enter `Y` as a standard matrix. When `Y` is large enough that matrix algebra on `Y` is too demanding for memory though, `Y` should be entered as a `ff` object, where the actual data is stored on disk. If `Y` has class `ff`, and `V`, `d` or `U` is not entered, then block matrix algebra will be used to calculate the PCs and bootstrap PCs. The results of these calculations will be returned as `ff` objects as well. `K` number of PCs to calculate the bootstrap distribution for. `V` (optional) the (p by n) full matrix of p-dimensional PCs for the sample data matrix. If `Y` is wide, these are the right singular vectors of `Y` (i.e. Y=UDV'). If `Y` is tall, these are the left singular vectors of `Y` (i.e. Y=VDU'). In general it is assumed that p>n, however, this can be overridden by setting `V` and `U` appropriately. Like `Y`, the argument `V` can be either a standard matrix or a `ff` matrix. If `V` is a `ff` object, the bootstrap PCs, if requested, will be returned as `ff` objects as well. `d` (optional) n-length vector of the singular values of `Y`. For example, if `Y` is tall, then we have Y=VDU' with `D=diag(d)`. `U` (optional) the (n by n) full set of n-dimensional singular vectors of `Y`. If `Y` is wide, these are the left singular vectors of `Y` (i.e. Y=UDV'). If `Y` is tall, these are the right singular vectors of `Y` (i.e. Y=VDU'). `B` number of bootstrap samples to compute. `output` a vector telling which descriptions of the bootstrap distribution should be calculated. Can include any of the following: 'initial_SVD', 'HD_moments', 'full_HD_PC_dist', and 'HD_percentiles'. See below for explanations of these outputs. For especially high dimensional cases, caution should be used if requesting 'full_HD_PC_dist' due to potential storage limitations. `verbose` if TRUE, the function will print progress during calculation procedure. `bInds` a (B by n) matrix of bootstrap indeces, where `B` is the number of bootstrap samples, and `n` is the sample size. The purpose of setting a specific bootstrap sampling index is to allow the results to be more precisely compared against standard bootstrap PCA calculations. If entered, the `bInds` argument will override the `B` argument. `percentiles` a vector containing percentiles to be used to calculate element-wise percentiles across the bootstrap distribution (both across the distribution of p-dimensional components and the distribution of n-dimensional components). For example, `percentiles=c(.025,.975)` will return the 2.5 and 97.5 percentiles, which can be used as 95 percent bootstrap percentile CIs. Alternatively, a longer vector of percentiles can be entered. `centerSamples` whether each bootstrap sample should be centered before calculating the SVD. `pattern_V` if `Y` is a class `ff` object, then the returned PCs of `Y` will also be a class `ff` object. `pattern_V` is passed to `ff` in creation of the `initial_SVD` output. Specifically, `pattern_V` is a filename prefix used for storing the high dimensional PCs of the original sample. `pattern_Vb` if `Y` or `V` is a class `ff` object, then the returned bootstrap PCs will also be class `ff` objects. `pattern_Vb` is passed to `ff` in creation of the `full_HD_PC_dist` output. Specifically, `pattern_Vb` is a filename prefix used for storing the high dimensional bootstrap PCs.

### Details

Users might also consider changing the global options `ffbatchbytes`, from the `ff` package, and `mc.cores`, from the `parallel` package. When `ff` objects are entered as arguments for `bootSVD`, the required matrix algebra is done using block matrix alebra. The `ffbatchbytes` option determines the size of the largest block matrix that will be held in memory at any one time. The `mc.cores` option (set to 1 by default) determines the level of parallelization to use when calculating the high dimensional distribution of the bootstrap PCs (see `mclapply`).

### Value

`bootSVD` returns a list that can include any of the following elements, depending on what is specified in the `output` argument:

initial_SVD

The singular value decomposition of the centered, original data matrix. `initial_SVD` is a list containing `V`, the matrix of p-dimensional principal components, `d`, the vector of singular values of `Y`, and `U`, the matrix of n-dimensional singular vectors of `Y`.

HD_moments

A list containing the bootstrap expected value (`EPCs`), element-wise bootstrap variance (`varPCs`), and element-wise bootstrap standard deviation (`sdPCs`) for each of the p-dimensional PCs. Each of these three elements of `HD_moments` is also a list, which contains K vectors, one for each PC. `HD_moments` also contains `momentCI`, a K-length list of (p by 2) matrices containing element-wise moment based confidence intervals for the PCs.

full_HD_PC_dist

A B-length list of matrices (or `ff` matrices), with the b^{th} list element equal to the (p by K) matrix of high dimensional PCs for the b^{th} bootstrap sample.
For especially high dimensional cases when the output is returned as `ff` matrices, caution should be used if requesting 'full_HD_PC_dist' due to potential storage limitations.
To reindex these PCs by k (the PC index) as opposed to b (the bootstrap index), see `reindexMatricesByK`. Again though, caution shoulded be used when reindexing PCs stored as `ff` objects, as this will double the number of files stored.

HD_percentiles

A list of K matrices, each of dimension (p by q), where q is the number of percentiles requested (i.e. q = `length(percentiles)`). The k^{th} matrix in `HD_percentiles` contains element-wise percentiles for the k^{th}, p-dimensional PC.

In addition, the following results are always included in the output, regardless of what is specified in the `output` argument:

 `full_LD_PC_dist` A B-length list of matrices, with the b^{th} list element equal to the (p by K) matrix of PCs of the scores in the b^{th} bootstrap sample. To reindex these vectors by k (the PC index), see `reindexMatricesByK`. `d_dist` A `B`-length list of vectors, with the b^{th} element of `d_dist` containing the n-length vector of singular values from the b^{th} bootstrap sample. To reindex these values by k (the PC index), see `reindexVectorsByK`. `U_dist` A `B`-length list of (n by K) matrices, with the columns of the b^{th} matrix containing the n-length singular vectors from the b^{th} bootstrap sample. To reindex these vectors by k (the PC index), see `reindexMatricesByK`. `LD_moments` A list that is comparable to `HD_moments`, but that instead describes the variability of the n-dimensional principal components of the resampled score matrices. `LD_moments` contains the bootstrap expected value (`EPCs`), element-wise bootstrap variances (`varPCs`), and element-wise bootstrap standard deviations (`sdPCs`) for each of the n-dimensional PCs. Each of these three elements of `LD_moments` is also a list, which contains K vectors, one for each PC. `LD_moments` also contains `momentCI`, a list of K (n by 2) matrices containing element-wise, moment-based confidence intervals for the PCs. `LD_percentiles` A list of K matrices, each of dimension (p by q), where q is the number of percentiles requested (i.e. q = `length(percentiles)`). The k^{th} matrix in `LD_percentiles` contains element-wise percentiles for the k^{th} n-dimensional PC.

### References

Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922

### Examples

```#use small n, small B, for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE)
b<-bootSVD(Y, B=50, K=2, output=
c('initial_SVD', 'HD_moments', 'full_HD_PC_dist',
'HD_percentiles'), verbose=interactive())
b

#explore results
matplot(b\$initial_SVD\$V[,1:4],type='l',main='Fitted PCs',lty=1)
legend('bottomright',paste0('PC',1:4),col=1:4,lty=1,lwd=2)

######################
# look specifically at 2nd PC
k<-2

######
#looking at HD variability

#plot several draws from bootstrap distribution
VsByK<-reindexMatricesByK(b\$full_HD_PC_dist)
matplot(t(VsByK[[k]][1:20,]),type='l',lty=1,
main=paste0('20 Draws from bootstrap\ndistribution of HD PC ',k))

#plot pointwise CIs
matplot(b\$HD_moments\$momentCI[[k]],type='l',col='blue',lty=1,
main=paste0('CIs For HD PC ',k))
matlines(b\$HD_percentiles[[k]],type='l',col='darkgreen',lty=1)
lines(b\$initial_SVD\$V[,k])
legend('topright',c('Fitted PC','Moment CIs','Percentile CIs'),
lty=1,col=c('black','blue','darkgreen'))
abline(h=0,lty=2,col='darkgrey')

######
# looking at LD variability

# plot several draws from bootstrap distribution
AsByK<-reindexMatricesByK(b\$full_LD_PC_dist)
matplot(t(AsByK[[k]][1:50,]),type='l',lty=1,
main=paste0('50 Draws from bootstrap\ndistribution of LD PC ',k),
xlim=c(1,10),xlab='PC index (truncated)')

# plot pointwise CIs
matplot(b\$LD_moments\$momentCI[[k]],type='o',col='blue',
lty=1,main=paste0('CIs For LD PC ',k),xlim=c(1,10),
xlab='PC index (truncated)',pch=1)
matlines(b\$LD_percentiles[[k]],type='o',pch=1,col='darkgreen',lty=1)
abline(h=0,lty=2,col='darkgrey')
legend('topright',c('Moment CIs','Percentile CIs'),lty=1,
pch=1,col=c('blue','darkgreen'))
#Note: variability is mostly due to rotations with the third and fourth PC.

# Bootstrap eigenvalue distribution
dsByK<-reindexVectorsByK(b\$d_dist)
boxplot(dsByK[[k]]^2,main=paste0('Covariance Matrix Eigenvalue ',k),
ylab='Bootstrap Distribution',
ylim=range(c(dsByK[[k]]^2,b\$initial_SVD\$d[k]^2)))
points(b\$initial_SVD\$d[k]^2,pch=18,col='red')
legend('bottomright','Sample Value',pch=18,col='red')

##################
#Example with ff input
library(ff)
Yff<-as.ff(Y, pattern='Y_')
# If desired, change options in 'ff' package to
# adjust the size of matrix blocks held in RAM.
# For example:
# options('ffbatchbytes'=100000)
ff_dir<-tempdir()
pattern_V <- paste0(ff_dir,'/V_')
pattern_Vb <- paste0(ff_dir,'/Vb_')
bff <- bootSVD(Yff, B=50, K=2, output=c('initial_SVD', 'HD_moments',
'full_HD_PC_dist', 'HD_percentiles'), pattern_V= pattern_V,
pattern_Vb=pattern_Vb, verbose=interactive())

# Note that elements of full_HD_PC_dist and initial_SVD
# have class 'ff'
str(lapply(bff,function(x) class(x[[1]])))
#Show some results of bootstrap draws
plot(bff\$full_HD_PC_dist[[1]][,k],type='l')
#Reindexing by K will create a new set of ff files.
VsByKff<-reindexMatricesByK(bff\$full_HD_PC_dist,
pattern=paste0(ff_dir,'/Vk_'))
physical(bff\$full_HD_PC_dist[[1]])\$filename
physical(VsByKff[[1]])\$filename
matplot(t(VsByKff[[k]][1:10,]),type='l',lty=1,
main=paste0('Bootstrap Distribution of PC',k))

# Saving and moving results:
saveRDS(bff,file=paste0(ff_dir,'/bff.rds'))
close(bff\$initial_SVD\$V)
physical(bff\$initial_SVD\$V)\$filename
# If the 'ff' files on disk are moved or renamed,
# this filename attribute can be changed:
old_ff_path <- physical(bff\$initial_SVD\$V)\$filename
new_ff_path <- paste0(tempdir(),'/new_V_file.ff')
file.rename(from= old_ff_path, to= new_ff_path)
physical(bff\$initial_SVD\$V)\$filename <- new_ff_path
matplot(bff\$initial_SVD\$V[,1:4],type='l',lty=1)

```

[Package bootSVD version 1.1 Index]