bootSVD {bootSVD} | R Documentation |
Calculates bootstrap distribution of PCA (i.e. SVD) results
Description
Applies fast bootstrap PCA, using the method from (Fisher et al., 2014). Dimension of the sample is denoted by p
, and sample size is denoted by n
, with p>n
.
Usage
bootSVD(Y = NULL, K, V = NULL, d = NULL, U = NULL, B = 50,
output = "HD_moments", verbose = getOption("verbose"), bInds = NULL,
percentiles = c(0.025, 0.975), centerSamples = TRUE, pattern_V = "V_",
pattern_Vb = "Vb_")
Arguments
Y |
initial data sample, which can be either a matrix or a |
K |
number of PCs to calculate the bootstrap distribution for. |
V |
(optional) the ( |
d |
(optional) |
U |
(optional) the ( |
B |
number of bootstrap samples to compute. |
output |
a vector telling which descriptions of the bootstrap distribution should be calculated. Can include any of the following: 'initial_SVD', 'HD_moments', 'full_HD_PC_dist', and 'HD_percentiles'. See below for explanations of these outputs. |
verbose |
if TRUE, the function will print progress during calculation procedure. |
bInds |
a ( |
percentiles |
a vector containing percentiles to be used to calculate element-wise percentiles across the bootstrap distribution (both across the distribution of |
centerSamples |
whether each bootstrap sample should be centered before calculating the SVD. |
pattern_V |
if |
pattern_Vb |
if |
Details
Users might also consider changing the global options ffbatchbytes
, from the ff
package, and mc.cores
, from the parallel
package. When ff
objects are entered as arguments for bootSVD
, the required matrix algebra is done using block matrix alebra. The ffbatchbytes
option determines the size of the largest block matrix that will be held in memory at any one time. The mc.cores
option (set to 1 by default) determines the level of parallelization to use when calculating the high dimensional distribution of the bootstrap PCs (see mclapply
).
Value
bootSVD
returns a list that can include any of the following elements, depending on what is specified in the output
argument:
- initial_SVD
The singular value decomposition of the centered, original data matrix.
initial_SVD
is a list containingV
, the matrix ofp
-dimensional principal components,d
, the vector of singular values ofY
, andU
, the matrix ofn
-dimensional singular vectors ofY
.- HD_moments
A list containing the bootstrap expected value (
EPCs
), element-wise bootstrap variance (varPCs
), and element-wise bootstrap standard deviation (sdPCs
) for each of thep
-dimensional PCs. Each of these three elements ofHD_moments
is also a list, which containsK
vectors, one for each PC.HD_moments
also containsmomentCI
, aK
-length list of (p
by 2) matrices containing element-wise moment based confidence intervals for the PCs.- full_HD_PC_dist
A
B
-length list of matrices (orff
matrices), with theb^{th}
list element equal to the (p
byK
) matrix of high dimensional PCs for theb^{th}
bootstrap sample.
For especially high dimensional cases when the output is returned asff
matrices, caution should be used if requesting 'full_HD_PC_dist' due to potential storage limitations.
To reindex these PCs byk
(the PC index) as opposed tob
(the bootstrap index), seereindexMatricesByK
. Again though, caution shoulded be used when reindexing PCs stored asff
objects, as this will double the number of files stored.- HD_percentiles
A list of
K
matrices, each of dimension (p
byq
), whereq
is the number of percentiles requested (i.e.q
=length(percentiles)
). Thek^{th}
matrix inHD_percentiles
contains element-wise percentiles for thek^{th}
,p
-dimensional PC.
In addition, the following results are always included in the output, regardless of what is specified in the output
argument:
full_LD_PC_dist |
A |
d_dist |
A |
U_dist |
A |
LD_moments |
A list that is comparable to |
LD_percentiles |
A list of |
References
Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922
Examples
#use small n, small B, for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE)
b<-bootSVD(Y, B=50, K=2, output=
c('initial_SVD', 'HD_moments', 'full_HD_PC_dist',
'HD_percentiles'), verbose=interactive())
b
#explore results
matplot(b$initial_SVD$V[,1:4],type='l',main='Fitted PCs',lty=1)
legend('bottomright',paste0('PC',1:4),col=1:4,lty=1,lwd=2)
######################
# look specifically at 2nd PC
k<-2
######
#looking at HD variability
#plot several draws from bootstrap distribution
VsByK<-reindexMatricesByK(b$full_HD_PC_dist)
matplot(t(VsByK[[k]][1:20,]),type='l',lty=1,
main=paste0('20 Draws from bootstrap\ndistribution of HD PC ',k))
#plot pointwise CIs
matplot(b$HD_moments$momentCI[[k]],type='l',col='blue',lty=1,
main=paste0('CIs For HD PC ',k))
matlines(b$HD_percentiles[[k]],type='l',col='darkgreen',lty=1)
lines(b$initial_SVD$V[,k])
legend('topright',c('Fitted PC','Moment CIs','Percentile CIs'),
lty=1,col=c('black','blue','darkgreen'))
abline(h=0,lty=2,col='darkgrey')
######
# looking at LD variability
# plot several draws from bootstrap distribution
AsByK<-reindexMatricesByK(b$full_LD_PC_dist)
matplot(t(AsByK[[k]][1:50,]),type='l',lty=1,
main=paste0('50 Draws from bootstrap\ndistribution of LD PC ',k),
xlim=c(1,10),xlab='PC index (truncated)')
# plot pointwise CIs
matplot(b$LD_moments$momentCI[[k]],type='o',col='blue',
lty=1,main=paste0('CIs For LD PC ',k),xlim=c(1,10),
xlab='PC index (truncated)',pch=1)
matlines(b$LD_percentiles[[k]],type='o',pch=1,col='darkgreen',lty=1)
abline(h=0,lty=2,col='darkgrey')
legend('topright',c('Moment CIs','Percentile CIs'),lty=1,
pch=1,col=c('blue','darkgreen'))
#Note: variability is mostly due to rotations with the third and fourth PC.
# Bootstrap eigenvalue distribution
dsByK<-reindexVectorsByK(b$d_dist)
boxplot(dsByK[[k]]^2,main=paste0('Covariance Matrix Eigenvalue ',k),
ylab='Bootstrap Distribution',
ylim=range(c(dsByK[[k]]^2,b$initial_SVD$d[k]^2)))
points(b$initial_SVD$d[k]^2,pch=18,col='red')
legend('bottomright','Sample Value',pch=18,col='red')
##################
#Example with ff input
library(ff)
Yff<-as.ff(Y, pattern='Y_')
# If desired, change options in 'ff' package to
# adjust the size of matrix blocks held in RAM.
# For example:
# options('ffbatchbytes'=100000)
ff_dir<-tempdir()
pattern_V <- paste0(ff_dir,'/V_')
pattern_Vb <- paste0(ff_dir,'/Vb_')
bff <- bootSVD(Yff, B=50, K=2, output=c('initial_SVD', 'HD_moments',
'full_HD_PC_dist', 'HD_percentiles'), pattern_V= pattern_V,
pattern_Vb=pattern_Vb, verbose=interactive())
# Note that elements of full_HD_PC_dist and initial_SVD
# have class 'ff'
str(lapply(bff,function(x) class(x[[1]])))
#Show some results of bootstrap draws
plot(bff$full_HD_PC_dist[[1]][,k],type='l')
#Reindexing by K will create a new set of ff files.
VsByKff<-reindexMatricesByK(bff$full_HD_PC_dist,
pattern=paste0(ff_dir,'/Vk_'))
physical(bff$full_HD_PC_dist[[1]])$filename
physical(VsByKff[[1]])$filename
matplot(t(VsByKff[[k]][1:10,]),type='l',lty=1,
main=paste0('Bootstrap Distribution of PC',k))
# Saving and moving results:
saveRDS(bff,file=paste0(ff_dir,'/bff.rds'))
close(bff$initial_SVD$V)
physical(bff$initial_SVD$V)$filename
# If the 'ff' files on disk are moved or renamed,
# this filename attribute can be changed:
old_ff_path <- physical(bff$initial_SVD$V)$filename
new_ff_path <- paste0(tempdir(),'/new_V_file.ff')
file.rename(from= old_ff_path, to= new_ff_path)
physical(bff$initial_SVD$V)$filename <- new_ff_path
matplot(bff$initial_SVD$V[,1:4],type='l',lty=1)