dCUR {dCUR}    R Documentation

dCUR

Description

Dynamic CUR (dCUR) is a function that extends the CUR decomposition by varying k (the number of principal components) and the number of columns and rows used. Its purpose is to find the stage that minimizes the relative error. Both the classic CUR and its extensions can be used in dCUR.

Dynamic CUR is an R package that extends the CUR decomposition by varying k and the number of columns and rows used; its ultimate purpose is to help find the stage that minimizes the relative error when reducing the dimension of a matrix. Mahoney & Drineas (2009) identified the difficulty of interpreting the singular vectors of the SVD (the principal components) as a problem and proposed another type of matrix factorization known as CUR decomposition (Mahoney & Drineas, 2009; Mahoney, Maggioni, & Drineas, 2008; Bodor, Csabai, Mahoney, & Solymosi, 2012). The goal of CUR decomposition is to give a better interpretation of the matrix decomposition by employing proper variable selection in the data matrix, in a way that yields a simplified structure. Its origins lie in the analysis of genetic data. One example is shown in Mahoney & Drineas (2009), in which CUR was applied to cancer microarray data to recognize, based on 5000 variables, genetic patterns in patients with soft tissue tumors analyzed with cDNA microarrays.

The objective of this package is to offer an alternative way of selecting variables (columns) or individuals (rows) to the ones developed by Mahoney & Drineas (2009). The proposed idea consists of fitting probability distributions to the leverage scores and selecting the columns and rows that minimize the reconstruction error of the matrix approximation \|A-CUR\|. The package also includes a method that recalibrates the relative importance of the leverage scores according to an external variable of the user's interest.

Usage

dCUR(
  data,
  variables,
  standardize = FALSE,
  dynamic_columns = FALSE,
  dynamic_rows = FALSE,
  parallelize = FALSE,
  skip = 0.05,
  ...
)

Arguments

data

a data frame that contains the variables to use in the CUR decomposition and any other external variables with which you want to correlate them.

variables

the variables used to compute the leverage scores in the CUR analysis. The names of external variables must not be included. dplyr notation can be used to specify the variables (see Examples).

standardize

logical. If TRUE, the data are standardized (by subtracting the mean and dividing by the standard deviation).

dynamic_columns

logical. If TRUE, an iterative process is run in which leverage scores are computed for each number of principal components from 1 to k, and for each column proportion up to c (the proportion of columns to be selected from the data matrix).

dynamic_rows

logical. If TRUE, an iterative process is run in which leverage scores are computed for each number of principal components from 1 to k, and for each row proportion up to r (the proportion of rows to be selected from the data matrix).

parallelize

logical. If TRUE, the CUR analysis is parallelized.

skip

numeric. The step by which the proportion of columns and rows to be selected changes from one stage to the next.

...

additional arguments to be passed to CUR.

Details

This function serves as a basis for selecting the best combination of k (number of principal components), c (number of columns) and r (number of rows); in other words, the stage that minimizes the relative error \frac{||A-CUR||}{||A||}. This optimizes the number of columns used in the analysis while ensuring a percentage of explained variability of the data matrix, and it facilitates the interpretation of the data set by reducing the dimensionality of the original matrix.
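The relative error of a single stage can be illustrated with a minimal base R sketch. It is not the package's implementation: it assumes the Frobenius norm, takes U = C^+ A R^+ (with ^+ the pseudoinverse), and simply keeps the columns and rows with the highest leverage scores instead of fitting probability distributions to them. The helper names pinv and cur_relative_error and the toy data are hypothetical.

```r
# Minimal sketch in base R; NOT the dCUR implementation.
set.seed(1)
A <- scale(matrix(rnorm(200 * 8), nrow = 200, ncol = 8))  # standardized toy data

# Moore-Penrose pseudoinverse via the SVD (hypothetical helper).
pinv <- function(M, tol = 1e-10) {
  s <- svd(M)
  keep <- s$d > tol * max(s$d)
  s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}

# Relative error ||A - CUR|| / ||A|| for one stage (k, c_prop, r_prop).
# Assumptions: Frobenius norm, deterministic top-leverage selection.
cur_relative_error <- function(A, k, c_prop, r_prop) {
  s <- svd(A)
  # Leverage scores: mean squared entries of the top-k singular vectors.
  col_lev <- rowSums(s$v[, seq_len(k), drop = FALSE]^2) / k
  row_lev <- rowSums(s$u[, seq_len(k), drop = FALSE]^2) / k
  n_cols <- max(1, ceiling(c_prop * ncol(A)))
  n_rows <- max(1, ceiling(r_prop * nrow(A)))
  cols <- order(col_lev, decreasing = TRUE)[seq_len(n_cols)]
  rows <- order(row_lev, decreasing = TRUE)[seq_len(n_rows)]
  C <- A[, cols, drop = FALSE]
  R <- A[rows, , drop = FALSE]
  U <- pinv(C) %*% A %*% pinv(R)
  norm(A - C %*% U %*% R, type = "F") / norm(A, type = "F")
}

err <- cur_relative_error(A, k = 3, c_prop = 0.5, r_prop = 0.25)
err
```

Under these assumptions the error lies between 0 and 1, and it drops to numerical zero when every column and row is kept.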

If skip = 0.1, then for each k the decomposition is tested with column proportions of 0, 0.1, 0.11, 0.22, ...; the same applies to rows. For this reason, choosing a very small skip is not recommended, since it means running the CUR analysis for more stages.

Parallelizing the function (parallelize = TRUE) improves its speed significantly.

Value

dCUR returns a list of lists. Each inner list represents a stage and contains:

k

number of principal components with which the leverage scores are computed.

columns

number of columns selected.

rows

number of rows selected.

relative_error

relative error obtained: \frac{||A-CUR||}{||A||}
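As an illustration of this structure, the following sketch picks the stage with the smallest relative error from a list of stages. The stages object and its values are made up for the example and are not output from dCUR; the package's own optimal_stage function (see See Also) is presumably the intended tool for this task.

```r
# Hypothetical stages mimicking the structure described above; values are made up.
stages <- list(
  list(k = 1, columns = 2, rows = 50, relative_error = 0.42),
  list(k = 2, columns = 3, rows = 60, relative_error = 0.17),
  list(k = 3, columns = 3, rows = 60, relative_error = 0.23)
)

# Extract the relative error of every stage and keep the stage that minimizes it.
errors <- vapply(stages, function(s) s$relative_error, numeric(1))
best <- stages[[which.min(errors)]]
str(best)
```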

Author(s)

Cesar Gamboa-Sanabria, Stefany Matarrita-Munoz, Katherine Barquero-Mejias, Greibin Villegas-Barahona, Mercedes Sanchez-Barba and Maria Purificacion Galindo-Villardon.

Cesar Gamboa-Sanabria info@cesargamboasanabria.com

See Also

CUR, optimal_stage

Examples



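 # Run the dynamic CUR analysis on the AASP data, using the variables
 # from hoessem to notabachillerato to compute the leverage scores.
 # k, rows, columns and cur_method are passed on to CUR through `...`.
 results <- dCUR::dCUR(data = AASP, variables = hoessem:notabachillerato,
   k = 15, rows = 0.25, columns = 0.25, skip = 0.1, standardize = TRUE,
   cur_method = "sample_cur",
   parallelize = TRUE, dynamic_columns = TRUE,
   dynamic_rows = TRUE)
 # Print the list of stages.
 results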



[Package dCUR version 1.0.1 Index]