GSSTDA {GSSTDA}    R Documentation

Gene Structure Survival using Topological Data Analysis (GSSTDA).

Description

Gene Structure Survival using Topological Data Analysis. This function analyses expression array data following the Progression Analysis of Disease developed by Nicolau et al. (doi: 10.1073/pnas.1102826108), which condenses the information contained in an expression matrix into a combinatorial graph. The novelty is that survival information is integrated into the analysis.

The analysis consists of three parts: preprocessing of the data, gene selection together with calculation of the filter function, and the mapper algorithm. The preprocessing is the Disease Specific Genomic Analysis (proposed by Nicolau et al.), which uses linear models to remove the part of the data that is considered "healthy" and keep only the component that is due to the disease. Genes are then selected according to their variability and their relationship with survival, and the value of the filter function for each patient is calculated taking into account the survival associated with each gene. Finally, the mapper algorithm is applied to the disease component matrix and the filter function values to obtain a combinatorial graph.
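The idea behind the preprocessing step can be illustrated with a deliberately simplified sketch. It is not the package's implementation: the matrices are simulated, and the healthy samples are used directly as regressors, whereas the published method first builds a reduced model of the healthy data. Each tumour sample is regressed on the healthy samples, and the residual is kept as the disease component.

```r
# Simplified sketch of the "disease component" idea (not the package's code).
set.seed(1)
n_genes <- 50
healthy <- matrix(rnorm(n_genes * 4), nrow = n_genes)  # 4 hypothetical healthy samples
tumour  <- matrix(rnorm(n_genes * 6), nrow = n_genes)  # 6 hypothetical tumour samples

# Regress every tumour sample on the healthy samples (no intercept).
# The fitted part plays the role of the "healthy" component,
# the residual plays the role of the disease component.
fit <- lm.fit(x = healthy, y = tumour)
disease_component <- fit$residuals

dim(disease_component)  # genes x tumour samples, same shape as `tumour`
```

By construction the residuals are orthogonal to the healthy samples, which is what "eliminating the healthy part" means in this simplified setting.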

Usage

GSSTDA(
  full_data,
  survival_time,
  survival_event,
  case_tag,
  gen_select_type = "Top_Bot",
  percent_gen_select = 10,
  num_intervals = 5,
  percent_overlap = 40,
  distance_type = "cor",
  clustering_type = "hierarchical",
  num_bins_when_clustering = 10,
  linkage_type = "single",
  na.rm = TRUE
)

Arguments

full_data

Input matrix whose columns correspond to the patients and rows to the genes.

survival_time

Numeric vector of the same length as the number of columns of full_data. Patients must be in the same order as in full_data. For patients with a tumour sample, give the time between disease diagnosis and death or, if the patient did not die, the end of follow-up. Healthy patients must have an NA value.

survival_event

Numeric vector of the same length as the number of columns of full_data. Patients must be in the same order as in full_data. For patients with a tumour sample, indicate whether the patient has died (1) or not (0); only these two values are valid. Healthy patients must have an NA value.

case_tag

Character vector of the same length as the number of columns of full_data. Patients must be in the same order as in full_data. Each element indicates whether the patient is healthy or not: one value must be used for healthy patients and another for patients whose sample is tumourous, so the vector may contain only two distinct values. The user is then asked which of the two values denotes healthy patients.
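As an illustration of how the four inputs fit together, the following sketch builds a toy cohort and checks the constraints described above. All values, the labels "NT" and "T", and the matrix dimensions are hypothetical.

```r
# Hypothetical cohort: 2 healthy controls followed by 3 tumour samples.
set.seed(5)
full_data <- matrix(rnorm(8 * 5), nrow = 8,
                    dimnames = list(paste0("gene", 1:8), paste0("p", 1:5)))

case_tag       <- c("NT", "NT", "T", "T", "T")   # illustrative labels, not prescribed ones
survival_time  <- c(NA, NA, 34.5, 12.0, 60.2)    # NA for healthy patients
survival_event <- c(NA, NA, 1, 1, 0)             # 1 = died, 0 = alive at end of follow-up

# Consistency checks mirroring the argument descriptions.
is_healthy <- case_tag == "NT"
stopifnot(
  length(case_tag) == ncol(full_data),
  length(survival_time) == ncol(full_data),
  length(survival_event) == ncol(full_data),
  length(unique(case_tag)) == 2,
  all(is.na(survival_time[is_healthy])),
  all(is.na(survival_event[is_healthy])),
  all(survival_event[!is_healthy] %in% c(0, 1))
)
```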

gen_select_type

Method used to select the genes for the mapper. With "Abs", the genes with the highest absolute value are chosen. With "Top_Bot", half of the selected genes are those with the highest value (positive value, i.e. worst survival prognosis) and the other half are those with the lowest value (negative value, i.e. best prognosis). The default is "Top_Bot".

percent_gen_select

Percentage (from 0 to 100) of genes to select for use in the mapper. The default is 10.
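The difference between the two selection modes, and the role of the percentage, can be sketched on a hypothetical vector of per-gene scores. The scores are simulated, and the rounding rule used to turn the percentage into a gene count is an assumption, not taken from the package.

```r
set.seed(42)
# Hypothetical per-gene survival scores (positive = worse prognosis).
scores <- setNames(rnorm(40), paste0("gene", 1:40))

percent_gen_select <- 10
# Number of genes to keep; ceiling() is an assumed rounding rule.
n_sel <- ceiling(length(scores) * percent_gen_select / 100)

# "Abs": genes with the largest absolute score.
abs_genes <- names(sort(abs(scores), decreasing = TRUE))[seq_len(n_sel)]

# "Top_Bot": half from the highest scores, half from the lowest.
half <- n_sel %/% 2
ordered_genes <- names(sort(scores, decreasing = TRUE))
top_bot_genes <- c(head(ordered_genes, half), tail(ordered_genes, half))
```

"Abs" can return genes from only one tail if that tail is more extreme, whereas "Top_Bot" always balances the two.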

num_intervals

Parameter for the mapper algorithm. Number of intervals used to create the first partition of the samples based on the filter function values. The default is 5.

percent_overlap

Parameter for the mapper algorithm. Overlap between consecutive intervals, expressed as a percentage. The default is 40.
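How num_intervals and percent_overlap interact can be seen in a minimal sketch of an overlapping interval cover. The filter values are hypothetical, and the formula below is one common way of building equal-length overlapping intervals; the package's exact construction may differ.

```r
filter_values   <- c(-2.1, -0.4, 0.0, 0.3, 1.2, 2.7, 3.9)  # hypothetical filter values
num_intervals   <- 5
percent_overlap <- 40
overlap <- percent_overlap / 100

rng <- range(filter_values)
# Interval length chosen so that num_intervals overlapping intervals span the range.
len <- diff(rng) / (num_intervals - (num_intervals - 1) * overlap)
starts <- rng[1] + (seq_len(num_intervals) - 1) * len * (1 - overlap)
intervals <- data.frame(start = starts, end = starts + len)
intervals$end[num_intervals] <- rng[2]  # guard against floating-point rounding

# Indices of the samples falling in each interval; a sample may appear in two.
members <- lapply(seq_len(num_intervals), function(i) {
  which(filter_values >= intervals$start[i] & filter_values <= intervals$end[i])
})
```

Samples that fall in the overlap of two intervals are what later create edges between nodes of the graph.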

distance_type

Parameter for the mapper algorithm. Type of distance to be used for clustering. Choose between correlation ("cor") and Euclidean ("euclidean"). The default is "cor".
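The two distance options can be sketched with base R on a simulated matrix whose columns are patients. Defining the correlation distance as one minus the Pearson correlation is an assumption here; the package may use a different variant.

```r
set.seed(7)
# Hypothetical disease component matrix: 20 genes x 5 patients.
disease_component <- matrix(rnorm(20 * 5), nrow = 20,
                            dimnames = list(NULL, paste0("patient", 1:5)))

# Euclidean distance between patients; dist() compares rows, hence t().
d_euclid <- dist(t(disease_component), method = "euclidean")

# Correlation-based distance (assumed form): 1 - Pearson correlation
# between patient columns.
d_cor <- as.dist(1 - cor(disease_component))
```

The correlation distance ignores overall scale and compares only the shape of the expression profiles, while the Euclidean distance is sensitive to magnitude.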

clustering_type

Parameter for the mapper algorithm. Type of clustering method. Choose between "hierarchical" and "PAM" ("partitioning around medoids"). The default is "hierarchical".

num_bins_when_clustering

Parameter for the mapper algorithm. Number of bins used to build the histogram employed by the standard method for finding the optimal number of clusters. Not needed if the "optimal_clust_mode" option is "silhouette" or the clustering type is "PAM". The default is 10.
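The histogram-based choice of the number of clusters is commonly described as follows: build a histogram of the merge heights of the hierarchical tree and cut the tree at the first empty bin. The sketch below follows that description on simulated points; it is an assumption about the general technique, not a copy of the package's code.

```r
set.seed(3)
# Two well-separated hypothetical groups of 5 points each.
pts <- rbind(matrix(rnorm(10, mean = 0, sd = 0.1), ncol = 2),
             matrix(rnorm(10, mean = 5, sd = 0.1), ncol = 2))
hc <- hclust(dist(pts), method = "single")

num_bins_when_clustering <- 10
# Equal-width bins spanning the observed merge heights.
breaks <- seq(min(hc$height), max(hc$height),
              length.out = num_bins_when_clustering + 1)
counts <- hist(hc$height, breaks = breaks, plot = FALSE)$counts

empty <- which(counts == 0)
if (length(empty) > 0) {
  # Cut at the lower edge of the first empty bin.
  clusters <- cutree(hc, h = breaks[empty[1]])
} else {
  # No gap in the heights: treat everything as a single cluster.
  clusters <- rep(1L, nrow(pts))
}
```

More bins make the heuristic more sensitive to small gaps in the merge heights, so it tends to produce more clusters.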

linkage_type

Parameter for the mapper algorithm. Linkage criterion used in hierarchical clustering. Choose between "single" (single-linkage clustering), "complete" (complete-linkage clustering) or "average" (average-linkage clustering, or UPGMA). Only used for hierarchical clustering. The default is "single".
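The three linkage names match the method argument of base R's hclust(). The following sketch, on simulated points, fits one tree per criterion so their merge heights can be compared. Whether the package calls hclust() internally is not stated here and is an assumption.

```r
set.seed(11)
pts <- matrix(rnorm(30), ncol = 3)  # 10 hypothetical samples in 3 dimensions
d <- dist(pts)

linkages <- c("single", "complete", "average")
# One hierarchical tree per linkage criterion.
trees <- lapply(setNames(linkages, linkages),
                function(m) hclust(d, method = m))

# Height of the final merge under each criterion.
final_heights <- sapply(trees, function(tr) max(tr$height))
final_heights
```

Single linkage merges on the closest pair of points and tends to produce chained clusters, while complete linkage merges on the farthest pair and favours compact ones.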

na.rm

Logical. If TRUE, NA rows are omitted. If FALSE, an error is raised if there are NA rows. The default is TRUE.
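The behaviour described for na.rm can be sketched as follows, assuming that an "NA row" means a row containing at least one NA value. The matrix and the error message are hypothetical.

```r
# Hypothetical 3 genes x 3 patients matrix with missing values.
full_data <- matrix(c(1.2,  NA, 0.7,
                      0.4, 0.9, 1.1,
                      2.0, 1.5,  NA),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("geneA", "geneB", "geneC"),
                                    c("p1", "p2", "p3")))
na.rm <- TRUE

if (na.rm) {
  # Keep only genes (rows) without any missing value.
  full_data <- full_data[stats::complete.cases(full_data), , drop = FALSE]
} else if (anyNA(full_data)) {
  stop("full_data contains rows with NA values")
}

rownames(full_data)
```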

Value

A GSSTDA object. It contains:

Examples


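library(GSSTDA)

# full_data, survival_time, survival_event and case_tag are assumed to be
# available in the session, e.g. as example data shipped with the package.
# The call prompts for which value of case_tag denotes healthy patients.
GSSTDA_object <- GSSTDA(full_data, survival_time, survival_event, case_tag,
                 gen_select_type = "Top_Bot", percent_gen_select = 10,
                 num_intervals = 4, percent_overlap = 50,
                 distance_type = "euclidean", num_bins_when_clustering = 8,
                 clustering_type = "hierarchical", linkage_type = "single")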

[Package GSSTDA version 0.1.3 Index]