create_features_df {driveR} | R Documentation |
Create Data Frame of Features for Driver Gene Prioritization
Description
Create Data Frame of Features for Driver Gene Prioritization
Usage
create_features_df(
annovar_csv_path,
scna_df,
phenolyzer_annotated_gene_list_path,
batch_analysis = FALSE,
prep_phenolyzer_input = FALSE,
build = "GRCh37",
log2_ratio_threshold = 0.25,
gene_overlap_threshold = 25,
MCR_overlap_threshold = 25,
hotspot_threshold = 5L,
log2_hom_loss_threshold = -1,
verbose = TRUE,
na.string = "."
)
Arguments
annovar_csv_path |
path to 'ANNOVAR' csv output file |
scna_df |
the SCNA segments data frame. Must contain:
|
phenolyzer_annotated_gene_list_path |
path to 'phenolyzer' "annotated_gene_list" file |
batch_analysis |
boolean to indicate whether to perform batch analysis
(default = FALSE) |
prep_phenolyzer_input |
boolean to indicate whether or not to create
a vector of genes for use as the input of 'phenolyzer' (default = FALSE) |
build |
genome build for the SCNA segments data frame (default = "GRCh37") |
log2_ratio_threshold |
the log2 ratio threshold for keeping high-confidence SCNA events (default = 0.25) |
gene_overlap_threshold |
the percentage threshold for the overlap between a segment and a transcript (default = 25). A transcript is assigned a segment's SCNA event only if the segment overlaps the transcript by more than this threshold. |
MCR_overlap_threshold |
the percentage threshold for the overlap between a gene and an MCR (default = 25). A gene is assigned the SCNA density of an MCR only if the gene overlaps the MCR by more than this threshold. |
hotspot_threshold |
the (integer) threshold for the minimum number of cases with a certain mutation in COSMIC, used to determine hotspot genes (default = 5) |
log2_hom_loss_threshold |
to determine double-hit events, the log2 threshold for identifying homozygous loss events (default = -1). |
verbose |
boolean controlling verbosity (default = TRUE) |
na.string |
string that was used to indicate when a score is not available during annotation with ANNOVAR (default = ".") |
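The threshold arguments above can be illustrated with a small base-R sketch. It does not reproduce driveR's internal code. The column names (chr, start, end, log2ratio), the toy coordinates, the helper percent_overlap(), the use of the absolute log2 ratio for filtering, and the choice of transcript length as the denominator are all assumptions made for illustration.

```r
# Toy SCNA segments (hypothetical values and column names)
segments <- data.frame(
  chr = c("1", "1", "2"),
  start = c(1000, 5000, 2000),
  end = c(3000, 9000, 4000),
  log2ratio = c(0.8, 0.1, -1.3)
)

log2_ratio_threshold <- 0.25
log2_hom_loss_threshold <- -1
gene_overlap_threshold <- 25

# Assumption: a segment is high-confidence when its absolute log2 ratio
# exceeds the threshold
high_conf <- segments[abs(segments$log2ratio) > log2_ratio_threshold, ]

# Assumption: a log2 ratio at or below the homozygous-loss threshold
# marks a homozygous loss
high_conf$hom_loss <- high_conf$log2ratio <= log2_hom_loss_threshold

# Hypothetical helper: percentage of a transcript covered by a segment
# (both intervals are closed and on the same chromosome)
percent_overlap <- function(seg_start, seg_end, tx_start, tx_end) {
  overlap <- max(0, min(seg_end, tx_end) - max(seg_start, tx_start) + 1)
  100 * overlap / (tx_end - tx_start + 1)
}

# A made-up transcript at 2500-4500 against the first segment (1000-3000)
pct <- percent_overlap(1000, 3000, 2500, 4500)
assigned <- pct > gene_overlap_threshold
```

In this toy case the segment covers 501 of the transcript's 2001 bases, just over 25 percent, so the transcript would be assigned the segment's event under the default threshold.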
Value
If prep_phenolyzer_input=FALSE (default), a data frame of features for prioritizing cancer driver genes (gene_symbol as the first column and 26 other columns containing features). If prep_phenolyzer_input=TRUE, the function returns a vector of gene symbols (union of all gene symbols for which scores are available) to be used as the input for performing 'phenolyzer' analysis.
The features data frame contains the following columns:
- gene_symbol
HGNC gene symbol
- metaprediction_score
the maximum metapredictor (coding) impact score for the gene
- noncoding_score
the maximum non-coding PHRED-scaled CADD score for the gene
- scna_score
SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located
- hotspot_double_hit
boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)
- phenolyzer_score
'phenolyzer' score for the gene
- hsa03320
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04010
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04020
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04024
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04060
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04066
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04110
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04115
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04150
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04151
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04210
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04310
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04330
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04340
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04350
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04370
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04510
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04512
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04520
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04630
boolean indicating whether or not the gene takes part in this KEGG pathway
- hsa04915
boolean indicating whether or not the gene takes part in this KEGG pathway
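As a rough sketch of how the columns listed above might be used downstream, the following base-R snippet works on a toy stand-in for the features data frame. The gene names and values are made up, only a few of the 27 columns are included, and it assumes the boolean columns are stored as logical vectors.

```r
# Toy stand-in for the features data frame (hypothetical genes and values)
features_df <- data.frame(
  gene_symbol = c("GENE_A", "GENE_B", "GENE_C"),
  metaprediction_score = c(0.91, 0.12, NA),
  hotspot_double_hit = c(TRUE, FALSE, FALSE),
  hsa04110 = c(TRUE, FALSE, TRUE),
  hsa04115 = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# KEGG membership columns follow the "hsa" + five digits naming pattern
kegg_cols <- grep("^hsa[0-9]{5}$", names(features_df), value = TRUE)

# Number of the listed KEGG pathways each gene takes part in
features_df$n_pathways <- rowSums(features_df[, kegg_cols, drop = FALSE])

# Genes flagged as hotspot or double-hit
flagged <- features_df$gene_symbol[features_df$hotspot_double_hit]
```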
See Also
prioritize_driver_genes for prioritizing cancer driver genes
Examples
path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                package = "driveR")
path2phenolyzer_out <- system.file("extdata/example.annotated_gene_list",
                                   package = "driveR")
features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_df = example_scna_table,
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out)
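The Value section also describes a second mode, prep_phenolyzer_input = TRUE, which returns gene symbols to feed into 'phenolyzer'. A possible sketch of that call is given below. It assumes that phenolyzer_annotated_gene_list_path can be left out in this mode (the 'phenolyzer' output does not exist yet at that point) and that the result can be coerced to a character vector. The output file name phenolyzer_input.txt is hypothetical. The call is guarded so that it only runs when driveR is installed.

```r
# Stays NULL when driveR is not installed
phenolyzer_genes <- NULL

if (requireNamespace("driveR", quietly = TRUE)) {
  path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                  package = "driveR")

  # Assumption: the 'phenolyzer' path is not needed when only preparing input
  phenolyzer_genes <- driveR::create_features_df(
    annovar_csv_path = path2annovar_csv,
    scna_df = driveR::example_scna_table,
    prep_phenolyzer_input = TRUE,
    build = "GRCh37"
  )

  # Write one gene symbol per line to a temporary (hypothetically named) file
  out_file <- file.path(tempdir(), "phenolyzer_input.txt")
  writeLines(as.character(phenolyzer_genes), out_file)
}
```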