generate_maf_data {ICBioMark} | R Documentation |
Generate mutation data.
Description
Randomly simulates an (abridged) annotated mutation file (MAF) recording the sample of origin, gene and mutation type of each mutation, together with a dataframe of gene lengths.
Usage
generate_maf_data(
n_samples = 100,
n_genes = 20,
mut_types = NULL,
data_dist = NULL,
sample_rates = NULL,
gene_rates = NULL,
gene_lengths = NULL,
sample_rates_dist = NULL,
gene_rates_dist = NULL,
gene_lengths_dist = NULL,
bmr_genes_prop = 0.7,
output_rates = FALSE,
seed_id = 1234
)
Arguments
n_samples |
(numeric) The number of samples to generate mutation data for. Each will have a unique value in the 'Tumor_Sample_Barcode' column of the simulated MAF table. Note that a sample with no simulated mutations will not appear in the table. |
n_genes |
(numeric) The number of genes to generate mutation data for - each will have a unique value in the 'Hugo_Symbol' column of the simulated MAF table. A length will also be generated for each gene, and stored in the table 'gene_lengths'. |
mut_types |
(numeric) A vector of positive values giving the relative average abundance of each mutation type. The name of each mutation type is taken from the names attribute of the vector, and these names form the entries of the 'Variant_Classification' column in the output MAF table. |
data_dist |
(function) Directly provide the probability distribution of mutations, as a function of n_samples, n_genes, mut_types, and gene_lengths. |
sample_rates |
(numeric) Directly provide sample-specific rates. |
gene_rates |
(numeric) Directly provide gene-specific rates. |
gene_lengths |
(numeric) Directly provide gene lengths, in the form of a vector of numerics with names attribute corresponding to gene names. |
sample_rates_dist |
(function) Directly provide the distribution of sample-specific rates, as a function of the number of samples. |
gene_rates_dist |
(function) Directly provide the distribution of gene-specific rates, as a function of the number of genes. |
gene_lengths_dist |
(function) Directly provide the distribution of gene lengths, as a function of the number of genes. |
bmr_genes_prop |
(numeric) The proportion of genes that follow the background mutation rate. If specified (as it is by default), this proportion of genes will have gene-specific rates equal to 1. Set to NULL to skip this step. |
output_rates |
(logical) If TRUE, the sample and gene rates are included in the output. |
seed_id |
(numeric) Input value for the function set.seed(). |
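As a minimal sketch of how the arguments described above might be combined, the following builds a named mut_types vector and a custom sample_rates_dist function, then passes them to generate_maf_data. The mutation type labels and the gamma distribution are illustrative assumptions, not package defaults, and the call is guarded so that it only runs when ICBioMark is installed.

```r
# Relative abundance of each mutation type. The names are illustrative
# labels (an assumption); they would become the entries of the
# 'Variant_Classification' column.
mut_types <- c(Missense_Mutation = 0.6, Silent = 0.3, Nonsense_Mutation = 0.1)

# Distribution of sample-specific rates, as a function of the number of
# samples. The gamma distribution here is an arbitrary choice for this sketch.
sample_rates_dist <- function(n_samples) {
  stats::rgamma(n_samples, shape = 2, rate = 2)
}

# Only call the package if it is available.
if (requireNamespace("ICBioMark", quietly = TRUE)) {
  custom <- ICBioMark::generate_maf_data(
    n_samples = 10,
    n_genes = 20,
    mut_types = mut_types,
    sample_rates_dist = sample_rates_dist,
    seed_id = 42
  )
  # Count simulated mutations by type.
  print(table(custom$maf$Variant_Classification))
}
```

Passing a function rather than a fixed vector of rates keeps the sketch independent of n_samples, so the same function can be reused for simulations of different sizes.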
Value
A list with two elements, 'maf' and 'gene_lengths':
maf (dataframe)
A table with three columns: 'Tumor_Sample_Barcode', 'Hugo_Symbol' and 'Variant_Classification', listing the mutations occurring in the simulated example.
gene_lengths (dataframe)
A table with two columns: 'Hugo_Symbol' and 'gene_lengths'.
Examples
# Generate some random data
data <- generate_maf_data(n_samples = 10, n_genes = 20)
# See the first rows of the maf table.
print(head(data$maf))
# See the first rows of the gene_lengths table.
print(head(data$gene_lengths))
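The seed_id and output_rates arguments can be explored with a sketch like the one below. It first shows, in base R, that resetting the seed reproduces the same random draws, then (assuming seed_id is passed to set.seed() as described, and only if ICBioMark is installed) checks whether two runs with the same seed_id give the same MAF table and lists the elements returned when output_rates = TRUE. The helper draw_with_seed is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: reset the seed, then draw five Poisson counts.
draw_with_seed <- function(seed) {
  set.seed(seed)
  stats::rpois(5, lambda = 2)
}

# The same seed gives the same draws.
same_draws <- identical(draw_with_seed(1234), draw_with_seed(1234))
print(same_draws)

# Only call the package if it is available.
if (requireNamespace("ICBioMark", quietly = TRUE)) {
  run1 <- ICBioMark::generate_maf_data(n_samples = 10, n_genes = 20, seed_id = 1)
  run2 <- ICBioMark::generate_maf_data(n_samples = 10, n_genes = 20, seed_id = 1)
  # Compare the simulated MAF tables from the two runs.
  print(identical(run1$maf, run2$maf))

  # Inspect which elements are returned when rates are requested.
  with_rates <- ICBioMark::generate_maf_data(
    n_samples = 10, n_genes = 20, output_rates = TRUE
  )
  print(names(with_rates))
}
```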