R: Remove non-common genes from data frame

filter_common_genes {tidyestimate}

R Documentation

Remove non-common genes from data frame

Description

Because ESTIMATE score calculation is sensitive to the number of genes used, a set of genes common to six platforms has been established (see ?tidyestimate::common_genes). This function keeps only those genes.

Usage

filter_common_genes(
  df,
  id = c("entrezgene_id", "hgnc_symbol"),
  tidy = FALSE,
  tell_missing = TRUE,
  find_alias = FALSE
)

Arguments

`df`	a `data.frame` of RNA expression values, with columns corresponding to samples, and rows corresponding to genes. Either rownames or the first column can contain gene IDs (see `tidy`)
`id`	either `"entrezgene_id"` or `"hgnc_symbol"`, whichever `df` contains.
`tidy`	logical. If the rownames contain the gene identifiers, set to `FALSE`. If the first column contains the gene identifiers, set to `TRUE`.
`tell_missing`	logical. If `TRUE`, prints a message listing the genes in the common gene set that are not in the supplied data frame.
`find_alias`	logical. If `TRUE` and `id = "hgnc_symbol"`, attempts to determine whether genes in `common_genes` that are missing from `df` are present under an alias. See Details for more information.
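As an illustration of the two input layouts that `tidy` distinguishes, the sketch below builds the same toy expression data both ways. The gene symbols, sample names, and values are made up for the example, and the calls to `filter_common_genes()` run only if tidyestimate is installed.

```r
# Toy expression data: rows are genes, columns are samples.
# Gene symbols and values are illustrative only.
rownames_layout <- data.frame(
  sample_1 = c(5.1, 2.3, 7.8),
  sample_2 = c(4.9, 2.7, 8.1),
  row.names = c("TP53", "EGFR", "MYC")
)

# The same data with gene identifiers moved into the first column.
first_column_layout <- data.frame(
  hgnc_symbol = rownames(rownames_layout),
  rownames_layout,
  row.names = NULL,
  stringsAsFactors = FALSE
)

if (requireNamespace("tidyestimate", quietly = TRUE)) {
  # Identifiers in rownames: tidy = FALSE
  by_rownames <- tidyestimate::filter_common_genes(
    rownames_layout, id = "hgnc_symbol", tidy = FALSE, tell_missing = FALSE
  )
  # Identifiers in the first column: tidy = TRUE
  by_column <- tidyestimate::filter_common_genes(
    first_column_layout, id = "hgnc_symbol", tidy = TRUE, tell_missing = FALSE
  )
}
```

`tell_missing = FALSE` is used here only to keep the output short, since a three-gene toy data frame lacks nearly all of the common genes.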

Details

The find_alias argument will attempt to find aliases for HGNC symbols that are in tidyestimate::common_genes but missing from the provided dataset. This will only run if find_alias = TRUE and id = "hgnc_symbol".

This algorithm is very conservative. It makes a match only if the gene from the common genes has exactly one alias that matches exactly one gene in the provided dataset, and that dataset gene in turn matches only a single gene from the list of common genes (note that a single gene may have many aliases). Once a match has been made, the gene in the provided dataset is renamed to the gene name in the common gene list.
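The one-to-one rule described above can be sketched in base R as follows. This is an illustration of the rule only, not the package's implementation: the alias table, its column names (`common`, `alias`), and all gene names are hypothetical.

```r
# Hypothetical alias table: one row per (common gene, alias) pair.
alias_table <- data.frame(
  common = c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_D"),
  alias  = c("A_OLD",  "A_ALT",  "SHARED", "SHARED", "D_OLD",  "D_ALT"),
  stringsAsFactors = FALSE
)

# Hypothetical gene symbols present in the user's dataset.
dataset_genes <- c("A_OLD", "SHARED", "D_OLD", "D_ALT", "GENE_E")

# Candidate pairs: the common gene is missing from the dataset,
# but one of its aliases is present.
candidates <- alias_table[
  !(alias_table$common %in% dataset_genes) &
    alias_table$alias %in% dataset_genes,
]

# Count matches in both directions.
hits_per_common <- table(candidates$common)
hits_per_alias  <- table(candidates$alias)

# Keep a pair only if it is unambiguous in both directions.
one_to_one <- candidates[
  as.vector(hits_per_common[candidates$common]) == 1 &
    as.vector(hits_per_alias[candidates$alias]) == 1,
]

# Rename the matched dataset genes to the common gene names.
renamed <- dataset_genes
renamed[match(one_to_one$alias, renamed)] <- one_to_one$common
```

In this toy data only `A_OLD` is renamed (to `GENE_A`). `SHARED` is skipped because it is an alias of two common genes, and `GENE_D` is skipped because two of its aliases appear in the dataset.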

While this method is fairly accurate, it is also a heuristic. Therefore, it is disabled by default. Users should check which genes are reassigned to ensure accuracy.

The method used to generate these aliases is described at ?tidyestimate::common_genes.

Value

A tibble, with gene identifiers as the first column

Examples

filter_common_genes(ov, id = "hgnc_symbol", tidy = FALSE, tell_missing = TRUE, find_alias = FALSE)

[Package tidyestimate version 1.1.1 Index]