R: Plot discrete observation frequencies per sample and per...

freqPlot {dittoViz}

R Documentation

Plot discrete observation frequencies per sample and per grouping

Description

Plot discrete observation frequencies per sample and per grouping

Usage

freqPlot(
  data_frame,
  var,
  sample.by = NULL,
  group.by,
  color.by = group.by,
  vars.use = NULL,
  scale = c("percent", "count"),
  max.normalize = FALSE,
  plots = c("boxplot", "jitter"),
  split.nrow = NULL,
  split.ncol = NULL,
  split.adjust = list(),
  rows.use = NULL,
  data.out = FALSE,
  data.only = FALSE,
  do.hover = FALSE,
  hover.round.digits = 5,
  color.panel = dittoColors(),
  colors = seq_along(color.panel),
  y.breaks = NULL,
  min = 0,
  max = NA,
  var.labels.rename = NULL,
  var.labels.reorder = NULL,
  x.labels = NULL,
  x.labels.rotate = TRUE,
  x.reorder = NULL,
  theme = theme_classic(),
  xlab = group.by,
  ylab = "make",
  main = "make",
  sub = NULL,
  jitter.size = 1,
  jitter.width = 0.2,
  jitter.color = "black",
  jitter.position.dodge = boxplot.position.dodge,
  do.raster = FALSE,
  raster.dpi = 300,
  boxplot.width = 0.4,
  boxplot.color = "black",
  boxplot.show.outliers = NA,
  boxplot.outlier.size = 1.5,
  boxplot.fill = TRUE,
  boxplot.position.dodge = vlnplot.width,
  boxplot.lineweight = 1,
  vlnplot.lineweight = 1,
  vlnplot.width = 1,
  vlnplot.scaling = "area",
  vlnplot.quantiles = NULL,
  ridgeplot.lineweight = 1,
  ridgeplot.scale = 1.25,
  ridgeplot.ymax.expansion = NA,
  ridgeplot.shape = c("smooth", "hist"),
  ridgeplot.bins = 30,
  ridgeplot.binwidth = NULL,
  add.line = NULL,
  line.linetype = "dashed",
  line.color = "black",
  legend.show = TRUE,
  legend.title = color.by
)

Arguments

`data_frame`	A data.frame where columns are features and rows are observations you might wish to visualize.
`var`	Single string representing the name of a column of `data_frame` that contains the discrete data you wish to quantify as frequencies.
`sample.by`	Single string representing the name of a column of `data_frame` that contains an indicator of which sample each observation belongs to. Note that when this is not provided, there will only be one data point per grouping. A warning can be expected then for all `plots` options except `"jitter"`.
`group.by`	Single string representing the name of a column of `data_frame` containing discrete data to use for separating the data points into groups.
`color.by`	Single string representing the name of a column of `data_frame` containing discrete data to use for setting data representation color fills. This data does not need to be the same as `group.by`, which is great for highlighting supersets or subgroups when wanted, but it defaults to `group.by` so the input can often be skipped.
`vars.use`	String or string vector naming a subset of the values of `var`-data which should be shown. If left as `NULL`, all values are shown. Hint: use `colLevels` or `unique(data_frame[,var])` to assess options. Note: When `var.labels.rename` is jointly utilized to update how the `var`-values are shown, the updated values must be used.
`scale`	"count" or "percent". Sets whether data should be shown as counts versus percentage.
`max.normalize`	Logical which sets whether the data for each `var`-data value (each facet) should be normalized to have the same maximum value. When set to `TRUE`, lower frequency `var`-values will make use of just as much plot space as higher frequency vars. Note: Similarly equal plot space utilization can be achieved by using `split.adjust = list(scales = "free_y")`, and that alternative route retains original values of the data.
`plots`	String vector which sets the types of plots to include: possibilities = "jitter", "boxplot", "vlnplot", "ridgeplot". Order matters: c("vlnplot", "boxplot", "jitter") will put a violin plot in the back, boxplot in the middle, and then individual dots in the front. See details section for more info.
`split.nrow`, `split.ncol`	Integers which set the dimensions of the facet grid.
`split.adjust`	A named list which allows extra parameters to be pushed through to the faceting function call. List elements should be valid inputs to the faceting function `facet_wrap`, e.g. 'list(scales = "free_y")'. See `facet_wrap` for options.
`rows.use`	String vector of rownames of `data_frame` OR an integer vector specifying the row-indices of data points which should be plotted. Alternatively, a Logical vector, the same length as the number of rows in `data_frame`, where `TRUE` values indicate which rows to plot.
`data.out`	Logical. When set to `TRUE`, changes the output from the plot alone to a list containing the plot (`p`) and its underlying data (`data`).
`data.only`	Logical. When set to `TRUE`, the underlying data will be returned, but not the plot itself.
`do.hover`	Logical which sets whether the ggplot output should be converted to a ggplotly object with data about individual data points displayed when you hover your cursor over them.
`hover.round.digits`	Integer number specifying the number of decimal digits to round displayed numeric values to, when `do.hover` is set to `TRUE`.
`color.panel`	String vector which sets the colors to draw from for data representation fills. Default = `dittoColors()`. A named vector can be used if names are matched to the distinct values of the `color.by` data.
`colors`	Integer vector, the indexes / order, of colors from `color.panel` to actually use. Useful for quickly swapping around colors of the default set (when not using names for color matching).
`y.breaks`	Numeric vector, a set of breaks that should be used as major grid lines. c(break1,break2,break3,etc.).
`min`, `max`	Scalars which control the zoom on the continuous axis of the plot.
`var.labels.rename`	String vector for renaming the distinct identities of `var`-values. This vector must be the same length as the number of levels or unique values in the `var`-data. Hint: use `colLevels` or `unique(data_frame[,var])` to check the original values.
`var.labels.reorder`	Integer vector. A sequence of numbers, from 1 to the number of distinct `var`-value identities, for rearranging the order of facets within the plot space. Method: Make a first plot without this input. Then, treat the top-left-most facet as index 1 and the bottom-right-most as index n. Values of `var.labels.reorder` should be these indices, but in the order that you would like them rearranged to be.
`x.labels`	String vector, c("label1","label2","label3",...) which overrides the names of groupings.
`x.labels.rotate`	Logical which sets whether the labels should be rotated. Default: `TRUE` for violin and box plots, but `FALSE` for ridgeplots.
`x.reorder`	Integer vector. A sequence of numbers, from 1 to the number of groupings, for rearranging the order of x-axis groupings. Method: Make a first plot without this input. Then, treat the leftmost grouping as index 1 and the rightmost as index n. Values of `x.reorder` should be these indices, but in the order that you would like them rearranged to be. Recommendation for advanced users: If you find yourself coming back to this input too many times, an alternative solution that can be easier long-term is to make the target data into a factor, and to put its levels in the desired order: `factor(data, levels = c("level1", "level2", ...))`.
`theme`	A ggplot theme which will be applied before internal adjustments. Default = `theme_classic()`. See https://ggplot2.tidyverse.org/reference/ggtheme.html for other options and ideas.
`xlab`	String which sets the grouping-axis label (=x-axis for box and violin plots, y-axis for ridgeplots). Set to `NULL` to remove.
`ylab`	String, sets the continuous-axis label (=y-axis for box and violin plots, x-axis for ridgeplots). Default = "make" and if left as make, this title will be automatically generated.
`main`	String, sets the plot title. Default = "make" and if left as make, a title will be automatically generated. To remove, set to `NULL`.
`sub`	String, sets the plot subtitle.
`jitter.size`	Scalar which sets the size of the jitter shapes.
`jitter.width`	Scalar that sets the width/spread of the jitter in the x direction. Ignored in ridgeplots. Note for when `color.by` is used to split x-axis groupings into additional bins: ggplot does not shrink jitter widths accordingly, so be sure to do so yourself! Ideally, needs to be 0.5/num_subgroups.
`jitter.color`	String which sets the color of the jitter shapes
`jitter.position.dodge`	Scalar which adjusts the relative distance between jitter widths when multiple subgroups exist per `group.by` grouping (a.k.a. when `group.by` and `color.by` are not equal). Similar to `boxplot.position.dodge` input & defaults to the value of that input so that BOTH will actually be adjusted when only, say, `boxplot.position.dodge = 0.3` is given.
`do.raster`	Logical. When set to `TRUE`, rasterizes the jitter plot layer, changing it from individually encoded points to a flattened set of pixels. This can be useful for editing in external programs (e.g. Illustrator) when there are many thousands of data points.
`raster.dpi`	Number indicating dots/pixels per inch (dpi) to use for rasterization. Default = 300.
`boxplot.width`	Scalar which sets the width/spread of the boxplot in the x direction
`boxplot.color`	String which sets the color of the lines of the boxplot
`boxplot.show.outliers`	Logical, whether outliers should be included in the boxplot. Default is `FALSE` when there is a jitter plotted, `TRUE` if there is no jitter.
`boxplot.outlier.size`	Scalar which adjusts the size of points used to mark outliers.
`boxplot.fill`	Logical, whether the boxplot should be filled in or not. Known bug: when boxplot fill is turned off, outliers do not render.
`boxplot.position.dodge`	Scalar which adjusts the relative distance between boxplots when multiple are drawn per grouping (a.k.a. when `group.by` and `color.by` are not equal). By default, this input actually controls the value of `jitter.position.dodge` unless the `jitter` version is provided separately.
`boxplot.lineweight`	Scalar which adjusts the thickness of boxplot lines.
`vlnplot.lineweight`	Scalar which sets the thickness of the line that outlines the violin plots.
`vlnplot.width`	Scalar which sets the width/spread of violin plots in the x direction
`vlnplot.scaling`	String which sets how the widths of the violin plots are set in relation to each other. Options are "area", "count", and "width". If the default is not right for your data, I recommend trying "width". For an explanation of each, see `geom_violin`.
`vlnplot.quantiles`	Single number or numeric vector of values in [0,1] naming quantiles at which to draw a horizontal line within each violin plot. Example: `c(0.1, 0.5, 0.9)`
`ridgeplot.lineweight`	Scalar which sets the thickness of the ridgeplot outline.
`ridgeplot.scale`	Scalar which sets the distance/overlap between ridgeplots. A value of 1 means the tallest density curve just touches the baseline of the next higher one. Higher numbers lead to greater overlap. Default = 1.25
`ridgeplot.ymax.expansion`	Scalar which adjusts the minimal space between the topmost grouping and the top of the plot in order to ensure the curve is not cut off by the plotting grid. The larger the value, the greater the space requested. When left as NA, dittoViz will attempt to determine an ideal value itself based on the number of groups & linear interpolation between these goal posts: #groups of 3 or fewer: 0.6; #groups=12: 0.1; #groups of 34 or greater: 0.05.
`ridgeplot.shape`	Either "smooth" or "hist", sets whether ridges will be smoothed (the typical, and default) versus rectangular like a histogram. (Note: as of the time shape "hist" was added, combination of jittered points is not supported by the `stat_binline` that dittoViz relies on.)
`ridgeplot.bins`	Integer which sets how many chunks to break the x-axis into when `ridgeplot.shape = "hist"`. Overridden by `ridgeplot.binwidth` when that input is provided.
`ridgeplot.binwidth`	Integer which sets the width of chunks to break the x-axis into when `ridgeplot.shape = "hist"`. Takes precedence over `ridgeplot.bins` when provided.
`add.line`	numeric value(s) where one or multiple line(s) should be added
`line.linetype`	String which sets the type of line for `add.line`. Defaults to "dashed", but any ggplot linetype will work.
`line.color`	String that sets the color(s) of the `add.line` line(s)
`legend.show`	Logical. Whether the legend should be displayed. Default = `TRUE`.
`legend.title`	String or `NULL`, sets the title for the main legend which includes colors and data representations.
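
The default described for `ridgeplot.ymax.expansion` (linear interpolation between three goal posts, held constant outside them) can be pictured with base R's `approx()`. This is only a sketch of the rule as stated in the argument description. The helper name `default_ymax_expansion` is hypothetical and is not part of dittoViz, whose internal calculation may differ.

```r
# Hypothetical helper, NOT a dittoViz function: illustrates the documented
# goal posts (3 groups -> 0.6, 12 groups -> 0.1, 34 groups -> 0.05).
default_ymax_expansion <- function(n_groups) {
    approx(
        x = c(3, 12, 34),
        y = c(0.6, 0.1, 0.05),
        xout = n_groups,
        rule = 2  # hold the end values for n_groups below 3 or above 34
    )$y
}

expansion <- default_ymax_expansion(c(2, 3, 12, 34, 50))
expansion
```

Supplying a number to `ridgeplot.ymax.expansion` directly bypasses any such automatic choice.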

Details

The function creates a dataframe containing counts and percent makeup of var identities per sample if sample.by is given, or per group if only group.by is given. color.by can optionally be used to add subgroupings to calculations and ultimate plots, or to convey super-groups of group.by groupings.
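
As a rough picture of that summary step, the base R sketch below computes counts and percent makeup of a discrete variable per sample. The toy data, the column names, and the shape of `freq_df` are assumptions for illustration only. The data frame that freqPlot actually returns (for example via `data.out = TRUE`) may use different columns and a different percent scale.

```r
# Toy data; column names are hypothetical stand-ins for 'sample.by' and 'var'.
df <- data.frame(
    sample = c("s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"),
    clustering = factor(c("1", "1", "1", "2", "1", "2", "2", "2"))
)

# Counts of each 'var' value within each sample (rows = samples).
counts <- table(df$sample, df$clustering)

# Percent makeup, expressed here as a fraction of each sample's total.
percent <- prop.table(counts, margin = 1)

# Long format: one row per sample-by-var-value combination.
# as.vector() on a table runs column by column, hence the rep() patterns.
freq_df <- data.frame(
    sample = rep(rownames(counts), times = ncol(counts)),
    label = rep(colnames(counts), each = nrow(counts)),
    count = as.vector(counts),
    percent = as.vector(percent)
)
freq_df
```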

Typically, var might target clustering or observation-type annotations, but in truth it can be given any discrete data.

If a set of rows to use was indicated with the rows.use input, only the targeted rows are used for counts and percent makeup calculations. In other words, the rows.use input adjusts the universe that frequencies are calculated within.

If a set of var-values to show is indicated with the vars.use input, the data.frame is trimmed at the end to include only the corresponding rows. Thus, this input does not affect the universe for frequency calculation.
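
The difference between the two inputs comes down to whether subsetting happens before or after the frequency calculation. The base R sketch below uses a single toy sample with hypothetical values. It mimics the order of operations described above rather than calling dittoViz itself.

```r
# One toy sample's cluster assignments (hypothetical data).
clustering <- c("1", "1", "2", "3", "3", "3", "4", "4")

# vars.use-style: calculate over ALL observations, then trim what is shown.
pct_all <- prop.table(table(clustering))
pct_vars_use <- pct_all[c("1", "2")]

# rows.use-style: restrict the observations first, then calculate.
keep <- clustering %in% c("1", "2")
pct_rows_use <- prop.table(table(clustering[keep]))

pct_vars_use  # fractions out of all 8 observations
pct_rows_use  # fractions out of the 3 retained observations
```

The rows.use-style fractions are larger because the denominator shrinks to the retained observations.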

If max.normalize is set to TRUE, counts and percent data are transformed to a 0-1 scale, which is one method for making better use of white space for lower frequency var-values. Alternatively, split.adjust = list(scales = "free_y") can be used to achieve the same white-space utilization while retaining original data values.
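
One way to picture this normalization is to divide each value by the maximum observed for its var-value, so every facet tops out at 1. The sketch below assumes exactly that. The toy data and the `percent.norm` column name are hypothetical, and dittoViz's internal transformation is not shown in this page.

```r
# Toy per-sample frequencies for two var-values (facets): "A" is common, "B" is rare.
freq <- data.frame(
    label = c("A", "A", "A", "B", "B", "B"),
    percent = c(0.60, 0.50, 0.40, 0.04, 0.02, 0.01)
)

# Divide each value by the maximum within its own var-value.
# ave() returns the per-group maximum repeated for every row of that group.
freq$percent.norm <- freq$percent / ave(freq$percent, freq$label, FUN = max)
freq
```

After this step both facets span up to 1, but the original magnitudes are no longer readable from the axis, which is the trade-off that the "free_y" alternative avoids.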

Either percent of total (scale = "percent"), which is the default, or counts (if scale = "count") data is then (gg)plotted with the data representation types in plots by utilizing the same machinery as yPlot. Faceting by var-data values is utilized to achieve per var-value (e.g. cluster) granularity.

See below for additional customization options!

Value

A ggplot plot where frequencies of discrete var-data per sample, grouped by condition, timepoint, etc., are shown on the y-axis by a violin plot, boxplot, and/or jittered points, or on the x-axis by a ridgeplot with or without jittered points.

Alternatively, if data.out = TRUE, a list containing the plot ("p") and a dataframe of the underlying data ("data").

Alternatively, if do.hover = TRUE, a plotly conversion of the ggplot output in which underlying data can be retrieved upon hovering the cursor over the plot.

Calculation Details

The function is restricted in that each sample's observations, indicated by the unique values of sample.by-data, must exist within single group.by and color.by groupings. Thus, in order to ensure all valid var-data composition data points are generated, prior to calculations...

var-data are ensured to be a factor, which ensures a calculation will be run for every var-value (a.k.a. cluster)
group.by-data and color.by-data are treated as non-factor data, which ensures that calculations are run only for the groupings that each sample is associated with.
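
The reason for treating var-data as a factor can be seen with base R's `table()`: levels with no observations are still counted, as zero, when the data is a factor. A small sketch with hypothetical values:

```r
# A toy sample with no observations in cluster "3" (hypothetical data).
sample_a <- c("1", "1", "2")
all_levels <- c("1", "2", "3")

# As plain character data, cluster "3" is silently absent from the counts.
counts_chr <- table(sample_a)

# As a factor with all levels declared, cluster "3" is counted as 0.
counts_fct <- table(factor(sample_a, levels = all_levels))

counts_chr
counts_fct
```

Keeping group.by and color.by as non-factor data has the opposite effect, so no zero-count entries are produced for groupings that a sample does not belong to.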

Plot Customization

The plots argument determines the types of data representation that will be generated, as well as their order from back to front. Options are "jitter", "boxplot", "vlnplot", and "ridgeplot".

Each plot type has specific associated options which are controlled by variables that start with their associated string. For example, all jitter adjustments start with "jitter.", such as jitter.size and jitter.width.

Inclusion of "ridgeplot" overrides "boxplot" and "vlnplot" presence and changes the plot to be horizontal.

Additionally:

Colors can be adjusted with color.panel.
Subgroupings: color.by can be utilized to split major group.by groupings into subgroups. When this is done in y-axis plotting, dittoViz automatically ensures the centers of all geoms will align, but users will need to manually adjust jitter.width to less than 0.5/num_subgroups to avoid overlaps. Three inputs control geom-center placement: vlnplot.width, boxplot.position.dodge, and jitter.position.dodge. The easiest way to adjust all of them at once is to set just vlnplot.width, because the other two default to its value.
Line(s) can be added at single or multiple value(s) by providing these values to add.line. Linetype and color are set with line.linetype, which is "dashed" by default, and line.color, which is "black" by default.
Titles and axes labels can be adjusted with main, sub, xlab, ylab, and legend.title arguments.
The legend can be hidden by setting legend.show = FALSE.
y-axis zoom and tick marks can be adjusted using min, max, and y.breaks.
x-axis labels and groupings can be changed / reordered using x.labels and x.reorder, and rotation of these labels can be turned on/off with x.labels.rotate = TRUE/FALSE.
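
Two of the adjustments above can be prepared in plain R before plotting: the jitter.width bound of 0.5/num_subgroups, and the factor-level alternative to x.reorder recommended in that argument's description. The toy data and column names below are hypothetical. The sketch only computes values and does not call dittoViz.

```r
# Toy data: every 'category' grouping is split into two 'subcategory' subgroups.
df <- data.frame(
    category = rep(c("B", "A", "C"), each = 2),
    subcategory = rep(c("x", "y"), times = 3)
)

# Upper bound for jitter.width: 0.5 divided by the largest number of
# subgroups found within any single grouping.
num_subgroups <- max(tapply(
    df$subcategory, df$category,
    function(v) length(unique(v))
))
jitter_width <- 0.5 / num_subgroups

# Alternative to x.reorder: fix the grouping order once via factor levels.
df$category <- factor(df$category, levels = c("C", "A", "B"))
levels(df$category)
```

The resulting `jitter_width` could then be passed as `jitter.width = jitter_width` in a freqPlot call that sets color.by to the subgroup column.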

Author(s)

Daniel Bunis

Examples

example("dittoExampleData", echo = FALSE)

# There are three main inputs for this function, in addition to 'data_frame'.
#  var = typically this will be observation-type annotations or clustering
#    This is the set of observations for which we will calculate frequencies
#    (per each unique value of this data) within each sample
#  sample.by = the name of a column containing sample assignments
#    We'll treat all observations with the same value in this column as part
#    of the same sample.
#  group.by = how to group samples together
freqPlot(example_df,
    var = "clustering",
    sample.by = "sample",
    group.by = "category")

# 'color.by' can also be set differently from 'group.by' to have the effect
#  of highlighting supersets or subgroupings:
freqPlot(example_df, "clustering",
    group.by = "category",
    sample.by = "sample",
    color.by = "subcategory")

# The var-values shown can be subset with 'vars.use'
freqPlot(example_df, "clustering",
    group.by = "category", sample.by = "sample", color.by = "subcategory",
    vars.use = 1:2)

# Particular observations can be ignored from calculations and plotting using
#   the 'rows.use' input. Note that doing so adjusts the universe in which
#   frequencies are calculated; all frequencies will now be calculated out of
#   the 'rows.use' observations.
#   This can be useful for quantifying subtypes within a given supertype,
#     rather than per all observations.
#   For our example, we'll calculate among clusters 1 and 2, treating clusters 3
#     and 4 observations as part of an unwanted other group of data. You'll
#     notice that frequencies are higher here than when we used 'vars.use' in
#     the previous example.
freqPlot(example_df, "clustering",
    group.by = "category", sample.by = "sample", color.by = "subcategory",
    rows.use = example_df$clustering %in% 1:2)

# Lower frequency targets can be expanded to use the entire y-axis by:
#  turning on 'max.normalize'-ation:
freqPlot(example_df, "clustering",
    group.by = "category", sample.by = "sample", color.by = "subcategory",
    max.normalize = TRUE)
#  or by setting y-scale limits to be set by the contents of facets:
freqPlot(example_df, "clustering",
    group.by = "category", sample.by = "sample", color.by = "subcategory",
    split.adjust = list(scales = "free_y"))

# Data representations can also be selected and reordered with the 'plots'
#  input, and further adjusted with inputs applying to each representation.
freqPlot(example_df,
    var = "clustering", sample.by = "sample", group.by = "category",
    plots = c("vlnplot", "boxplot", "jitter"),
    vlnplot.lineweight = 0.2,
    boxplot.fill = FALSE,
    boxplot.lineweight = 0.2)

# Finally, 'sample.by' is not technically required. When not given, a
#  single data point of overall composition stats will be shown for each
#  grouping.
#  Just note, all data representations other than "jitter" will complain
#  due to there only being the one datapoint per group unless you set
#  plots to "jitter".
freqPlot(example_df,
    var = "clustering", group.by = "category", color.by = "subcategory",
    plots = "jitter")

[Package dittoViz version 1.0.1 Index]
