scatterHex {dittoViz}

R Documentation

scatter plot where observations are grouped into hexagonal bins and then summarized

Description

scatter plot where observations are grouped into hexagonal bins and then summarized

Usage

scatterHex(
  data_frame,
  x.by,
  y.by,
  color.by = NULL,
  bins = 30,
  color.method = NULL,
  split.by = NULL,
  rows.use = NULL,
  color.panel = dittoColors(),
  colors = seq_along(color.panel),
  x.adjustment = NULL,
  y.adjustment = NULL,
  color.adjustment = NULL,
  x.adj.fxn = NULL,
  y.adj.fxn = NULL,
  color.adj.fxn = NULL,
  multivar.split.dir = c("col", "row"),
  split.nrow = NULL,
  split.ncol = NULL,
  split.adjust = list(),
  min.density = NA,
  max.density = NA,
  min.color = "#F0E442",
  max.color = "#0072B2",
  min.opacity = 0.2,
  max.opacity = 1,
  min = NA,
  max = NA,
  rename.color.groups = NULL,
  xlab = x.by,
  ylab = y.by,
  main = "make",
  sub = NULL,
  theme = theme_bw(),
  do.contour = FALSE,
  contour.color = "black",
  contour.linetype = 1,
  do.ellipse = FALSE,
  do.label = FALSE,
  labels.size = 5,
  labels.highlight = TRUE,
  labels.repel = TRUE,
  labels.split.by = split.by,
  labels.repel.adjust = list(),
  add.trajectory.by.groups = NULL,
  add.trajectory.curves = NULL,
  trajectory.group.by,
  trajectory.arrow.size = 0.15,
  add.xline = NULL,
  xline.linetype = "dashed",
  xline.color = "black",
  add.yline = NULL,
  yline.linetype = "dashed",
  yline.color = "black",
  legend.show = TRUE,
  legend.color.title = "make",
  legend.color.breaks = waiver(),
  legend.color.breaks.labels = waiver(),
  legend.density.title = "Observations",
  legend.density.breaks = waiver(),
  legend.density.breaks.labels = waiver(),
  show.grid.lines = TRUE,
  data.out = FALSE
)

Arguments

`data_frame`	A data_frame where columns are features and rows are observations you might wish to visualize.
`x.by`, `y.by`	Single strings denoting the name of a column of `data_frame` containing numeric data to use for the x- and y-axis of the scatterplot.
`color.by`	Single string denoting the name of a column of `data_frame` to use, instead of point density, for setting the color of plotted hexagons. Alternatively, a string vector naming multiple such columns of data to plot at once.
`bins`	Numeric or numeric vector giving the number of hexagonal bins in the x and y directions. Set to 30 by default.
`color.method`	Single string that specifies how `color.by` data should be summarized for each hexagonal bin. Options, and the default, depend on whether the `color.by` data is continuous or discrete. Continuous: a string naming the function used to summarize the target data for each bin. This can be any function that takes a numeric vector and returns a single numeric value. Default is `median`. Other useful options are `sum`, `mean`, `sd`, or `max`. You can also use a custom function as long as you give it a name; e.g. first run `logsum <- function(x) { log(sum(x)) }` externally, then give `color.method = "logsum"`. Discrete: a string signifying whether the color should be based simply on the "max" grouping of the bin (default), or on "max.prop", the maximum proportion of observations belonging to any grouping.
`split.by`	1 or 2 strings denoting the name(s) of column(s) of `data_frame` containing discrete data to use for faceting / separating data points into separate plots. When 2 columns are named, c(row,col), the first is used as rows and the second is used for columns of the resulting facet grid. When 1 column is named, shape control can be achieved with `split.nrow` and `split.ncol`.
`rows.use`	String vector of rownames of `data_frame` OR an integer vector specifying the row-indices of data points which should be plotted. Alternatively, a logical vector, the same length as the number of rows in `data_frame`, where `TRUE` values indicate which rows to plot.
`color.panel`	String vector which sets the colors to draw from when `color.by` indicates discrete data. `dittoColors()` by default, see `dittoColors` for contents. A named vector can be used if names are matched to the distinct values of the `color.by` data.
`colors`	Integer vector, the indexes / order, of colors from `color.panel` to actually use. Useful for quickly swapping around colors of the default set (when not using names for color matching).
`x.adjustment`, `y.adjustment`, `color.adjustment`	A recognized string indicating whether numeric `x.by`, `y.by`, and `color.by` data should be used directly (default) or should be adjusted. Options are "z-score": scaled with the scale() function to produce a relative-to-mean z-score representation; or "relative.to.max": divided by the maximum value to give fraction-of-max values in [0,1]. Ignored if the target data is not numeric, as these known adjustments target numeric data only. In order to leave the unedited data available for use in other features, the adjusted data are put in a new column and that new column is used for plotting.
`x.adj.fxn`, `y.adj.fxn`, `color.adj.fxn`	If you wish to apply a function to edit the `x.by`, `y.by`, or `color.by` data before use, in a way not possible with the `color.adjustment` input, this input can be given a function which takes in a vector of values as input and returns a vector of values of the same length as output. For example, `function(x) {log2(x)}` or `as.factor`. In order to leave the unedited data available for use in other features, the adjusted data are put in a new column and that new column is used for plotting.
`multivar.split.dir`	"row" or "col", sets the direction of faceting used for the `color.by` variables when `color.by` is given multiple column names AND `split.by` is used to provide an additional feature to facet by.
`split.nrow`, `split.ncol`	Integers which set the dimensions of faceting/splitting when faceting by a single feature.
`split.adjust`	A named list which allows extra parameters to be pushed through to the faceting function call. List elements should be valid inputs to the faceting functions, e.g. 'list(scales = "free")'. For options, when giving 1 column to `split.by`, see `facet_wrap`, OR when giving 2 columns to `split.by`, see `facet_grid`.
`min.density`, `max.density`	Number which sets the min/max values used for the density scale. Used no matter whether density is represented through opacity or color.
`min.color`, `max.color`	Color for the min/max values of the color scale.
`min.opacity`, `max.opacity`	Scalar between [0,1] which sets the minimum or maximum opacity used for the density legend (when color is used for `color.by` data and density is shown via opacity).
`min`, `max`	Number which sets the values associated with the minimum or maximum color for `color.by` data.
`rename.color.groups`	String vector which sets new names for the identities of `color.by` groups.
`xlab`, `ylab`	Strings which set the labels for the axes. To remove, set to `NULL`.
`main`	String, sets the plot title. The default title is either "Density", `color.by`, or NULL, depending on the identity of `color.by`. To remove, set to `NULL`.
`sub`	String, sets the plot subtitle.
`theme`	A ggplot theme which will be applied before internal adjustments. Default = `theme_bw()`. See https://ggplot2.tidyverse.org/reference/ggtheme.html for other options and ideas.
`do.contour`	Logical. Whether density-based contours should be displayed.
`contour.color`	String that sets the color of the `do.contour` contours.
`contour.linetype`	String or numeric which sets the type of line used for `do.contour` contours. Defaults to 1 (a solid line), but see `linetype` for other options.
`do.ellipse`	Logical. Whether `color.by` groups should be surrounded by median-centered ellipses.
`do.label`	Logical. Whether to add text labels near the center (median) of `color.by` groups.
`labels.size`	Number which sets the size of labels text when `do.label = TRUE`.
`labels.highlight`	Logical. Whether labels should have a box behind them when `do.label = TRUE`.
`labels.repel`	Logical, that sets whether the labels' placements will be adjusted with ggrepel to avoid intersections between labels and plot bounds when `do.label = TRUE`. TRUE by default.
`labels.split.by`	String of one or two column names which controls the facet-split calculations for label placements. Defaults to `split.by`, so generally there is no need to adjust this unless you plan to apply faceting externally.
`labels.repel.adjust`	A named list which allows extra parameters to be pushed through to ggrepel function calls. List elements should be valid inputs to `geom_label_repel` by default, or to `geom_text_repel` when `labels.highlight = FALSE`.
`add.trajectory.by.groups`	List of vectors representing trajectory paths, each from start-group to end-group, where vector contents are the group-names indicated by the `trajectory.group.by` column of `data_frame`.
`add.trajectory.curves`	List of matrices, each representing coordinates for a trajectory path, from start to end, where matrix columns represent x and y coordinates of the paths.
`trajectory.group.by`	String denoting the name of a column of `data_frame` to use for generating trajectories from data point groups.
`trajectory.arrow.size`	Number representing the size of trajectory arrows, in inches. Default = 0.15.
`add.xline`	Numeric value(s) where one or multiple vertical line(s) should be added.
`xline.linetype`	String which sets the type of line for `add.xline`. Defaults to "dashed", but any ggplot linetype will work.
`xline.color`	String that sets the color(s) of the `add.xline` line(s).
`add.yline`	Numeric value(s) where one or multiple horizontal line(s) should be added.
`yline.linetype`	String which sets the type of line for `add.yline`. Defaults to "dashed", but any ggplot linetype will work.
`yline.color`	String that sets the color(s) of the `add.yline` line(s).
`legend.show`	Logical. Whether any legend should be displayed. Default = `TRUE`.
`legend.density.title`, `legend.color.title`	Strings which set the title for the legends.
`legend.density.breaks`, `legend.color.breaks`	Numeric vector which sets the discrete values to label in the density and color.by legends.
`legend.density.breaks.labels`, `legend.color.breaks.labels`	String vector, with same length as `legend.*.breaks`, which sets the labels for the tick marks or hex icons of the associated legend.
`show.grid.lines`	Logical which sets whether grid lines should be shown within the plot space.
`data.out`	Logical. When set to `TRUE`, changes the output from the plot alone to a list containing the plot ("plot"), a data.frame of the underlying data for target observations ("data"), and the ultimately used mapping of columns to given aesthetic sets ("cols_used"), because modification of newly made columns is required for many features.
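
As a hedged illustration of the custom-function route described for `color.method`, the sketch below defines a named summary function and passes its name as a string. `logsum` is the illustrative name from the argument description. The plotting call assumes that dittoViz and the suggested package ggplot.multistats are installed, and that `example_df` is created by the dittoExampleData example, as in the Examples section.

```r
# Custom per-bin summary: takes a numeric vector, returns a single number.
logsum <- function(x) {
    log(sum(x))
}

# Quick check on a toy vector; equals log(6).
logsum(c(1, 2, 3))

# Guarded: needs dittoViz plus the suggested package 'ggplot.multistats',
# which the Examples section notes is required when 'color.by' is used.
if (requireNamespace("dittoViz", quietly = TRUE) &&
    requireNamespace("ggplot.multistats", quietly = TRUE)) {
    library(dittoViz)
    example("dittoExampleData", echo = FALSE)  # assumed to create 'example_df'
    print(scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        color.by = "gene1",
        color.method = "logsum"))  # the function is referenced by its name
}
```

The function must exist under that name before the call, since only the string is handed to `color.method`.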

Details

This function first makes any requested adjustments to data in the given data_frame, internally only, such as scaling the color.by-column if color.adjustment was given "z-score".

Next, data_frame is subset to only the target rows based on the rows.use input.

Finally, a hex plot is created using this dataframe:

If color.by is not provided, coloring is based on the density of observations within each hex bin. When color.by is provided, density is represented through opacity while coloring is based on a summarization, chosen with the color.method input, of the target color.by data.

If split.by was used, the plot will be split into a matrix of panels based on the associated groupings.
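
To make the adjustment step concrete, here is a minimal base-R sketch of what the two recognized `*.adjustment` options compute, following the argument descriptions. It is an illustration only, not the package's internal code; the toy data frame `df` and the new column names are made up for this example.

```r
# Toy data standing in for a numeric column of 'data_frame'.
df <- data.frame(gene = c(2, 4, 6, 8))

# "z-score": scale() centers on the mean and divides by the standard deviation.
df$gene.zscore <- as.numeric(scale(df$gene))

# "relative.to.max": divide by the maximum value.
df$gene.rel <- df$gene / max(df$gene)

# The original column is untouched; adjusted values live in new columns.
df
```

Keeping the adjusted values in separate columns mirrors the documented behavior of leaving the unedited data available for other features.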

Value

A ggplot object where colored hexagonal bins are used to summarize observations in a scatter plot.

Alternatively, if data.out=TRUE, a list containing three slots is output: the plot (named 'plot'), a data.table containing the updated underlying data for target rows (named 'data'), and a list providing mappings of final column names in 'data' to given plot aesthetics (named 'cols_used'), because modification of newly made columns is required for many features.

Many characteristics of the plot can be adjusted using discrete inputs

Colors: min.color and max.color adjust the colors for continuous data.
For discrete color.by plotting with color.method = "max", colors are instead adjusted with color.panel and/or colors, and the labels of the groupings can be changed using rename.color.groups.
Titles and axis labels can be adjusted with the main, sub, xlab, ylab, legend.color.title, and legend.density.title arguments.
Legends can also be adjusted in other ways, using variables that all start with "legend." for easy tab completion lookup.

Additional Features

Other tweaks and features can be added as well. Each is accessible through tab completion starting with "do." or "add.", and if additional inputs are involved in implementing or tweaking a feature, those inputs start with the feature's name (for example, contour.color and contour.linetype for do.contour):

If do.contour is set to TRUE, density gradient contour lines will be overlaid with color and linetype adjustable via contour.color and contour.linetype.
If add.trajectory.by.groups is provided a list of vectors (each vector being group names from start-group-name to end-group-name), and a column name pointing to the relevant grouping information is provided to trajectory.group.by, then median centers of the groups will be calculated and arrows will be overlaid to show trajectory inference paths.
If add.trajectory.curves is provided a list of matrices (each matrix containing x, y coordinates from start to end), paths and arrows will be overlaid to show trajectory inference curves. Arrow size is controlled with the trajectory.arrow.size input.
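
A hedged sketch of how the two trajectory inputs might be built, reusing `example_df` from the Examples section (assumed to contain a `groups` column whose values include "A", "B", and "C"). The curve coordinates are arbitrary numbers chosen for illustration, and the plotting calls are guarded on dittoViz being installed.

```r
# A path through group medians: start-group first, end-group last.
group_paths <- list(c("A", "B", "C"))

# A free-form curve: one matrix per path, columns are x then y coordinates.
# These coordinates are arbitrary and only illustrate the expected shape.
curve_paths <- list(
    cbind(x = c(-2, 0, 2), y = c(-1, 1, 0)))

if (requireNamespace("dittoViz", quietly = TRUE)) {
    library(dittoViz)
    example("dittoExampleData", echo = FALSE)  # assumed to create 'example_df'

    # Arrows between median centers of the named groups, plus contours.
    print(scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        add.trajectory.by.groups = group_paths,
        trajectory.group.by = "groups",
        do.contour = TRUE))

    # Arrows along the supplied coordinates, with a larger arrow head.
    print(scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        add.trajectory.curves = curve_paths,
        trajectory.arrow.size = 0.3))
}
```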

Author(s)

Daniel Bunis with some code adapted from Giuseppe D'Agostino

Examples

example("dittoExampleData", echo = FALSE)

# The minimal inputs for scatterHex are the 'data_frame', and 2 column names,
#   given to 'x.by' and 'y.by', indicating which data to use for the x and y
#   axes, respectively.
scatterHex(
    example_df, x.by = "PC1", y.by = "PC2")

# 'color.by' can also be given a column name in order to represent that
#   column's data in the color of the hexes.
# Note: This capability requires the suggested package 'ggplot.multistats'.
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        color.by = "groups")
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        color.by = "gene1")
}

# Data can be "split" or faceted by a discrete variable as well.
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    split.by = "timepoint") # single split.by element
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    split.by = c("groups","SNP")) # row and col split.by elements

# Modify the look with intuitive inputs
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    show.grid.lines = FALSE,
    ylab = NULL, xlab = "PC2 by PC1",
    main = "Plot Title",
    sub = "subtitle",
    legend.density.title = "Items")
# 'max.density' is one of these intuitively named inputs that can be
#   extremely useful for saying "I only care for opacity to be decreased
#   in regions with exceptionally low observation numbers."
# (A good value for this in "real" data might be 10 or 50 or higher, but for
#   our sparse example data, we need to do a lot to show this off at all!)
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        color.by = "gene1", bins = 10,
        sub = "Default density scale")
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        color.by = "gene1", bins = 10,
        sub = "Density capped low for ignoring sparse regions",
        max.density = 2)
}

# You can restrict to only certain data points using the 'rows.use' input.
#   The input can be given rownames, indexes, or a logical vector
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    sub = "show only first 40 observations, by index",
    rows.use = 1:40)
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    sub = "show only 3 obs, by name (plotting gets a bit wonky for few points)",
    rows.use = c("obs1", "obs2", "obs25"))
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    sub = "show groups A,B,D only, by logical",
    rows.use = example_df$groups!="C")

# Many extra features are easy to add as well:
#   Each is started via an input starting with 'do.FEATURE*' or 'add.FEATURE*'
#   And when tweaks for that feature are possible, those inputs will be
#   named starting with 'FEATURE*'. For example, color.by groups can be labeled
#   with 'do.label = TRUE' and the tweaks for this feature are given with inputs
#   'labels.size', 'labels.highlight', and 'labels.repel':
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
        sub = "default labeling",
        do.label = TRUE)          # Turns on the labeling feature
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
        sub = "tweaked labeling",
        do.label = TRUE,          # Turns on the labeling feature
        labels.size = 8,          # Adjust the text size of labels
        labels.highlight = FALSE, # Removes white background behind labels
        labels.repel = FALSE)     # Turns off anti-overlap location adjustments
}

# Faceting can also be used to show multiple continuous variables side-by-side
#   by giving a vector of column names to 'color.by'.
#   This can also be combined with 1 'split.by' variable, with direction then
#   controlled via 'multivar.split.dir':
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", bins = 10,
        color.by = c("gene1", "gene2"))
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", bins = 10,
        color.by = c("gene1", "gene2"),
        split.by = "groups")
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", bins = 10,
        color.by = c("gene1", "gene2"),
        split.by = "groups",
        multivar.split.dir = "row")
}

# Sometimes, it can be useful for external editing or troubleshooting purposes
#   to see the underlying data that was directly used for plotting.
# 'data.out = TRUE' can be provided in order to obtain not just the plot
#   ("plot"), but also the "data" and "cols_used", returned as a list.
out <- scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    rows.use = 1:40,
    data.out = TRUE)
out$plot
summary(out$data)
out$cols_used

[Package dittoViz version 1.0.1 Index]