Plot {lessR} | R Documentation |
Scatterplots including Time Series and Violin/Box/Scatterplot
Description
Abbreviation:
Violin Plot only: vp
, ViolinPlot
Box Plot only: bx
, BoxPlot
Scatter Plot only: sp
, ScatterPlot
A scatterplot displays the values of a distribution, or the relationship between two distributions in terms of their joint values, as a set of points in an n-dimensional coordinate system, in which the coordinates of each point are the values of n variables for a single observation (row of data). From the identical syntax, for any combination of continuous or categorical variables x
and y
, Plot(x)
or Plot(x,y)
, where x
or y
can be a vector, by default generates a family of related 1- or 2-variable scatterplots, possibly enhanced, as well as related statistical analyses. Define a categorical variable as an R factor. If x
is a Date variable, then a time series is plotted.
Plot
produces a wide variety of scatterplots as outlined in the following list.
Variable Type | Meaning |
-------------------------- | -------------------------------------- |
x , y , or z | single continuous variable |
xDate | date variable, defined as an R Date type |
xCat , yCat , or zCat | categorical variable, typically defined as an R factor |
xUnique or yUnique | categorical variable with all values unique |
X or Y | vector of continuous variables |
Xcat | vector of categorical variables |
-------------------------- | -------------------------------------- |
Two variables
Plot(x,y)
: traditional scatterplot of two continuous variables
Plot(xDate,y)
: a Date variable and a continuous variable yield a time-series plot
Plot(xCat,yCat)
: to solve the over-plot problem, plot a scatterplot of two categorical variables as a bubble scatterplot, the size of each bubble based on the corresponding joint frequency
Plot(xCat,y)
or Plot(x,yCat)
: one variable categorical and the other variable continuous, yields a scatterplot with means at each level of the categorical variable
Plot(xCat,y, stat="mean")
or Plot(x,yCat, stat="mean")
: one variable categorical and the other variable continuous, yields a Cleveland dot plot with a specified statistic such as the "mean"
of the continuous variable at each level of the categorical variable
Plot(xUnique,y)
or Plot(x,yUnique)
: one categorical with unique (ID) values and the other variable continuous, yields a Cleveland dot (lollipop) plot, where the unique values can be variable row.names
One variable
Plot(x)
: one continuous variable generates either a violin/box/scatterplot (VBS plot), named here, or a run chart with run=TRUE
, or x
can be an R time series object created with ts()
for a time series visualization
Plot(xCat)
: one categorical variable yields a 1-dimensional bubble plot to solve the over-plot problem for a more compact replacement of the traditional bar chart
Three, four, or more variables
Plot(x,y, size=z)
: x
and y
continuous yields a bubble plot of two continuous variables with z
setting the size of the corresponding plotted point, i.e., bubble
Plot(x,y, by=zCat)
: plots a different scatterplot of x
and y
for each level of zCat
on the same panel
Plot(x,y, by1=zCat)
: plots a different scatterplot of x
and y
for each level of zCat
on separate panels, i.e., Trellis or facet plots
Plot(x,y, by1=z1Cat, by2=z2Cat)
: plots a different scatterplot of x
and y
for each combination of levels of z1Cat
and z2Cat
on separate panels, i.e., Trellis or facet plots
Plot(X,y)
or Plot(x,Y)
: one vector variable of several continuous variables, paired with another single continuous variable, yields multiple scatterplots on the same graph
Plot(Y,xUnique)
: one categorical with unique (ID) values, such as row.names
and the other variable a vector of continuous variables yields a Cleveland dot plot of all the continuous variables, usually two
One vector
Plot(X)
: one vector of variables, with no y
-variable, results in a scatterplot matrix of the variables
Plot(Xcat)
: one vector of categorical x
-variables, with no y
-variable, generalizes to a matrix of 1-dimensional bubble plots, here called the bubble plot frequency matrix, to replace a series of bar charts
Usage
Plot(
# -------------------------------------
# Data from which to construct the plot
x, y=NULL, data=d, filter=NULL,
# -------------------------------
# Enhancements and customizations
# -------------------------------
# ------------------------------------------------------------------
# Analogy of physical Marks on paper that create the bars and labels
theme=getOption("theme"),
fill=NULL, color=NULL,
transparency=getOption("trans_pt_fill"),
enhance=FALSE,
size=NULL, size_cut=NULL, shape="circle", means=TRUE,
segments=FALSE, segments_y=FALSE, segments_x=FALSE,
# ----------------------
# Sort and jitter points
sort_yx=c("0", "-", "+"),
jitter_x=0, jitter_y=0,
# ----------------
# Outlier analysis
ID="row.name", ID_size=0.60,
MD_cut=0, out_cut=0, out_shape="circle", out_size=1,
# -------------------------------------------------
# Fit line, confidence interval, confidence ellipse
fit=c("off","loess", "lm", "ls", "null", "exp", "quad",
"power", "log"),
fit_power=1, fit_se=0.95,
fit_color=getOption("fit_color"),
plot_errors=FALSE, ellipse=0,
# ----------------------------------------------------------
# Types of plots beyond default scatterplots (x, or x and y)
# ----------------------------------------------------------
# --------------------------------------------------
# Stratification: Same panel or Trellis (facet) plot [x, or x and y]
by=NULL, by1=NULL, by2=NULL,
n_row=NULL, n_col=NULL, aspect="fill",
# -----------------------------------------------------
# Time series, plot x values sequentially [xDate, y or Y]
time_unit=NULL, time_agg=c("sum","mean"), stack=FALSE, lwd=1.5,
area_fill="transparent", area_split=0,
# Run chart
run=FALSE, show_runs=FALSE,
center_line=c("off", "mean", "median", "zero"),
# -----------------------------------
# Lollipop chart from aggregated data [x and y]
stat=c("mean", "sum", "sd", "deviation", "min", "median", "max"),
stat_x=c("count", "proportion", "%"),
# ----------------------------------
# Integrated violin/box/scatter plot [x]
vbs_plot="vbs", vbs_size=0.9, bw=NULL, bw_iter=10,
violin_fill=getOption("violin_fill"),
box_fill=getOption("box_fill"),
vbs_pt_fill="black",
vbs_mean=FALSE, fences=FALSE,
k=1.5, box_adj=FALSE, a=-4, b=3,
# -----------
# Bubble plot [xCat, or xCat and yCat]
radius=NULL, power=0.5, low_fill=NULL, hi_fill=NULL,
# --------------------------------------
# Large data sets, smoothing and binning [x and y]
smooth=FALSE, smooth_points=100, smooth_size=1,
smooth_exp=0.25, smooth_bins=128,
n_bins=1,
# ------------------------------------------------------
# Bins for frequency polygon or text output of VBS plots
bin=FALSE, bin_start=NULL, bin_width=NULL, bin_end=NULL,
breaks="Sturges", cumulate=FALSE,
# -------------
# Miscellaneous
# -------------
# ------------------------------------------------------------------
# Labels for axes, values, and legend if x and by variables, margins
xlab=NULL, ylab=NULL, main=NULL, sub=NULL,
lab_adjust=c(0,0), margin_adjust=c(0,0,0,0),
rotate_x=getOption("rotate_x"), rotate_y=getOption("rotate_y"),
offset=getOption("offset"),
xy_ticks=TRUE, origin_x=NULL,
scale_x=NULL, scale_y=NULL,
pad_x=c(0,0), pad_y=c(0,0),
legend_title=NULL,
# ----------------------------------------------------
# Add one or more objects, text, or geometric figures
add=NULL, x1=NULL, y1=NULL, x2=NULL, y2=NULL,
# ------------------------------------------------------------------
# Output: turn off, chart to PDF file, decimal digits, markdown file
quiet=getOption("quiet"), do_plot=TRUE,
pdf_file=NULL, width=6.5, height=6,
digits_d=NULL,
# -------------------------------------------------------------
# Deprecated, removed in future versions, use R factors instead
n_cat=getOption("n_cat"), value_labels=NULL, rows=NULL,
# -----
# Other
eval_df=NULL, fun_call=NULL, ...)
ScatterPlot(...)
sp(...)
BoxPlot(...)
bx(...)
ViolinPlot(...)
vp(...)
Arguments
x |
By itself, or with |
y |
An optional second primary variable. Variable with values to be mapped to coordinates of points in the plot on the vertical axis. Can be continuous or categorical. Can be in a data frame or defined in the global environment. |
data |
Optional data frame that contains one or both of |
filter |
A logical expression that specifies a subset of rows of the data frame to analyze. |
theme |
Color theme for this analysis. Make persistent across analyses
with |
fill |
Either fill color of the points or the area under a line chart.
Can also set with the lessR function |
color |
Border color of the points or line_color for line plot.
Can be a vector to customize the color for each point or a color
range such as "blues" (see |
transparency |
Transparency factor of the fill color of each point.
Default is
|
enhance |
For a two-variable scatterplot, if |
size |
When set to a constant, the scaling factor for standard points
(not bubbles) or a line, with default of 1.0 for points and 2.0 for a line.
Set to 0 to not plot the points or lines. If |
size_cut |
If |
shape |
The plot character(s). The default value is |
means |
If the one variable is categorical, expressed as a factor, and
the other variable continuous, then if |
segments |
Designed for interaction plots of means, connects each pair of
successive points with a line segment. Pass a data frame of the means,
such as from |
segments_y |
For one |
segments_x |
Draw a line segment from the |
sort_yx |
Sort the values of |
jitter_x |
Randomly perturbs the plotted points of a scatterplot
horizontally within the limits of the explicitly specified value, or
set to |
jitter_y |
Same as |
ID |
Name of variable to provide the labels for the selected
plotted points for outlier identification, row names of data frame
by default. To label all
the points use the |
ID_size |
Size of the plotted labels.
Modify text color of the labels with the |
MD_cut |
Mahalanobis distance cutoff to define an outlier in a 2-variable scatterplot. |
out_cut |
Count or proportion of plotted points to label, in order of their distance from the scatterplot center (means), counting down from the more extreme point. For two-variable plots, assess distance from the center with Mahalanobis distance. For VBS plots of a single continuous variable, refers to outliers on each side of the plot. |
out_shape |
Shape of outlier points in a 2-variable scatterplot
or a VBS plot.
Modify fill color from the current |
out_size |
Size of outlier points in a 2-variable scatterplot or VBS plot. |
fit |
The best fit line. Default value is |
fit_power |
Power that describes response Y as a power function of the
predictor variable X, required for |
fit_se |
Confidence level for the error band displayed around the
line of best fit. On by default at 0.95 if a fit line is specified,
but turned off if |
fit_color |
Color of the fit line. |
plot_errors |
Plot the line segment that joins each point to the regression line, "loess" or "lm", illustrating the size of the residuals. |
ellipse |
Confidence level of a data ellipse for a scatterplot
of only a single
|
by |
A categorical variable to provide a scatterplot for
each level of the numeric primary variables |
by1 |
A categorical variable called a conditioning variable that
activates Trellis graphics, provided by Deepayan Sarkar's (2007) lattice
package, to provide
a separate panel of numeric primary variables |
by2 |
A second conditioning variable to generate Trellis
plots jointly conditioned on both the |
n_row |
Optional specification for the number of rows
in the layout of a multi-panel display with Trellis graphics. Specify
|
n_col |
Optional specification for the number of columns in the
layout of a multi-panel display with
Trellis graphics. Specify |
aspect |
Lattice parameter for the aspect ratio of the panels,
defined as height divided by width.
The default value is |
time_unit |
Specify the time unit from which to plot a time series.
Aggregation according to the time unit will occur as needed, such as
a daily time series aggregated to |
time_agg |
Function by which to aggregate according to |
stack |
If |
lwd |
Width of the line segments. Set to zero to remove the line segments. |
area_fill |
Specifies the area under the line segments, if present.
If |
area_split |
[Applies only to a Trellis plot activated with parameter
|
run |
If set to |
show_runs |
If |
center_line |
Plots a dashed line through the middle of a run chart.
Provides a center line for the |
stat |
Transform data for categorical variable |
stat_x |
If no |
vbs_plot |
A character string that specifies the components of the
integrated Violin-Box-Scatterplot (VBS) of a continuous variable.
A |
vbs_size |
Width of the violin plot to the plot area. Make the violin (and also the accompanying box plot) larger or smaller by making the plot area and/or this value larger or smaller. |
bw |
Bandwidth for the smoothness of the violin plot. Higher values yield smoother plots. Default is to calculate a bandwidth that provides a relatively smooth density plot. |
bw_iter |
Number of iterations used to modify default R bandwidth to further smooth the obtained density estimate. When set, also displays the iterations and corresponding results. |
violin_fill |
Fill color for a violin plot. |
box_fill |
Fill color for a box plot. |
vbs_pt_fill |
Points in a VBS scatterplot are black by default because
the background is the violin, which is based on the current theme
color. To use the values for |
vbs_mean |
Show the mean on the box plot with a strip the color
of |
fences |
If |
k |
IQR multiplier for the basis of calculating the distance of the whiskers of the box plot from the box. Default is Tukey's setting of 1.5. |
box_adj |
Adjust the box and whiskers, and thus outlier detection, for skewness using the medcouple statistic as the robust measure of skewness according to Hubert and Vandervieren (2008). |
a , b |
Scaling factors for the adjusted box plot to set the length
of the whiskers. If explicitly set, activates |
radius |
Scaling factor of the bubbles in a bubble plot, which
sets the radius of the largest displayed bubble in inches. To
activate, either set the value of |
power |
Relative size of the scaling of the bubbles to each other. Default value of 0.5 scales the bubbles so that the area of each bubble is the value of the corresponding sizing variable. Value of 1 scales so the radius of the bubble is the value of the sizing variable, increasing the discrepancy of size between the variables. |
low_fill |
For a categorical variable and the resulting bubble plot, or a matrix of these plots, sets a color gradient of the fill color beginning with this color. |
hi_fill |
For a categorical variable and the resulting bubble plot, or a matrix of these plots, sets a color gradient of the fill color ending with this color. |
smooth |
Smoothed density plot for two numerical variables. |
smooth_points |
Number of points superimposed on the density plot in the areas of the lowest density to help identify outliers, which controls how dark are the smoothed points. |
smooth_size |
Size of points superimposed on the density plot. |
smooth_exp |
Exponent of the function that maps the density scale to the color scale. Smaller than default of 0.25 yields darker plots. |
smooth_bins |
Number of bins in both directions for the density estimation. |
n_bins |
Specify the number of bins to bin a single numeric
|
bin |
If |
bin_start |
Optional specified starting value of the bins for a
frequency polygon or for the text output of a
Violin-Box-Scatter (VBS) Plot. Also, sets |
bin_width |
Optional specified bin width value. Also, sets
|
bin_end |
Optional specified value that is within the last bin, so the actual endpoint of the last bin may be larger than the specified value. |
breaks |
The method for calculating the bins, or an explicit
specification of the bins, such as with the standard R
|
cumulate |
Specify a cumulative frequency polygon. |
xlab , ylab |
Axis label for |
main |
Label for the title of the graph. If the corresponding variable labels exist, then the title is set by default from the corresponding variable labels. |
sub |
Sub-title of graph, below |
lab_adjust |
Two-element vector – x-axis label, y-axis label – adjusts the position of the axis labels in approximate inches. + values move the labels away from plot edge. Not applicable to Trellis graphics. |
margin_adjust |
Four-element vector – top, right, bottom and left – adjusts the margins of the plotted figure in approximate inches. + values move the corresponding margin away from plot edge. Not applicable to Trellis graphics. |
rotate_x |
Rotation in degrees of the value labels on
the |
rotate_y |
Degrees that the axis values for the value labels on
the |
offset |
The amount of spacing between the axis values and the axis. Default is 0.5. Larger values such as 1.0 are used to create space for the label when longer axis value names are rotated. |
xy_ticks |
Flag that indicates if tick marks and associated value
labels on the axes are to be displayed. To rotate the axis values, use
|
origin_x |
Origin of |
scale_x |
If specified, a vector of three values that define the x-axis with numerical values: starting value, ending value, and number of intervals. |
scale_y |
If specified, a vector of three values that define the y-axis with numerical values: starting value, ending value, and number of intervals. |
pad_x |
Proportion of padding added to left and right sides of the
|
pad_y |
Proportion of padding added to bottom and top sides of the
|
legend_title |
Title of the legend for a multiple-variable |
add |
Overlay one or more objects, text or geometric figures,
on the plot.
Possible values are any text to be written, the first argument, which is
|
x1 |
First x-coordinate to be considered for each object, can be
|
y1 |
First y-coordinate to be considered for each object, can be
|
x2 |
Second x-coordinate to be considered for each object, can be
|
y2 |
Second y-coordinate to be considered for each object, can be
|
quiet |
If set to |
do_plot |
If |
pdf_file |
Indicate to direct pdf graphics to the specified name of the pdf file. |
width |
Width of the plot window in inches, defaults to 5 except in RStudio to maintain an approximate square plotting area. |
height |
Height of the plot window in inches, defaults to 4.5 except for 1-D scatterplots and when in RStudio. |
digits_d |
Number of significant digits for each of the displayed summary statistics. |
n_cat |
Number of categories, specifies the largest number of unique, equally spaced integer values of a variable for which the variable will be analyzed as categorical instead of continuous. Default is 0. Use to specify that such variables are to be analyzed as categorical, a kind of informal R factor. [deprecated]: Best to convert a categorical integer variable to a factor. |
value_labels |
For factors, default is the factor labels, and for
character variables, default is the character values.
Or, provide labels for the |
rows |
Deprecated old parameter name that is now called |
eval_df |
Determines if to check for existing data frame and
specified variables. By default is |
fun_call |
Function call. Used with |
... |
Other parameter values for non-Trellis graphics as defined by and
processed by standard R functions |
Details
VARIABLES and TRELLIS PLOTS
There is at least one primary variable, x
, which defines the coordinate system for plotting in terms of the x
-axis, the horizontal axis. Plots may also specify a second primary variable, y
, which defines the y
-axis of the coordinate system. One of these primary variables may be a vector. The simplest plot is from the specification of only one or two primary variables, each as a single variable, which generates a single scatterplot of either one or two variables, necessarily on a single plot, called a panel, defined by a single x
-axis and usually a single y
-axis.
For numeric primary variables, a single panel may also contain multiple plots of two types. Form the first type from subsets of observations (rows of data) based on values of a categorical variable. Specify this plot with the by
parameter, which identifies the grouping variable to generate a scatterplot of the primary variables for each of its levels. The points for each group are plotted with a different shape and/or color. By default, the colors vary, though to maintain the color scheme, if there are only two levels of the grouping variable, the points for one level are filled with the current theme color and the points for the second level are plotted with transparent interiors.
Or, obtain multiple scatterplots on the same panel with multiple numeric x
-variables, or multiple y
-variables. To obtain this graph, specify one of the primary variables as a vector of multiple variables.
Trellis graphics (facets), from Deepayan Sarkar's (2009) lattice
package, may be implemented in which multiple panels for one numeric x
-variable and one numeric y
-variable are displayed according to the levels of one or two categorical variables, called conditioning variables. A variable specified with by1
is a conditioning variable that results in a Trellis plot, the scatterplot of x
and y
produced at each level of the by1
variable. The inclusion of a second conditioning variable, by2
, results in a separate scatterplot panel for each combination of cross-classified values of both by1
and by2
. A grouping variable according to by
may also be specified, which is then applied to each panel. If there are 1000 or fewer unique values of x
, an analysis of the maximum number of repetitions for each value of by1
is provided.
Control the panel dimensions and the overall size of the Trellis plot with the following parameters: width
and height
for the physical dimensions of the plot window, n_row
and n_col
for the number of rows and columns of panels, and aspect
for the ratio of the height to the width of each panel. The plot window is the standard graphics window that displays on the screen, or it can be specified as a pdf file with the pdf_file
parameter.
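The difference between grouping on a single panel and conditioning across panels can be sketched directly with the lattice package, on which the Trellis plots are built. This is a minimal illustration with simulated data, not lessR's internal code. The data frame df and its variables are hypothetical, and the mapping of n_row and n_col onto lattice's layout argument is an assumption.

```r
library(lattice)

set.seed(1)
df <- data.frame(
  x = rnorm(90),
  g = factor(rep(c("A", "B", "C"), each = 30))
)
df$y <- 0.5 * df$x + rnorm(90)

# Grouping (analogous to by=): one panel, one symbol/color per level of g
p_group <- xyplot(y ~ x, groups = g, data = df, auto.key = TRUE)

# Conditioning (analogous to by1=): one panel per level of g.
# layout = c(columns, rows); aspect = "fill" lets each panel fill its space.
p_trellis <- xyplot(y ~ x | g, data = df, layout = c(3, 1), aspect = "fill")

# Each call returns a trellis object; print(p_trellis) draws it on the
# current graphics device.
```

A second conditioning variable would be added to the formula as y ~ x | g1 * g2, which yields one panel per combination of levels.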
CATEGORICAL VARIABLES
Conceptually, there are continuous variables and categorical variables. Categorical variables have relatively few unique data values, which can be non-numeric or numeric, such as responses to a five-point Likert scale from Strongly Disagree to Strongly Agree, with responses coded 1 to 5. The three by
–variables – by1
, by2
and by
– only apply to graphs created with numeric x
and/or y
variables, continuous or categorical.
The standard and most general way to define a categorical variable is as an R factor, such as created with the lessR factors
function. lessR
provides the option to define an integer variable with equally spaced values as categorical based on the value of n_cat
, which can be set locally or globally with the style
function. For example, for a variable with data values from a 5-point Likert scale, a value of n_cat
of 5 will define the variable as categorical. The default value is 0. To explicitly analyze the values as categorical, set n_cat
to a value larger than 0, at least the size of the number of unique integer values. Can also annotate a graph of the values of an integer categorical variable with value_labels
option.
A scatterplot of Likert type data is problematic because there are so few possibilities for points in the scatterplot. For example, for a scatterplot of two five-point Likert response variables, there are only 25 possible paired values to plot, so most of the plotted points overlap with others. In this situation, that is, when a single variable or two variables with Likert response scales are specified, a bubble plot is automatically provided, with the size of each point relative to the joint frequency of the paired data values. To request a sunflower plot in lieu of the bubble plot, set the shape
to "sunflower"
.
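The over-plotting problem and its bubble-plot remedy can be illustrated in base R. The sketch below uses simulated responses (item1 and item2 are hypothetical names) and the base symbols function rather than lessR itself: two five-point items have at most 5 x 5 = 25 distinct pairs, so the joint frequency table, not the raw points, carries the information.

```r
set.seed(2)
likert <- 1:5
item1 <- factor(sample(likert, 200, replace = TRUE), levels = likert)
item2 <- factor(sample(likert, 200, replace = TRUE), levels = likert)

# Joint frequencies: one cell per possible pair of responses
freq <- table(item1, item2)
length(freq)   # number of cells: 200 observations collapse onto 25 positions

# One bubble per cell, with area proportional to the joint frequency
cells <- as.data.frame(freq)
pdf(NULL)      # discard graphics output; remove this line to view the plot
symbols(as.integer(cells$item1), as.integer(cells$item2),
        circles = sqrt(cells$Freq), inches = 0.25,
        xlab = "item1", ylab = "item2")
invisible(dev.off())
```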
DATA
The default input data frame is d
. Specify another name with the data
option. Regardless of its name, the data frame need not be attached to reference its variables directly by name, that is, there is no need to invoke the d$name
notation. The referenced variables can be in the data frame and/or the user's workspace, the global environment.
The data values themselves can be plotted, or for a single variable, counts or proportions can be plotted on the y
-axis. For a categorical x
-variable paired with a continuous variable, means and other statistics can be plotted at each level of the x
-variable. If x
is continuous, it is binned first, with the standard Histogram
binning parameters available, such as bin_width
, to override default values. The stat
parameter sets the values to plot, with data
the default. By default, the connecting line segments are provided, so a frequency polygon results. Turn off the lines by setting lwd=0
.
The filter
parameter subsets rows (cases) of the input data frame according to a logical expression. Use the standard R operators for logical statements as described in Logic
such as &
for and, |
for or and !
for not, and use the standard R relational operators as described in Comparison
such as ==
for logical equality, !=
for not equals, and >
for greater than. See the Examples.
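As a quick illustration of these operators, the following base R sketch builds the kind of logical expression that selects rows. The data frame and its variable names are hypothetical, and the subsetting is done with ordinary indexing rather than through Plot.

```r
d <- data.frame(
  Gender = c("M", "W", "W", "M", "W", "M"),
  Dept   = c("SALE", "SALE", "ADMN", "ADMN", "SALE", "FINC"),
  Salary = c(52000, 61000, 47000, 58000, 44000, 70000)
)

# and (&), equality (==), greater than (>)
keep1 <- with(d, Gender == "W" & Salary > 45000)

# or (|), not (!), not equals (!=)
keep2 <- with(d, Dept != "SALE" | !(Salary > 50000))

d[keep1, ]   # rows for which the first expression is TRUE
d[keep2, ]   # rows for which the second expression is TRUE
```

Each expression evaluates to one TRUE or FALSE per row, and only the TRUE rows enter the analysis.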
VALUE LABELS
[DEPRECATED. Use factor()
instead.] The value labels for each axis can be over-ridden from their values in the data to user supplied values with the value_labels
option. This option is particularly useful for Likert-style data coded as integers. Then, for example, a 0 in the data can be mapped into a "Strongly Disagree" on the plot. These value labels apply to integer categorical variables, and also to factor variables. To enhance the readability of the labels on the graph, any blanks in a value label translate into a new line in the resulting plot. Blanks are also transformed as such for the labels of factor variables.
However, the lessR function factors
allows for the easy creation of factors, one variable or a vector of variables, in a single statement, and is generally recommended as the method for providing value labels for the variables.
VARIABLE LABELS
Although standard R does not provide for variable labels, lessR
can store the labels in the data frame with the data, obtained from the Read
function or VariableLabels
. If variable labels exist, then the corresponding variable label is by default listed as the label for the corresponding axis and on the text output.
ONE VARIABLE PLOT
The one variable plot of one continuous variable generates either a violin/box/scatterplot (VBS plot), or a run chart with run=TRUE
, or x
can be an R time series variable for a time series chart. For the box plot,
for gray scale output, potential outliers are plotted with squares and outliers are plotted with diamonds, otherwise shades of red are used to highlight outliers. The default definition of outliers is based on the standard boxplot rule of values more than 1.5 IQRs from the box. The definition of outliers may be adjusted (Hubert and Vandervieren, 2008), such that the whiskers are computed from the medcouple index of skewness (Brys, Hubert, & Struyf, 2004).
The plot can also be obtained as a bubble plot of frequencies for a categorical variable.
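The standard boxplot rule mentioned above can be sketched in a few lines of base R. This is an illustration of the rule with the multiplier k, not lessR's own code. It assumes the box is defined by Tukey's hinges as returned by fivenum, which is what base R's boxplot.stats uses.

```r
x <- c(2, 3, 4, 5, 5, 6, 7, 8, 9, 30)
k <- 1.5                                # IQR multiplier, Tukey's default

hinges <- fivenum(x)[c(2, 4)]           # lower and upper hinge (box edges)
iqr <- diff(hinges)                     # spread of the box
fences <- c(lower = hinges[1] - k * iqr,
            upper = hinges[2] + k * iqr)

# Values beyond the fences are flagged as outliers
outliers <- x[x < fences["lower"] | x > fences["upper"]]

# Cross-check against base R's implementation of the same rule
identical(outliers, boxplot.stats(x, coef = k)$out)
```

Raising k widens the fences and flags fewer points. The skewness adjustment shifts the two fences asymmetrically and is not shown here.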
TWO VARIABLE PLOT
When two variables are specified to plot, by default if the values of the first variable, x
, are unsorted, or if there are unequal intervals between adjacent values, or if there is missing data for either variable, a scatterplot is produced from a call to the standard R plot
function. By default, sorted values with equal intervals between adjacent values of the first of the two specified variables yields a function plot if there is no missing data for either variable, that is, a call to the standard R plot
function with type="l"
, which connects each adjacent pair of points with a line segment.
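The decision rule in this paragraph can be written as a small predicate. The helper below is hypothetical, not a lessR function, and it assumes that "sorted" means ascending order and that equal intervals are judged within a small numerical tolerance.

```r
# Hypothetical helper: TRUE when the documented conditions for a
# function (line) plot hold, FALSE when a scatterplot is the default
is_function_plot <- function(x, y, tol = 1e-8) {
  if (anyNA(x) || anyNA(y)) return(FALSE)   # missing data: scatterplot
  if (is.unsorted(x)) return(FALSE)         # unsorted x: scatterplot
  gaps <- diff(x)
  all(abs(gaps - gaps[1]) <= tol)           # equal intervals: line plot
}

x <- seq(0, 2, by = 0.25)
y <- x^2
plot_type <- if (is_function_plot(x, y)) "l" else "p"

pdf(NULL)   # discard graphics output; remove this line to view the plot
plot(x, y, type = plot_type)
invisible(dev.off())
```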
Specifying multiple, continuous x
-variables against a single y variable, or vice versa, results in multiple plots on the same graph. The color of the points of the second variable is the same as that of the first variable, but with a transparent fill. For more than two x
-variables, multiple colors are displayed, one for each x
-variable.
BUBBLE PLOT FREQUENCY MATRIX (BPFM)
Multiple categorical variables for x
may be specified in the absence of a y
variable. (A categorical variable is either a factor
variable or an integer variable with n_cat
set at least at the number of unique values.) A bubble plot results that illustrates the frequency of each response for each of the variables in a common figure in which the x
-axis contains all of the unique labels for all of the variables plotted. Each line of information, the bubbles and counts for a single variable, replaces the standard bar chart in a more compact display. Usually the most meaningful when each variable in the matrix has the same response categories, that is, levels, such as for a set of shared Likert scales. The BPFM is considerably condensed presentation of frequencies for a set of variables than are the corresponding bar charts.
SCATTERPLOT MATRIX
A single vector of continuous variables specified as x
, with no y
-variable, generates a scatterplot matrix of the specified variables. A continuous variable is defined as a numeric variable with more than n_cat unique responses. To force an item with a small number of unique responses, such as from a 5-pt Likert scale, to be treated as continuous, set n_cat
to a number lower than 5, such as n_cat=0
in the function call.
The scatterplot matrix is displayed according to the current color theme. Specific colors such as fill
, color
, etc. can also be provided. The upper triangle shows the correlation coefficient, and the lower triangle each corresponding scatterplot, with, by default, the non-linear loess best fit line. The
fit option can be used to provide the linear least squares line instead, along with the corresponding fit_color
for the color of the fit line.
SIZE VARIABLE
A variable specified with size=
is a numerical variable that activates a bubble plot in which the size of each bubble is determined by the corresponding value of size
, which can be a variable or a constant.
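How the power parameter shapes the relative bubble sizes can be shown with a small calculation. The function below is a hypothetical sketch of such a scaling, not lessR's implementation. It assumes that the largest bubble receives the full radius and that the other bubbles shrink according to the sizing variable raised to power. The default radius of 0.25 inches used here is an arbitrary choice for illustration.

```r
# Hypothetical scaling: largest bubble gets `radius`, others scale by z^power
bubble_radius <- function(z, radius = 0.25, power = 0.5) {
  radius * (z / max(z))^power
}

z <- c(1, 4, 9, 16)

r_area   <- bubble_radius(z, power = 0.5)  # area proportional to z
r_linear <- bubble_radius(z, power = 1)    # radius proportional to z

# Relative areas under each setting
rbind(z            = z / max(z),
      area_power_0.5 = r_area^2 / max(r_area^2),
      area_power_1   = r_linear^2 / max(r_linear^2))
```

With power of 0.5 the relative areas match the relative values of z, whereas with power of 1 the areas grow with the square of z, which exaggerates the differences between bubbles.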
To explicitly vary the shapes, use shape
and a list of shape values in the standard R form with the c
function to combine a list of values, one specified shape for each group, as shown in the examples. To explicitly vary the colors, use fill
, such as with R standard color names. If fill
is specified without shape
, then colors are varied, but not shapes. To vary both shapes and colors, specify values for both options, always with one shape or color specified for each level of the by
variable.
Shapes beyond the standard list of named shapes, such as "circle"
, are also available as single characters. Any single letter, uppercase or lowercase, any single digit, and the characters "+"
, "*"
and "#"
are available, as illustrated in the examples. In the use of shape
, either use standard named shapes, or individual characters, but not both in a single specification.
SCATTERPLOT ELLIPSE
For a scatterplot of two numeric variables, the ellipse=TRUE
option draws the .95 data ellipse as computed by the ellipse
function, written by Duncan Murdoch and E. D. Chow, from the ellipse
package. The axes are automatically lengthened to provide space for the entire ellipse that extends beyond the maximum and minimum data values. The specific level of the ellipse can be specified with a numerical value in the form of a proportion. Multiple numerical values of ellipse
may also be specified to obtain multiple ellipses.
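The geometry of a data ellipse can be reproduced in base R without the ellipse package. The sketch below uses simulated data and assumes the usual construction: the ellipse at a given level is the set of points whose squared Mahalanobis distance from the means equals the chi-square quantile with 2 degrees of freedom. Whether lessR applies exactly this cutoff is an assumption.

```r
set.seed(3)
n <- 500
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)
xy <- cbind(x, y)

center <- colMeans(xy)
S <- cov(xy)
md2 <- mahalanobis(xy, center, S)     # squared Mahalanobis distances

level <- 0.95
cutoff <- qchisq(level, df = 2)       # squared "radius" of the ellipse
inside <- md2 <= cutoff
mean(inside)                          # share of points inside the ellipse

# Boundary points: center + sqrt(cutoff) * t(chol(S)) %*% unit circle
theta <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(theta), sin(theta))
boundary <- t(center + sqrt(cutoff) * t(chol(S)) %*% circle)

pdf(NULL)   # discard graphics output; remove this line to view the plot
plot(rbind(xy, boundary), type = "n", xlab = "x", ylab = "y")
points(xy)
lines(boundary)
invisible(dev.off())
```

Setting the plot limits from both the data and the boundary mirrors the lengthened axes described above, since the ellipse usually extends beyond the extreme data values.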
BOXPLOTS
For a single variable the preferred plot is the integrated violin/box/scatter plot or VBS plot. Only the violin or box plot can be obtained with the corresponding aliases ViolinPlot
and BoxPlot
, or by setting vbs_plot
to "v"
or "b"
. To view a box plot of a continuous variable (Y) across the levels of a categorical variable (X), either as part of the full VBS plot, or by itself, there are two possibilities:
1. Plot(Y,X) or BoxPlot(Y, X)
2. Plot(Y, by1=X) or BoxPlot(Y, by1=X)
Both styles produce the same information. What differs is the color scheme.
The first possibility places the multiple box plots on a single pane and also, for the default color scheme "colors"
, displays the sequence of box plots with the default qualitative color palette from the lessR function getColors
.
All colors are displayed at the same level of gray-scale saturation and brightness to avoid perceptual bias. BarChart
and PieChart
use the same default colors as well.
The second possibility with by1
produces each box plot on a separate panel, that is, a Trellis chart. These box plots are displayed with a single hue, the first color, blue, in the default qualitative sequence.
TIME CHARTS
Specifying one or more x
-variables with no y
-variables, and run=TRUE
plots the x
-variables in a run chart. The values of the specified x
-variable are plotted on the y
-axis, with Index on the x
-axis. Index is the ordinal position of each data value, from 1 to the number of values.
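The structure of a run chart, values plotted against their ordinal position, can be sketched in base R. This is only an illustration of the Index concept with simulated data, not the code that Plot itself runs.

```r
# Base R sketch of a run chart: each value plotted against its ordinal position
set.seed(2)
values <- round(rnorm(12, mean = 50, sd = 5), 1)   # simulated data
index <- seq_along(values)                         # 1, 2, ..., number of values
plot(index, values, type = "b", xlab = "Index", ylab = "values")
```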
If the specified x
-variable is of type Date
, or is an R time series, a time series plot is generated for each specified variable. For a formal R time series, univariate or multivariate, specify it as the x
-variable. Alternatively, specify the x
-variable of type Date
, and then specify the y
-variable as one or more time series to plot. The y
-variable can be formatted as tidy data with all the values in a single column, or as wide-formatted data with the time-series variables in separate columns.
The parameter time_unit
aggregates the date variable according to its specified value. The aggregation is based on two functions from the xts
package, endpoints()
and period.apply()
. For example, a variable with daily values can be plotted as aggregated quarterly values. From the endpoints()
documentation, valid values include "us"
(microseconds), "microseconds"
, "ms"
(milliseconds), "milliseconds"
, "secs"
(seconds), "seconds"
, "mins"
(minutes), "minutes"
, "hours"
, "days"
, "weeks"
, "months"
, "quarters"
, and "years"
.
Specify the function by which to aggregate with the parameter time_agg
. The default is "sum"
.
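To show the kind of aggregation that endpoints() and period.apply() perform, here is a minimal sketch that calls the two xts functions directly on simulated daily data. It assumes the xts package is installed. How Plot invokes these functions internally is not shown here, so read this as an illustration of the underlying idea only.

```r
library(xts)   # assumes the xts package is installed

# 90 daily values starting 2023-01-01, each equal to 1 (illustration only)
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = 90)
daily <- xts(rep(1, 90), order.by = dates)

# endpoints() locates the position of the last observation in each period
ep <- endpoints(daily, on = "months")

# period.apply() applies the function to the values between successive endpoints
monthly <- period.apply(daily, INDEX = ep, FUN = sum)
print(monthly)
```

Because every daily value is 1, each monthly sum equals the number of days observed in that month, which makes the aggregation easy to check by eye.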
2-D KERNEL DENSITY
With smooth=TRUE
, the R function smoothScatter
is invoked according to the current color theme. This display is useful for very large data sets. The smooth_points
parameter specifies the number of points to plot from the regions of lowest density. The smooth_bins
parameter specifies the number of bins in both directions for the density estimation. The smooth_exp
parameter specifies the exponent in the function that maps the density scale to the color scale to allow customization of the intensity of the plotted gradient colors. Higher values result in less color saturation, de-emphasizing points from regions of lesser density. These parameters are respectively passed directly to the smoothScatter
nrpoints
, nbin
and transformation
parameters. Grid lines are turned off,
by default, but can be displayed by setting the grid_color
parameter.
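The correspondence described above can be seen by calling the base R smoothScatter function directly. The sketch below uses simulated data. The exponent value and the name to_color_scale are hypothetical, and the assumption that the transformation has the form of a simple power function is ours, not something stated in this documentation.

```r
set.seed(3)
n <- 5000
x <- rnorm(n)                        # simulated data, illustration only
y <- 0.6 * x + rnorm(n, sd = 0.8)

smooth_exp <- 0.25                   # hypothetical exponent value
# assumed form of the density-to-color mapping: a power function
to_color_scale <- function(d) d^smooth_exp

smoothScatter(x, y,
              nrpoints = 100,        # counterpart of smooth_points
              nbin = 128,            # counterpart of smooth_bins
              transformation = to_color_scale)   # built from smooth_exp
```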
COLORS
A color theme for all the colors can be chosen for a specific plot with the colors
option with the lessR
function style
. The default color theme is "lightbronze"
. A gray scale is available with "gray"
, and other themes are available as explained in style
, such as "sienna"
and "darkred"
. Use the option style(sub_theme="black")
for a black background and partial transparency of plotted colors.
Colors can also be changed for individual aspects of a scatterplot with the style
function. To provide a warmer tone by slightly enhancing red, try a background color such as panel_fill="snow"
. Obtain a very light gray with panel_fill="gray99"
. To darken the background gray, try panel_fill="gray97"
or lower numbers. See the lessR
function showColors
, which provides an example of all available named R colors with their RGB values.
For the color options, such as violin_color
, the value of "off"
is the same as "transparent"
.
ANNOTATIONS
Use the add
and related parameters to annotate the plot with text and/or geometric figures. Each object is placed according to one to four corresponding coordinates, the required coordinates to plot that object, as shown in the following table. x
-coordinates may have the value of "mean_x"
and y
-coordinates may have the value of "mean_y"
.
Value | Object | Required Coordinates |
----------- | ------------------- | ----------------------- |
"text" | text | x1, y1 |
"point" | point | x1, y1 |
"rect" | rectangle | x1, y1, x2, y2 |
"line" | line segment | x1, y1, x2, y2 |
"arrow" | arrow | x1, y1, x2, y2 |
"v_line" | vertical line | x1 |
"h_line" | horizontal line | y1 |
"means" | horiz, vert lines | |
----------- | ------------------- | ----------------------- |
The value of add
specifies the object. For a single object, enter a single value. Then specify the value of the needed corresponding coordinates, as specified in the above table. For multiple placements of that object, specify vectors of corresponding coordinates. To annotate multiple objects, specify multiple values for add
as a vector. Then list the corresponding coordinates, for up to each of four coordinates, in the order of the objects listed in add
.
You can also specify vectors of different properties, such as add_color
. That is, different objects can have different colors, different transparency levels, etc.
PDF OUTPUT
To obtain pdf output, use the pdf_file
option, perhaps with the optional width
and height
options. These files are written to the default working directory, which can be explicitly specified with the R setwd
function.
ADDITIONAL OPTIONS
Commonly used graphical parameters that are available to the standard R function plot
are also generally available to Plot
, such as:
- cex.main, col.lab, font.sub, etc.
Settings for main- and sub-title and axis annotation, see title and par.
- main
Title of the graph, see title.
- xlim
The limits of the plot on the x-axis, expressed as c(x1,x2), where x1 and x2 are the limits. Note that x1 > x2 is allowed and leads to a reversed axis.
- ylim
The limits of the plot on the y-axis.
ONLY VARIABLES ARE REFERENCED
A referenced variable in a lessR
function can only be a variable name. This referenced variable must exist in either the referenced data frame, such as the default d
, or in the user's workspace, more formally called the global environment. That is, expressions cannot be directly evaluated. For example:
> Plot(rnorm(50), rnorm(50)) # does NOT work
Instead, do the following:
> X <- rnorm(50) # create vector X in user workspace
> Y <- rnorm(50) # create vector Y in user workspace
> Plot(X,Y) # directly reference X and Y
Value
The output can optionally be saved into an R
object, otherwise it simply appears in the console. The output here is just for the outlier analysis of the two-variable scatterplot with continuous variables. The outlier identification must be activated for the analysis, such as from parameter MD_cut
.
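For readers unfamiliar with the Mahalanobis distance, the following base R sketch flags outlying rows of a two-variable data set. The data are simulated and the cutoff value is hypothetical. Note that the base R mahalanobis function returns the squared distance. Whether Plot compares the squared or unsquared distance with MD_cut is not stated here, so applying the cutoff to the squared distance is an assumption of this sketch.

```r
set.seed(4)
Years <- runif(30, 1, 20)                              # simulated data
Salary <- 40000 + 3000 * Years + rnorm(30, sd = 8000)
Salary[1] <- 200000                                    # plant one extreme value

xy <- cbind(Years, Salary)
# mahalanobis() returns the squared distance of each row from the center
d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))
cutoff <- 6              # hypothetical cutoff, applied to the squared distance
outlier_rows <- which(d2 > cutoff)
print(outlier_rows)
```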
READABLE OUTPUT
out_stats
: Correlational analysis.
out_outliers
: Mahalanobis Distance of each outlier.
STATISTICS
outliers
: Row numbers that contain the outliers.
Author(s)
David W. Gerbing (Portland State University; gerbing@pdx.edu)
References
Brys, G., Hubert, M., & Struyf, A. (2004). A robust measure of skewness. Journal of Computational and Graphical Statistics, 13(4), 996-1017.
Murdoch, D., and Chow, E. D. (2013). ellipse
function from the ellipse
package.
Gerbing, D. W. (2014). R Data Analysis without Programming, Chapter 8, NY: Routledge.
Gerbing, D. W. (2020). R Visualizations: Derive Meaning from Data, Chapter 5, NY: CRC Press.
Gerbing, D. W. (2021). Enhancement of the Command-Line Environment for use in the Introductory Statistics Course and Beyond, Journal of Statistics and Data Science Education, 29(3), 251-266, https://www.tandfonline.com/doi/abs/10.1080/26939169.2021.1999871.
Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis 52, 5186-5201.
Sarkar, Deepayan (2008) Lattice: Multivariate Data Visualization with R, Springer. http://lmdvr.r-forge.r-project.org/
See Also
plot
, stripchart
, title
, par
, loess
, Correlation
, style
.
Examples
# read the data
d <- rd("Employee", quiet=TRUE)
d <- d[.(random(0.6)),] # less computationally intensive
dd <- d # keep a copy of the data to restore d later
#---------------------------------------------------
# traditional scatterplot with two numeric variables
#---------------------------------------------------
Plot(Years, Salary, by=Gender, size=2, fit="lm",
fill=c("olivedrab3", "gold1"),
color=c("darkgreen", "gold4"))
# scatterplot with all defaults
Plot(Years, Salary)
# or use abbreviation sp in place of Plot
# or use full expression ScatterPlot in place of Plot
# maximum information, minimum input: scatterplot +
# means, outliers, ellipse, least-squares lines with and w/o outliers
Plot(Years, Salary, enhance=TRUE)
# extend x and y axes
Plot(Years, Salary, scale_x=c(-10, 35, 10), scale_y=c(0,200000,10))
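# add text at three different specified locations
Plot(Years, Salary, add="Hi", x1=c(12, 16, 18), y1=c(80000, 100000, 60000))
# Cleveland dot plot with row names on the y-axis
Plot(Salary, row_names)
# Trellis plot with Gender converted to a factor, then restore the data
d <- factors(Gender, levels=c("M", "F"))
Plot(Years, Salary, by1=Gender)
d <- dd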
# just males employed more than 5 years
Plot(Years, Salary, filter=(Gender=="M" & Years > 5))
# plot 0.95 data ellipse with the points identified that represent
# outliers defined by a Mahalanobis Distance larger than 6
# save outliers into R object out
d[1, "Salary"] <- 200000
out <- Plot(Years, Salary, ellipse=0.95, MD_cut=6)
# new shape and point size, powderblue panel fill, no grid
# then put style back to default
style(panel_fill="powderblue", grid_color="off")
Plot(Years, Salary, size=2, shape="diamond")
style()
# translucent data ellipses without points or edges
# show the idealized joint distribution for bivariate normality
style(ellipse_color="off")
Plot(Years, Salary, size=0, ellipse=seq(.1,.9,.10))
style()
# bubble plot with size determined by the value of Pre
# display the value for the bubbles with values of min, median and max
Plot(Years, Salary, size=Pre, size_cut=3)
# variables in a data frame not the default d
# plot 0.6 and 0.9 data ellipses with partially transparent points
# change color theme to gold with black background
style("gold", sub_theme="black")
Plot(eruptions, waiting, transparency=.5, ellipse=c(.6,.9), data=faithful)
# scatterplot with two x-variables, plotted against Salary
# define a new style, then back to default
style(window_fill=rgb(247,242,230, maxColorValue=255),
panel_fill="off", panel_color="off", pt_fill="black", transparency=0,
lab_color="black", axis_text_color="black",
axis_y_color="off", grid_x_color="off", grid_y_color="black",
grid_lty="dotted", grid_lwd=1)
Plot(c(Pre, Post), Salary)
style()
# increase span (smoothing) from default of .7 to 1.25
# span is a loess parameter, which generates a caution that can be
# ignored that it is not a graphical parameter -- we know that
# display confidence intervals about best-fit line at
# 0.95 confidence level
Plot(Years, Salary, fit="loess", span=1.25)
# 2-D kernel density (more useful for larger sample sizes)
Plot(Years, Salary, smooth=TRUE)
#------------------------------------------------------
# scatterplot matrix from a vector of numeric variables
#------------------------------------------------------
# with least squares fit line
Plot(c(Salary, Years, Pre), fit="lm")
#--------------------------------------------------------------
# Trellis graphics and by for groups with two numeric variables
#--------------------------------------------------------------
# Trellis plot with condition on 1-variable
# optionally re-order default alphabetical R ordering by converting
# to a factor with lessR factors (which also does multiple variables)
# always save to the full data frame with factors
d <- factors(Gender, levels=c("M", "W"))
Plot(Years, Salary, by1=Gender)
d <- Read("Employee", quiet=TRUE)
# two Trellis classification variables with a single continuous variable
Plot(Salary, by1=Dept, by2=Gender)
# all three by (categorical) variables
Plot(Years, Salary, by1=Dept, by2=Gender, by=Plan)
# vary both shape and color with a least-squares fit line for each group
style(color=c("darkgreen", "brown"))
Plot(Years, Salary, by1=Gender, fit="lm", shape=c("F","M"), size=.8)
style("gray")
# compare the men and women Salary according to Years worked
# with an ellipse for each group
Plot(Years, Salary, by=Gender, ellipse=.50)
#--------------------------------------------------
# analysis of a single numeric variable (or vector)
#--------------------------------------------------
# One continuous variable
# -----------------------
# integrated Violin/Box/Scatterplot, a VBS plot
Plot(Salary)
# by variable, different colors for different values of the variable
# two panels
Plot(Salary, by1=Dept)
# large sample size
x <- rnorm(10000)
Plot(x)
# custom colors for outliers, which might not appear in this subset data
style(out_fill="hotpink", out2_fill="purple")
Plot(Salary)
style()
# no violin plot or scatterplot, just a boxplot
Plot(Salary, vbs_plot="b")
# or, the same with the mnemonic
BoxPlot(Salary)
# two related displays of box plots for different levels of a
# categorical variable
BoxPlot(Salary, by1=Dept)
# binned values to plot counts
# ----------------------------
# bin the values of Salary to plot counts as a frequency polygon
# the counts are plotted as points instead of the data
Plot(Salary, stat_x="count") # bin the values
# time charts
#------------
# run chart, with default fill area
Plot(Salary, run=TRUE, area_fill="on")
# two run charts in same plot
# or could do a multivariate time series
Plot(c(Pre, Post), run=TRUE)
# Trellis graphics run chart with custom line width, no points
Plot(Salary, run=TRUE, by1=Gender, lwd=3, size=0)
# daily time series plot
# create the daily time series from R built-in data set airquality
oz.ts <- ts(airquality$Ozone, start=c(1973, 121), frequency=365)
Plot(oz.ts)
# multiple time series plotted from dates and stacked
# black background with translucent areas, then reset theme to default
style(sub_theme="black", color="steelblue2", transparency=.55,
window_fill="gray10", grid_color="gray25")
date <- seq(as.Date("2013/1/1"), as.Date("2016/1/1"), by="quarter")
x1 <- rnorm(13, 100, 15)
x2 <- rnorm(13, 100, 15)
x3 <- rnorm(13, 100, 15)
df <- data.frame(date, x1, x2, x3)
rm(date); rm(x1); rm(x2); rm(x3)
Plot(date, x1:x3, data=df)
style()
# aggregate monthly data to plot by quarter
n.q <- 42
month <- seq(as.Date("2013/1/1"), length=n.q, by="months")
x <- rnorm(n.q, 100, 15)
Plot(month, x, time_unit="quarters")
# trigger a time series with a Date variable specified first
# stock prices for three companies by month: Apple, IBM, Intel
d <- rd("StockPrice")
# only plot Apple
Plot(Month, Price, filter=(Company=="Apple"))
# Trellis plots, one for each company
Plot(Month, Price, by1=Company, n_col=1)
# all three plots on the same panel, three shades of blue
Plot(Month, Price, by=Company, color="blues")
#------------------------------------------
# analysis of a single categorical variable
#------------------------------------------
d <- rd("Employee")
# default 1-D bubble plot
# frequency plot, replaces bar chart
Plot(Dept)
# plot of frequencies for each category (level), replaces bar chart
Plot(Dept, stat_x="count")
#----------------------------------------------------
# scatterplot of numeric against categorical variable
#----------------------------------------------------
# generate a chart with the plotted mean of each level
# rotate x-axis labels and then offset from the axis
style(rotate_x=45, offset=1)
Plot(Dept, Salary)
style()
#-------------------
# Cleveland dot plot
#-------------------
# row.names on the y-axis
Plot(Salary, row_names)
# standard scatterplot
Plot(Salary, row_names, sort_yx="0", segments_y=FALSE)
# Cleveland dot plot with two x-variables
Plot(c(Pre, Post), row_names)
#------------
# annotations
#------------
# add text at the one location specified by x1 and y1
Plot(Years, Salary, add="Hi There", x1=12, y1=80000)
# add text at three different specified locations
Plot(Years, Salary, add="Hi", x1=c(12, 16, 18), y1=c(80000, 100000, 60000))
# add three different text blocks at three different specified locations
Plot(Years, Salary, add=c("Hi", "Bye", "Wow"), x1=c(12, 16, 18),
y1=c(80000, 100000, 60000))
# add an 0.95 data ellipse and horizontal and vertical lines through the
# respective means
Plot(Years, Salary, ellipse=0.95, add=c("v_line", "h_line"),
x1="mean_x", y1="mean_y")
# can be done also with the following short-hand
Plot(Years, Salary, ellipse=0.95, add="means")
# a rectangle requires two points, four coordinates, <x1,y1> and <x2,y2>
style(add_trans=.8, add_fill="gold", add_color="gold4", add_lwd=0.5)
Plot(Years, Salary, add="rect", x1=12, y1=80000, x2=16, y2=115000)
# the first object, a rectangle, requires all four coordinates
# the vertical line at x=2 requires only an x1 coordinate, listed 2nd
Plot(Years, Salary, add=c("rect", "v_line"), x1=c(10, 2),
y1=80000, x2=12, y2=115000)
# two different rectangles with different locations, fill colors and translucence
style(add_fill=c("gold3", "green"), add_trans=c(.8,.4))
Plot(Years, Salary, add=c("rect", "rect"),
x1=c(10, 2), y1=c(60000, 45000), x2=c(12, 75000), y2=c(80000, 55000))
#----------------------------------------------------
# analysis of two categorical variables (Likert data)
#----------------------------------------------------
d <- rd("Mach4", quiet=TRUE) # Likert data, 0 to 5
# size of each plotted point (bubble) depends on its joint frequency
# triggered by default when replication of joint values and
# less than 9 unique data values for each
# n_cat=6 means treat responses as categorical for up to 6 equally-spaced
# integer values
Plot(m06, m07, n_cat=6)
# use value labels for the integer values, modify color options
LikertCats <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
"Slightly Agree", "Agree", "Strongly Agree")
style(fill="powderblue", color="blue", bubble_text="darkred")
d <- factors(m01:m20, 0:5, labels=LikertCats)
Plot(m01:m10)
style() # reset theme
# get correlation analysis instead of cross-tab analysis
# rely upon the default value of n_cat=0 so that integer
# valued variables are analyzed as numerical
Plot(m06, m07)
#-----------------------------
# Bubble Plot Frequency Matrix
#-----------------------------
#---------------
# function curve
#---------------
x <- seq(10,50,by=2)
y1 <- sqrt(x)
y2 <- x^.33
# x is sorted with equal intervals so run chart by default
Plot(x, y1)
# multiple plots from variable vectors need to have the variables
# in a data frame
d <- data.frame(x, y1, y2)
# if variables are in the user workspace and in a data frame
# with the same names, the user workspace versions are used,
# which do not work with vectors of variables, so remove
rm(x); rm(y1); rm(y2)
Plot(x, c(y1, y2))
#-----------
# modern art
#-----------
clr <- colors() # get list of color names
color0 <- clr[sample(1:length(clr), size=1)]
clr <- clr[-(153:353)] # get rid of most of the grays
n <- sample(5:30, size=1)
x <- rnorm(n)
y <- rnorm(n)
color1 <- clr[sample(1:length(clr), size=1)]
color2 <- clr[sample(1:length(clr), size=1)]
style(window_fill=color0, color=color2)
Plot(x, y, run=TRUE, area_fill="on",
xy_ticks=FALSE, main="Modern Art", xlab="", ylab="",
cex.main=2, col.main="lightsteelblue", n_cat=0, center_line="off")
style() # reset style to default