11:13 20 January 2026

Goals/contexts of data visualization

  • Exploration

  • Inference

  • Explanation

Exploration

Approaches

Goal: find patterns (& problems) in data, explore hypotheses

  • nonparametric/robust approaches: impose as few assumptions as possible
    • histograms for distributions
    • boxplots (median/IQR) instead of mean/std dev for grouped data
    • loess/GAM instead of linear/polynomial regression for continuous data
  • need for speed: quick and dirty
  • canned routines for standard tasks, flexibility for non-standard tasks
  • manipulate to visualize: summarize on the fly
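
A minimal base-R sketch of the exploratory tools listed above. The built-in mtcars data set and the chosen variables are assumptions made purely for illustration; they are not part of the original notes.

```r
# Quick exploratory looks at the built-in mtcars data (illustrative choice)
dat <- mtcars
dat$cyl <- factor(dat$cyl)

op <- par(mfrow = c(1, 3))   # three panels side by side

# Distribution of a single variable: histogram
hist(dat$mpg, main = "Distribution of mpg", xlab = "mpg")

# Grouped data: boxplots show the median and IQR of each group
boxplot(mpg ~ cyl, data = dat, xlab = "cylinders", ylab = "mpg")

# Continuous predictor: loess curve rather than a straight line
plot(mpg ~ wt, data = dat)
lo_fit <- loess(mpg ~ wt, data = dat)
wt_grid <- seq(min(dat$wt), max(dat$wt), length.out = 100)
lines(wt_grid, predict(lo_fit, newdata = data.frame(wt = wt_grid)))

par(op)                      # restore the previous graphics settings

# "Summarize on the fly": robust group summaries without a separate data step
group_summary <- aggregate(
  mpg ~ cyl, data = dat,
  FUN = function(x) c(median = median(x), IQR = IQR(x))
)
print(group_summary)
```

None of these steps assumes a distributional form or a functional shape, which is the point of the nonparametric/robust approach.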

Visualization cycle

HW on Twitter

  • Avoiding snooping:
    • Separate response variables from predictors
    • Look at residuals but not fits

diagnostic plots

  • determine fitting problems, evaluate model assumptions
    • badness of fit, heteroscedasticity, non-Normality, outliers
    • e.g. scale-location plot, Q-Q plot; plot.lm
  • plot methods: generic (e.g. residuals vs fitted) vs specific (e.g. residuals vs predictors)
  • predictions (intuitive) vs residuals (amplifies/zooms in on discrepancies)
    • looking for absence of patterns in residuals
  • performance::check_model(); DHARMa package
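
As a hedged illustration of generic versus specific diagnostic plots, the sketch below fits a deliberately simple linear model to the built-in mtcars data (the data set, the model, and the choice of hp as an omitted predictor are all assumptions for the example).

```r
# A deliberately simple model, so that the diagnostics have something to show
fit <- lm(mpg ~ wt, data = mtcars)

# Generic diagnostics via plot.lm: residuals vs fitted, Normal Q-Q,
# scale-location, and residuals vs leverage (the default set of four)
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Specific diagnostic: residuals against a predictor left out of the model.
# A visible trend suggests that the predictor carries unexplained signal.
res <- residuals(fit)
plot(mtcars$hp, res, xlab = "hp (not in the model)", ylab = "residual")
abline(h = 0, lty = 2)
lines(lowess(mtcars$hp, res))   # smooth guide to any remaining pattern

# If the performance package is installed, performance::check_model(fit)
# draws a comparable panel of checks in a single call.
```

In each panel the goal is the absence of pattern: a flat smooth line and roughly constant spread.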

visual diagnostics

Inference

What is your model telling you?

  • A picture is worth 1000 words
  • coefficient plots: replacement for tables (Gelman, Pasarica, and Dodhia 2002); dotwhisker::dwplot
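
A coefficient plot can be sketched by hand in base R, as below. The mtcars model and the decision to standardize the predictors are assumptions for illustration; with the dotwhisker package installed, dotwhisker::dwplot(fit) produces a packaged version of the same idea.

```r
dat <- mtcars
vars <- c("wt", "hp", "qsec")

# Standardize predictors so coefficients are comparable (effect per 1 SD)
dat[vars] <- lapply(dat[vars], function(x) as.numeric(scale(x)))
fit <- lm(mpg ~ wt + hp + qsec, data = dat)

est <- coef(fit)[-1]                     # drop the intercept
ci  <- confint(fit)[-1, , drop = FALSE]  # 95% confidence intervals

# Point estimates with interval whiskers, one row per coefficient
ypos <- seq_along(est)
plot(est, ypos, xlim = range(ci, 0), pch = 16, yaxt = "n",
     xlab = "change in mpg per 1 SD of predictor", ylab = "")
segments(ci[, 1], ypos, ci[, 2], ypos)
axis(2, at = ypos, labels = names(est), las = 1)
abline(v = 0, lty = 2)                   # reference line at zero effect
```

Compared with a table, the plot makes sign, magnitude, and uncertainty readable at a glance.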

Tables vs figures

Explanation

Questions

  • should analyses match graphs?
    • “Let the data speak for themselves” vs “Tell a story”
  • display data (e.g. boxplots, standard deviations) or inferences from data (confidence intervals)
  • superimposing model fits (geom_smooth)

Goals

  • tell an accurate story
  • high information density
  • Tufte, Cleveland

Cleveland hierarchy

  • Cleveland hierarchy (most to least accurately perceived): position, length, angle, area, volume, colour; also see John Rauser (2016)

Aesthetics

other ideas

  • Extending the pipeline: R vs. GLE vs. D3.js vs. plot.ly vs. …
  • Showing the data vs. representing the statistical model accurately
  • Dynamic graphics (animations, pop-ups/hovertext/responsive elements, etc.)

R approaches

Base graphics

  • advantages: speed, simplicity, breadth
  • disadvantages: lack of structure, canvas (draw-and-forget) metaphor
  • basic commands:
    • plot(), lines() and points()
    • legend(), axis() (for decoration)
    • other kinds of graphs: matplot() for multi-series plots, boxplot() for box plots, hist() for histograms, contour() and image(), pairs(), …
    • par(mfrow), layout() for multiple plots on a page
    • useful packages: car (for scatterplot(), scatterplotMatrix()), plotrix (miscellaneous “plot tricks”), tinyplot (lightweight flexible base plots)
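
The basic commands above combine roughly as follows. This is a small sketch using made-up sine and cosine curves rather than real data.

```r
x <- seq(0, 2 * pi, length.out = 50)

op <- par(mfrow = c(1, 2))   # two plots on one page

# Panel 1: plot() opens the canvas; lines(), points(), legend() draw onto it
plot(x, sin(x), type = "l", ylab = "value", main = "plot() + lines()")
lines(x, cos(x), lty = 2)
idx <- c(1, 25, 50)
points(x[idx], sin(x[idx]), pch = 16)
legend("bottomleft", legend = c("sin", "cos"), lty = c(1, 2))

# Panel 2: matplot() draws every column of a matrix as its own series
y_mat <- cbind(sin = sin(x), cos = cos(x))
matplot(x, y_mat, type = "l", lty = 1:2, col = 1:2,
        ylab = "value", main = "matplot()")

par(op)                      # restore the previous graphics settings
```

Each call draws immediately and nothing is stored, which is the canvas (draw-and-forget) metaphor in practice: changing an earlier element means re-running the whole sequence.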

ggplot(2)

  • data
  • mappings: between variables in the data frame and aesthetics, or graphical attributes (x position, y position, size, colour …)
  • first two show up as (e.g.)
    ggplot(my_data,aes(x=age,y=rootgrowth,colour=phosphate))
  • geoms:
    • simple: geom_point, geom_line, e.g.
      ggplot(my_data,aes(x=age,y=rootgrowth,colour=phosphate))+geom_point()
    • more complex: geom_boxplot, geom_smooth
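
To make the data/mapping/geom split concrete, here is a runnable sketch. The notes do not supply my_data, so the data frame below is simulated and entirely hypothetical; only the column names come from the example above.

```r
library(ggplot2)

# Hypothetical stand-in for my_data (simulated; not real measurements)
set.seed(101)
my_data <- data.frame(
  age = rep(1:10, times = 3),
  phosphate = rep(c("low", "medium", "high"), each = 10)
)
my_data$rootgrowth <- 2 + 0.5 * my_data$age + rnorm(nrow(my_data))

# Data + mappings: nothing is drawn until a geom is added
p0 <- ggplot(my_data, aes(x = age, y = rootgrowth, colour = phosphate))

# Simple geom: one point per row
p1 <- p0 + geom_point()

# More complex geom: a fitted line (with confidence band) for each colour group
p2 <- p1 + geom_smooth(method = "lm", formula = y ~ x)

print(p2)
```

Because the plot is an object, p0 can be reused with different geoms without repeating the data and mappings.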

more on ggplot

  • geoms are added to an existing data/mapping combination
  • facets: facet_wrap (free-form wrapping of subplots), facet_grid (two-D grid of subplots)
  • also: scales, coordinate transformations, statistical summaries, position adjustments
  • advantages: pretty defaults (mostly), flexible, easy to overlay model predictions etc.
  • disadvantages: slow, magical, tricky to customize
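
A brief sketch of faceting and scales, again assuming the built-in mtcars data for illustration.

```r
library(ggplot2)

# One data/mapping combination, reused for several plots
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# facet_wrap: subplots for one variable, wrapped to fill the available space
p_wrap <- base_plot + facet_wrap(~ cyl)

# facet_grid: rows and columns defined by two variables
p_grid <- base_plot + facet_grid(am ~ cyl)

# Scales are added in the same way, here a log-scaled x axis
p_log <- p_wrap + scale_x_log10()

print(p_grid)
```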

General strategies

  • primary response as y axis
  • most important predictor as x axis
    • (or substitute most important continuous or many-category predictor)
  • next (categorical) predictors as groupings (preferably colour/shape)
  • next (categorical) predictors as facets

More rules of thumb

  • Most salient comparisons within the same subplot (distinguished by color/shape), and nearby within the subplot when grouping bars/points
  • Facet rows > facet columns
  • flip axes to display labels better (ggplot: coord_flip(), ggstance package)
  • Use transparency to include important but potentially distracting detail
  • Do category levels need to be identified or just distinguished?

  • Order categorical variables meaningfully (“What’s special about Alabama [or Alberta]?”): forcats::fct_reorder(), forcats::fct_infreq()
  • Choose whether to display population variation (standard deviations, boxplots) or mean variation (mean \(\pm\) 1 or 2 SE, boxplot notches)
  • Choose colors carefully (RColorBrewer/ColorBrewer, IWantHue; colorspace, viridis, ggokabeito packages): respect dichromats
  • visual design (tweaking) vs. reproducibility (e.g. ggrepel, directlabels packages)
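
Meaningful ordering of categories and flipped axes can be sketched as below. The built-in USArrests data set is an assumption for the example, and base R's reorder() stands in for forcats::fct_reorder() to avoid an extra dependency.

```r
library(ggplot2)

dat <- data.frame(state = rownames(USArrests),
                  murder = USArrests$Murder)

# Alphabetical order is rarely meaningful; order the states by the value shown
dat$state_ordered <- reorder(dat$state, dat$murder)

# Putting the categories on the y axis keeps the long labels horizontal,
# which has the same effect as flipping the axes of a vertical plot
p_ordered <- ggplot(dat, aes(x = murder, y = state_ordered)) +
  geom_point() +
  labs(x = "murder arrests per 100,000", y = NULL)

print(p_ordered)
```

With the states sorted, the extremes and any gaps in the distribution are visible immediately, which an alphabetical listing would hide.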

challenges:

  • multiple continuous predictors
  • multivariate responses
  • high-dimensional data generally
  • factors with many (unordered) levels
  • spatial data
  • phylogenetic trees
  • big data

Data size affects graphic choices

  • small: show all points, possibly dodged/jittered, with some summary statistics: dotplot, beeswarm. Simple trends (linear/GLM)
  • medium: boxplots, loess, histograms, GAM (or linear regression)
  • large: modern nonparametrics: violin plots, hexbin plots, kernel densities
  • combinations or overlays where appropriate (beanplot)
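
The effect of data size on the choice of graphic can be sketched with simulated data. The helper make_data() and the sample sizes are hypothetical, chosen only to contrast the three regimes.

```r
library(ggplot2)

# Hypothetical helper: three groups with shifted means, n points per group
make_data <- function(n_per_group) {
  data.frame(
    group = rep(c("a", "b", "c"), each = n_per_group),
    y = rnorm(3 * n_per_group, mean = rep(c(0, 1, 2), each = n_per_group))
  )
}

set.seed(101)
small  <- make_data(10)
medium <- make_data(200)
large  <- make_data(5000)

# Small: show every point (jittered sideways only) plus a median bar
p_small <- ggplot(small, aes(x = group, y = y)) +
  geom_jitter(width = 0.1, height = 0) +
  stat_summary(fun = median, fun.min = median, fun.max = median,
               geom = "crossbar", width = 0.4)

# Medium: boxplots summarize the distribution of each group
p_medium <- ggplot(medium, aes(x = group, y = y)) + geom_boxplot()

# Large: violins show the full shape of each distribution
p_large <- ggplot(large, aes(x = group, y = y)) + geom_violin()

print(p_small)
```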

Examples

Bits and pieces

References