Goals/contexts of data visualization

Exploration

Approaches

Goal: find patterns (& problems) in data, explore hypotheses

  • nonparametric/robust approaches: impose as few assumptions as possible
    • histograms for distributions
    • boxplots (median/IQR) instead of mean/std dev for grouped data
    • loess/GAM instead of linear/polynomial regression for continuous data
  • need for speed: quick and dirty
  • canned routines for standard tasks, flexibility for non-standard tasks
  • manipulate to visualize: summarize on the fly
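
The approaches above might look like this in base R. This is a minimal sketch; the built-in airquality data set and the choice of variables are assumptions made for illustration, not part of the original notes.

```r
# Exploratory sketch using the built-in 'airquality' data (an assumption;
# any data frame with a numeric response and a grouping variable would do).
aq <- na.omit(airquality[, c("Ozone", "Temp", "Month")])

op <- par(mfrow = c(1, 3))   # three panels side by side

# Distribution: a histogram imposes no assumed shape
hist(aq$Ozone, breaks = 20, main = "Ozone distribution", xlab = "Ozone")

# Grouped data: boxplots show median and IQR rather than mean and std dev
boxplot(Ozone ~ Month, data = aq, main = "Ozone by month",
        xlab = "Month", ylab = "Ozone")

# Continuous predictor: a loess smooth instead of a straight line
plot(Ozone ~ Temp, data = aq, main = "Ozone vs temperature")
lo <- loess(Ozone ~ Temp, data = aq)
temp_grid <- seq(min(aq$Temp), max(aq$Temp), length.out = 100)
lines(temp_grid, predict(lo, newdata = data.frame(Temp = temp_grid)),
      col = "red")

par(op)                      # restore the previous graphics settings
```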

Visualization cycle

HW on Twitter

  • Avoiding snooping:
    • Separate response variables from predictors
    • Look at residuals but not fits

diagnostic plots

  • determine fitting problems, evaluate model assumptions
    • badness of fit, heteroscedasticity, non-Normality, outliers
    • e.g. scale-location plot, Q-Q plot; plot.lm
  • plot methods: generic (e.g. residuals vs fitted) vs specific (e.g. residuals vs predictors)
  • predictions (intuitive) vs residuals (amplifies/zooms in on discrepancies)
    • looking for absence of patterns in residuals
  • performance::check_model(), DHARMa packages
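
A hedged sketch of the generic and specific diagnostic plots mentioned above. The model (mpg ~ wt + hp on the built-in mtcars data) is an arbitrary choice for illustration.

```r
# Fit an illustrative linear model (the model itself is an assumption)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Generic diagnostics: plot.lm draws residuals vs fitted, Q-Q,
# scale-location, and residuals vs leverage by default
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Specific diagnostic: residuals against one predictor.
# Ideally there is no visible pattern around the zero line.
plot(mtcars$wt, residuals(fit), xlab = "wt", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(mtcars$wt, residuals(fit)), col = "red")

# If the 'performance' package is installed, performance::check_model(fit)
# produces a comparable panel of checks.
```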

Inference

What is your model telling you?

  • A picture is worth 1000 words
  • coefficient plots: replacement for tables (Gelman, Pasarica, and Dodhia 2002); dotwhisker::dwplot
  • also: tests of inference (Wickham et al. 2010)
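
A coefficient plot can be put together by hand in base R, roughly as follows. This is a sketch under assumptions: the model and the mtcars data are illustrative, and predictors are standardized so that the estimates share a scale.

```r
# Standardize the predictors so coefficients are comparable (illustrative)
d <- mtcars[, c("mpg", "wt", "hp", "qsec")]
d[-1] <- lapply(d[-1], function(x) (x - mean(x)) / sd(x))

fit <- lm(mpg ~ wt + hp + qsec, data = d)

est <- coef(fit)[-1]                       # drop the intercept
ci  <- confint(fit)[-1, , drop = FALSE]    # 95% confidence intervals

y <- seq_along(est)
plot(est, y, xlim = range(ci, 0), yaxt = "n", pch = 16,
     xlab = "Estimate (change in mpg per SD of predictor)", ylab = "")
segments(ci[, 1], y, ci[, 2], y)           # whiskers for the intervals
abline(v = 0, lty = 2)                     # reference line at zero
axis(2, at = y, labels = names(est), las = 1)

# With the 'dotwhisker' package installed, dotwhisker::dwplot(fit)
# gives a ggplot-based version of the same display.
```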

Explanation

Questions

  • should analyses match graphs?
    • “Let the data speak for themselves” vs “Tell a story”
  • display data (e.g. boxplots, standard deviations) or inferences from data (confidence intervals)
  • superimposing model fits (geom_smooth)

Goals

  • tell an accurate story
  • high information density
  • Tufte, Cleveland

Cleveland hierarchy

  • Cleveland hierarchy, roughly from most to least accurately perceived: position, length, angle, area, volume, colour; also see John Rauser (2016)

Aesthetics

other ideas

  • Extending the pipeline: R vs GLE vs. D3.js vs. plot.ly vs. …
  • Telling a story vs. letting the data speak
  • Showing the data vs. representing the statistical model accurately
  • Dynamic graphics

R approaches

Base graphics

  • advantages: speed, simplicity, breadth
  • disadvantages: lack of structure, canvas (draw-and-forget) metaphor
  • basic commands:
    • plot(), lines() and points()
    • legend(), axis() (for decoration)
    • other kinds of graphs: matplot() for multi-series plots, boxplot() for box plots, hist() for histograms, contour() and image(), pairs(), …
    • par(mfrow), layout() for multiple plots on a page
    • useful packages: car (for scatterplot(), scatterplotMatrix()), plotrix (miscellaneous “plot tricks”)
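
The basic commands combine as in the sketch below. The curves are made up for illustration; the point is the draw-and-forget sequence of calls.

```r
op <- par(mfrow = c(1, 2))            # two plots on one page

# Left panel: plot() sets up the canvas, later calls draw on top of it
x <- seq(0, 2 * pi, length.out = 50)
idx <- seq(1, 50, by = 7)
plot(x, sin(x), type = "l", xlab = "x", ylab = "y", axes = FALSE)
lines(x, cos(x), lty = 2)
points(x[idx], sin(x[idx]), pch = 16)
axis(1); axis(2); box()               # decoration added by hand
legend("bottomleft", legend = c("sin", "cos"), lty = c(1, 2), bty = "n")

# Right panel: matplot() draws several series in one call
matplot(x, cbind(sin(x), cos(x), sin(2 * x)), type = "l",
        xlab = "x", ylab = "y")

par(op)                               # restore the previous settings
```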

ggplot(2)

  • data
  • mappings: between variables in the data frame and aesthetics, or graphical attributes (x position, y position, size, colour …)
  • first two show up as (e.g.)
    ggplot(my_data,aes(x=age,y=rootgrowth,colour=phosphate))
  • geoms:
    • simple: geom_point, geom_line
      ggplot(my_data,aes(x=age,y=rootgrowth,colour=phosphate))+geom_point()
    • more complex: geom_boxplot, geom_smooth

more on ggplot

  • geoms are added to an existing data/mapping combination
  • facets: facet_wrap (free-form wrapping of subplots), facet_grid (two-D grid of subplots)
  • also: scales, coordinate transformations, statistical summaries, position adjustments
  • advantages: pretty defaults (mostly), flexible, easy to overlay model predictions etc.
  • disadvantages: slow, magical, tricky to customize
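
The layering idea can be sketched as follows. The notes do not include my_data, so the data frame here is simulated and its values are purely hypothetical; only the column names follow the example above.

```r
library(ggplot2)

# Simulated stand-in for 'my_data' (hypothetical values, illustration only)
set.seed(101)
my_data <- expand.grid(age = 1:10, phosphate = c("low", "high"), rep = 1:5)
my_data$rootgrowth <- with(my_data,
  2 + 0.5 * age + ifelse(phosphate == "high", 1.5, 0) + rnorm(nrow(my_data)))

# Data and mappings come first ...
p0 <- ggplot(my_data, aes(x = age, y = rootgrowth, colour = phosphate))

# ... then geoms are added to that data/mapping combination
p1 <- p0 + geom_point() + geom_smooth(method = "lm", formula = y ~ x)

# Facets split the same plot into subplots by a variable
p2 <- p1 + facet_wrap(~phosphate)

print(p2)
```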

General strategies

  • primary response as y axis
  • most important predictor as x axis
    • (or substitute most important continuous or many-category predictor)
  • next (categorical) predictors as groupings (preferably colour/shape)
  • next (categorical) predictors as facets
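
One possible reading of these strategies, using the built-in mtcars data. Which variable counts as the response or the most important predictor is an assumption made for the example.

```r
library(ggplot2)

# Treat cyl and am as categorical predictors (am: 0 = automatic, 1 = manual)
d <- transform(mtcars,
               cyl = factor(cyl),
               am  = factor(am, labels = c("automatic", "manual")))

p <- ggplot(d, aes(x = wt, y = mpg,          # predictor on x, response on y
                   colour = cyl, shape = cyl)) +  # next predictor: grouping
  geom_point(size = 2) +
  facet_wrap(~am)                            # following predictor: facets

print(p)
```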

More rules of thumb

  • Most salient comparisons within the same subplot (distinguished by color/shape), and nearby within the subplot when grouping bars/points
  • Facet rows > facet columns
  • flip axes to display labels better (ggplot: coord_flip(), ggstance package)
  • Use transparency to include important but potentially distracting detail
  • Do category levels need to be identified or just distinguished?

  • Order categorical variables meaningfully (“What’s wrong with Alabama?”): forcats::fct_reorder(), forcats::fct_infreq()
  • Choose population variation (standard deviations, boxplots) vs. estimate variation (standard errors, mean ± 2 SE, boxplot notches)
  • Choose colors carefully (RColorBrewer/ColorBrewer, IWantHue; colorspace and viridis packages): respect dichromats and B&W printouts
  • visual design (tweaking) vs. reproducibility (e.g. ggrepel, directlabels packages)
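
Two of these rules, meaningful ordering and population vs. estimate variation, can be sketched with the built-in InsectSprays data. Base R's reorder() is used here to avoid extra dependencies; forcats::fct_reorder() plays the same role in tidyverse code.

```r
d <- InsectSprays

# Order the categories by their median response, not alphabetically
d$spray <- reorder(d$spray, d$count, FUN = median)

# Per-group summaries: SD describes the population, SE the estimate
grp_mean <- tapply(d$count, d$spray, mean)
grp_sd   <- tapply(d$count, d$spray, sd)
grp_n    <- tapply(d$count, d$spray, length)
grp_se   <- grp_sd / sqrt(grp_n)

op <- par(mfrow = c(1, 2))

# Population variation: boxplots of the raw data
boxplot(count ~ spray, data = d, main = "Population variation")

# Estimate variation: group means with +/- 2 SE whiskers
x <- seq_along(grp_mean)
plot(x, grp_mean, pch = 16, xaxt = "n", xlab = "spray", ylab = "count",
     ylim = range(grp_mean - 2 * grp_se, grp_mean + 2 * grp_se),
     main = "Mean +/- 2 SE")
axis(1, at = x, labels = names(grp_mean))
segments(x, grp_mean - 2 * grp_se, x, grp_mean + 2 * grp_se)

par(op)
```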

challenges:

  • multiple continuous predictors
  • multivariate responses
  • high-dimensional data generally
  • factors with lots of (unordered) levels
  • spatial data
  • phylogenetic trees + tip data
  • Large data sets

Data size affects graphic choices

  • small: show all points, possibly dodged/jittered, with some summary statistics: dotplot, beeswarm. Simple trends (linear/GLM)
  • medium: boxplots, loess, histograms, GAM (or linear regression)
  • large: modern nonparametrics: violin plots, hexbin plots, kernel densities; computational burden and overplotting become relevant
  • combinations or overlays where appropriate (beanplot)
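
A rough base-R illustration of how the display might change with data size. The data are simulated and the sample sizes (20, 400, 10000) are arbitrary stand-ins for "small", "medium" and "large", not thresholds from the notes.

```r
set.seed(1)
# Two groups each; sizes are arbitrary illustrations of small/medium/large
small  <- data.frame(g = gl(2, 10),   y = rnorm(20))
medium <- data.frame(g = gl(2, 200),  y = rnorm(400))
large  <- data.frame(g = gl(2, 5000), y = rnorm(10000))

op <- par(mfrow = c(1, 3))

# Small: show every point, jittered to reduce overlap
stripchart(y ~ g, data = small, method = "jitter", vertical = TRUE,
           pch = 16, main = "small: all points")

# Medium: summarize with boxplots
boxplot(y ~ g, data = medium, main = "medium: boxplots")

# Large: kernel density estimates, one curve per group
dens <- lapply(split(large$y, large$g), density)
plot(dens[[1]], main = "large: kernel densities", xlab = "y")
lines(dens[[2]], lty = 2)

par(op)
```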

Bits and pieces

References

Gelman, Andrew, Cristian Pasarica, and Rahul Dodhia. 2002. “Let’s Practice What We Preach: Turning Tables into Graphs.” The American Statistician 56 (2): 121–30. http://www.tandfonline.com/doi/abs/10.1198/000313002317572790.
Rauser, John. 2016. “How Humans See Data.” https://www.youtube.com/watch?v=fSgEeI2Xpdc.
Wickham, Hadley, Dianne Cook, Heike Hofmann, and Andreas Buja. 2010. “Graphical Inference for Infovis.” IEEE Transactions on Visualization and Computer Graphics 16 (6): 973–79. https://doi.org/10.1109/TVCG.2010.161.