Data Visualization in R

10:11 11 March 2024

Goals/contexts of data visualization

Exploration
Inference
Explanation

Exploration

Approaches

Goal: find patterns (& problems) in data, explore hypotheses

nonparametric/robust approaches: impose as few assumptions as possible
- histograms for distributions
- boxplots (median/IQR) instead of mean/std dev for grouped data
- loess/GAM instead of linear/polynomial regression for continous data
need for speed: quick and dirty
canned routines for standard tasks, flexibility for non-standard tasks
manipulate to visualize: summarize on the fly

Visualization cycle

HW on Twitter

Avoiding snooping:
- Separate response variables from predictors
- Look at residuals but not fits

diagnostic plots

determine fitting problems, evaluate model assumptions
- badness of fit, heteroscedasticity, non-Normality, outliers
- e.g. scale-location plot, Q-Q plot; plot.lm
plot methods: generic (e.g. residuals vs fitted) vs specific (e.g. residuals vs predictors)
predictions (intuitive) vs residuals (amplifies/zooms in on discrepancies)
- looking for absence of patterns in residuals
performance::check_model(), DHARMa packages

Inference

What is your model telling you?

A picture is worth 1000 words
coefficient plots: replacement for tables (Gelman, Pasarica, and Dodhia 2002); dotwhisker::dwplot
also: tests of inference (Wickham et al. 2010)

Explanation

Questions

should analyses match graphs?
- “Let the data speak for themselves” vs “Tell a story”
display data (e.g. boxplots, standard deviations) or inferences from data (confidence intervals)
superimposing model fits (geom_smooth)

Goals

tell an accurate story
high information density
Tufte, Cleveland

Cleveland hierarchy

Cleveland hierarchy: position, length, angle, area, volume, colour: also see John Rauser (2016)

Aesthetics

Edward Tufte: chartjunk, data- and non-data-ink, small multiples, direct labeling, …
graphics fascists: dynamite plots, pie charts, dual-axes plots …

other ideas

Extending the pipeline: R vs GLE vs. D3.js vs. plot.ly vs. …
Telling a story vs. letting the data speak
Showing the data vs. representing the statistical model accurately
Dynamic graphics

R approaches

Base graphics

advantages: speed, simplicity, breadth
disadvantages: lack of structure, canvas (draw-and-forget) metaphor
basic commands:
- plot(), lines() and points()
- legend(), axis() (for decoration)
- other kinds of graphs: matplot() for multi-series plots, boxplot() for box plots, hist() for histograms, contour() and image(), pairs(), …
- par(mfrow), layout() for multiple plots on a page
- useful packages: car (for scatterplot(), scatterplotMatrix()), plotrix (miscellaneous “plot tricks”)

ggplot(2)

data
mappings: between variables in the data frame and aesthetics, or graphical attributes (x position, y position, size, colour …)
first two show up as (e.g.)
ggplot(my_data,aes(x=age,y=rootgrowth,colour=phosphate))
geoms:
simple: geom_point, geom_line

(ggplot(my_data,aes(x=age,y=rootgrowth,colour=phosphate))+geom_point())

more complex: geom_boxplot, geom_smooth

General strategies

primary response as y axis
most important predictor as x axis
- (or substitute most important continuous or many-category predictor)
next (categorical) predictors as groupings (preferably colour/shape)
next (categorical) predictors as facets

More rules of thumb

Most salient comparisons within the same subplot (distinguished by color/shape), and nearby within the subplot when grouping bars/points
Facet rows > facet columns
flip axes to display labels better (ggplot: coord_flip(), ggstance() package)
Use transparency to include important but potentially distracting detail
Do category levels need to be identified or just distinguished?

Order categorical variables meaningfully (“What’s wrong with Alabama?”): forcats::fct_reorder(), forcats::fct_infreq()
Choose population variation (standard deviations, boxplots) vs. estimate variation (standard errors, mean $\pm$ 2 SE, boxplot notches)
Choose colors carefully (RColorBrewer/ColorBrewer, IWantHue), colorspace, viridis packages: respect dichromats and B&W printouts
visual design (tweaking) vs. reproducibility (e.g. ggrepel, directlabels packages)

challenges:

multiple continuous predictors
multivariate responses
high-dimensional data generally
factors with lots of (unordered) levels
spatial data
phylogenetic trees + tip data
Large data sets

Data size affects graphic choices

small: show all points, possibly dodged/jittered, with some summary statistics: dotplot, beeswarm. Simple trends (linear/GLM)
medium: boxplots, loess, histograms, GAM (or linear regression)
large: modern nonparametrics: violin plots, hexbin plots, kernel densities computational burden, and display overlapping problems, relevant
combinations or overlays where appropriate (beanplot)

Bits and pieces

ggplot2 extensions
Cookbook for R: Graphs, R Graph Gallery, R Graphics Cookbook ($$)
Exporting graphics: bitmap (PNG, JPEG) vs vector (SVG, PDF) representations
Andrew Gelman’s blog: Infovis vs. statistical graphics
Paul Krugman on axes starting from zero
Beyond 2D: googleVis, Rggobi, rgl, Mondrian, …

References

Gelman, Andrew, Cristian Pasarica, and Rahul Dodhia. 2002. “Let’s Practice What We Preach: Turning Tables into Graphs.” The American Statistician 56 (2): 121–30. http://www.tandfonline.com/doi/abs/10.1198/000313002317572790.

John Rauser. 2016. “How Humans See Data.” https://www.youtube.com/watch?v=fSgEeI2Xpdc.

Wickham, H., D. Cook, H. Hofmann, and Andreas Buja. 2010. “Graphical Inference for Infovis.” IEEE Transactions on Visualization and Computer Graphics 16 (6): 973–79. https://doi.org/10.1109/TVCG.2010.161.