Exploring data
Rote analysis vs. snooping


Spurious correlations
There’s a whole website about this
What can you do?
The best you can
Identify scientific questions
Distinguish between exploratory and confirmatory analysis
Pre-register studies when possible
Keep an exploration and analysis journal
Explore predictors and responses separately at first
Individual variables
Orchard data
A standard plot

A terrible plot

What does this one mean?

A non-parametric plot
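A boxplot is one common non-parametric display: it shows medians and quartiles rather than means and standard errors. The sketch below assumes the built-in OrchardSprays data set stands in for the orchard data; that choice, the log scale, and the object name p_box are illustrative assumptions, not necessarily what the original figure used.

```r
library(ggplot2)

# OrchardSprays ships with R (datasets package); treating it as the
# "orchard data" is an assumption made for this sketch.
p_box <- ggplot(OrchardSprays, aes(x = treatment, y = decrease)) +
  geom_boxplot() +          # median, quartiles, and outlying points
  scale_y_log10() +         # responses are positive and strongly skewed
  labs(x = "Treatment", y = "Decrease")

print(p_box)
```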

Bike example
Just the means

Standard errors

Standard deviations (2 sd, in fact)
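The difference between these displays is easy to see by computing the summaries directly. This is a minimal sketch using the built-in InsectSprays data as a stand-in (the bike data are not available here); the names summ, p_se, and p_2sd are illustrative.

```r
library(ggplot2)

# Per-group mean, standard deviation, and sample size
grp_mean <- tapply(InsectSprays$count, InsectSprays$spray, mean)
grp_sd   <- tapply(InsectSprays$count, InsectSprays$spray, sd)
grp_n    <- tapply(InsectSprays$count, InsectSprays$spray, length)

summ <- data.frame(
  spray = names(grp_mean),
  mean  = as.numeric(grp_mean),
  sd    = as.numeric(grp_sd),
  n     = as.numeric(grp_n)
)
summ$se <- summ$sd / sqrt(summ$n)   # standard error of the mean

# Mean +/- 1 standard error: uncertainty about the mean
p_se <- ggplot(summ, aes(x = spray, y = mean)) +
  geom_pointrange(aes(ymin = mean - se, ymax = mean + se))

# Mean +/- 2 standard deviations: spread of the data themselves
p_2sd <- ggplot(summ, aes(x = spray, y = mean)) +
  geom_pointrange(aes(ymin = mean - 2 * sd, ymax = mean + 2 * sd))
```

The standard-error bars shrink as the sample size grows; the standard-deviation bars do not, so the two plots answer different questions.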

Data shape

Shape and weight

Log scales
- In general:
- If your logged data span fewer than 3 decades, use human-readable numbers (e.g., 10–5000 kilotons per hectare)
- If not, just embrace "logs" (e.g., log10 particles per µl ranges from 3 to 8)
- But remember these are not physical values
- I love natural logs, but not as axis values
- Except to represent proportional difference!
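The two cases above might look like this in ggplot2. The data are simulated purely for illustration, and the variable names (yield_df, conc_df) and break values are assumptions of this sketch.

```r
library(ggplot2)

set.seed(1)

# Case 1: fewer than 3 decades (10 to 5000). Keep the axis on a log
# scale but label it with the original, human-readable numbers.
yield_df <- data.frame(x = 1:50, y = 10^runif(50, 1, log10(5000)))
p_readable <- ggplot(yield_df, aes(x, y)) +
  geom_point() +
  scale_y_log10(breaks = c(10, 100, 1000, 5000))

# Case 2: five decades (1e3 to 1e8). Plot the log10 values themselves
# and say so in the axis label.
conc_df <- data.frame(x = 1:50, conc = 10^runif(50, 3, 8))
p_logged <- ggplot(conc_df, aes(x, log10(conc))) +
  geom_point() +
  labs(y = "log10 particles per ul")
```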
Bivariate data
Banking
- Banking is a real thing
- Even though many examples are bogus
- Since the point is to make patterns visually clear, trial and error is usually as good as an algorithm
Sunspots

Code (with built-in data)
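A minimal sketch of banking by trial and error, using the built-in sunspot.year series. The aspect-ratio values are arbitrary starting points chosen for this sketch, not the output of a banking algorithm.

```r
library(ggplot2)

# Convert the built-in yearly sunspot time series to a data frame
sun <- data.frame(
  year  = as.numeric(time(sunspot.year)),
  spots = as.numeric(sunspot.year)
)

p_sun <- ggplot(sun, aes(year, spots)) + geom_line()

# Square panel: the cycles are squeezed into near-vertical spikes
p_square <- p_sun + theme(aspect.ratio = 1)

# Wide, short panel: slopes are gentler, so the asymmetry between the
# rising and falling parts of each cycle becomes easier to see
p_banked <- p_sun + theme(aspect.ratio = 0.1)

print(p_banked)
```

Adjusting aspect.ratio up or down and re-plotting is the trial-and-error step.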
Is smoking good for you?
Smoking data

Scatter plots
Scatter plot

Seeing the density better

Seeing the density worse

Maybe fixed

A loess trend line

Two loess trend lines

Many loess trend lines
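One way to draw several loess lines on the same scatter plot is to vary the span, which controls how local the fit is. This sketch uses the built-in faithful data as a stand-in; the span values and the name p_loess are arbitrary choices for illustration.

```r
library(ggplot2)

spans <- c(0.3, 0.75, 1.5)   # small span = wigglier, more local fit

p_loess <- ggplot(faithful, aes(waiting, eruptions)) +
  geom_point(alpha = 0.4)

# Add one loess smooth per span value
for (s in spans) {
  p_loess <- p_loess +
    geom_smooth(method = "loess", formula = y ~ x, span = s, se = FALSE)
}

print(p_loess)
```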

Theory of loess
Robust methods
- Loess is local, but not robust
- Uses least squares, can respond strongly to outliers
- R has a very flexible function called rlm (in the MASS package) to do robust fitting
- Not local
- But can be combined with splines
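The points above can be sketched on simulated data: fit a default loess and an rlm with a natural-spline basis, then add one gross outlier and see how far each fitted curve moves. The data, the spline degrees of freedom, and the variable names are all assumptions of this sketch.

```r
library(MASS)      # rlm()
library(splines)   # ns(), natural cubic spline basis

set.seed(42)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)

# Same data with a single gross outlier
y_out <- y
y_out[50] <- y[50] + 20

# Default loess (local least squares), with and without the outlier
lo_clean <- loess(y ~ x)
lo_out   <- loess(y_out ~ x)

# Robust regression on a spline basis: flexible in x, but the
# M-estimator down-weights observations with large residuals
rl_clean <- rlm(y ~ ns(x, df = 5))
rl_out   <- rlm(y_out ~ ns(x, df = 5))

# Largest change in the fitted curve caused by the outlier
shift_loess <- max(abs(fitted(lo_out) - fitted(lo_clean)))
shift_rlm   <- max(abs(fitted(rl_out) - fitted(rl_clean)))

c(loess = shift_loess, rlm = shift_rlm)
```

In this setup the loess curve is pulled toward the outlier much more than the rlm curve is.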
rlm fits

Density plots
- Contours
- use geom_density_2d() (or stat_density_2d()) to fit a two-dimensional kernel to the density
- Hexes
- use geom_hex to plot densities using hexes
- this can also be done using rectangles for data with more discrete values
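The three options can be sketched as follows, using the built-in faithful data as a stand-in. The bin counts are arbitrary. Note that drawing geom_density_2d needs the MASS package and drawing geom_hex needs the hexbin package to be installed.

```r
library(ggplot2)

base_plot <- ggplot(faithful, aes(waiting, eruptions))

# Contours of a two-dimensional kernel density estimate
p_contour <- base_plot + geom_point(alpha = 0.3) + geom_density_2d()

# Counts in hexagonal bins (requires the hexbin package when drawn)
p_hex <- base_plot + geom_hex(bins = 20)

# Counts in rectangular bins, often a better match for discrete values
p_rect <- base_plot + geom_bin2d(bins = 20)
```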
Contours

Hexes

Color principles
Multiple dimensions
Three-dimensional data are a lot like two-dimensional data with densities: contour plots are good
Pairs plots: pairs, ggpairs
Pairs example
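A minimal sketch with the built-in iris data (an assumed stand-in for the original example). pairs is in base R; ggpairs comes from the GGally package, so that part is skipped if GGally is not installed.

```r
# Numeric columns only
num_vars <- iris[, 1:4]

# Base-graphics scatter-plot matrix, colored by species
pairs(num_vars, col = as.integer(iris$Species))

# ggplot2-style version, with densities on the diagonal
if (requireNamespace("GGally", quietly = TRUE)) {
  print(GGally::ggpairs(iris, columns = 1:4,
                        mapping = ggplot2::aes(colour = Species)))
}
```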

Multiple factors
Use boxplots and violin plots
Make use of facet_wrap and facet_grid
Use different combinations (e.g., try plots with the same info, but different factors on the axes vs. in the colors or the facets)
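As a sketch of trying different combinations, here are three views of the same information from the built-in ToothGrowth data (an assumed stand-in): tooth length by supplement type and dose. The object names are illustrative.

```r
library(ggplot2)

# Treat dose as a factor so it can go on an axis, in a color, or in a facet
tg <- transform(ToothGrowth, dose = factor(dose))

# Dose on the x axis, supplement in the fill color
p_axis <- ggplot(tg, aes(x = dose, y = len, fill = supp)) +
  geom_boxplot()

# Supplement on the x axis, dose in the facets
p_wrap <- ggplot(tg, aes(x = supp, y = len)) +
  geom_violin() +
  facet_wrap(~ dose, labeller = label_both)

# Both factors in the facets: rows = supplement, columns = dose
p_grid <- ggplot(tg, aes(x = "", y = len)) +
  geom_boxplot() +
  facet_grid(supp ~ dose) +
  labs(x = NULL)
```

Each layout makes a different comparison easiest: p_axis puts the two supplements side by side at each dose, while p_wrap emphasizes the supplement contrast within each dose.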