Exploring data
Rote analysis vs. snooping
Spurious correlations
There’s a whole website about this
What can you do?
The best you can
Identify scientific questions
Distinguish between exploratory and confirmatory analysis
Pre-register studies when possible
Keep an exploration and analysis journal
Explore predictors and responses separately at first
Individual variables
Orchard data
A standard plot
A terrible plot
What does this one mean?
A non-parametric plot
Bike example
Just the means
Standard errors
Standard deviations (2 sd, in fact)
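The difference between the last three plots can be sketched in a few lines. The bike data are not reproduced here, so this sketch assumes R's built-in OrchardSprays data as a stand-in (it may or may not be the orchard data used earlier); the names summ, p_se and p_sd are illustrative.

```r
library(ggplot2)

# Assumption: OrchardSprays (built into R) stands in for the lecture data.
# One row per treatment: mean, standard error, and standard deviation.
summ <- do.call(rbind, lapply(
  split(OrchardSprays, OrchardSprays$treatment),
  function(d) {
    data.frame(
      treatment = d$treatment[1],
      mean = mean(d$decrease),
      se = sd(d$decrease) / sqrt(nrow(d)),
      sd = sd(d$decrease)
    )
  }
))

# Bars of +/- 1 standard error: uncertainty about the mean
p_se <- ggplot(summ, aes(treatment, mean)) +
  geom_pointrange(aes(ymin = mean - se, ymax = mean + se))

# Bars of +/- 2 standard deviations: spread of the data themselves
p_sd <- ggplot(summ, aes(treatment, mean)) +
  geom_pointrange(aes(ymin = mean - 2 * sd, ymax = mean + 2 * sd))
```

Printing p_se and p_sd side by side shows how much narrower the standard-error bars are, and why a plot should say which one it shows.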
Data shape
Shape and weight
Log scales
- In general:
- If your logged data span < 3 decades, use human-readable numbers (e.g., 10-5000 kilotons per hectare)
- If not, just embrace "logs" (log10 particles per µl is from 3–8)
- But remember these are not physical values
- I love natural logs, but not as axis values
- Except to represent proportional difference!
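A minimal sketch of these choices, assuming ggplot2 and made-up data (the data frame dd and its values are hypothetical, chosen only to span fewer than three decades):

```r
library(ggplot2)
set.seed(101)

# Hypothetical data running from 10 to about 5000
dd <- data.frame(x = 1:50, y = 10^runif(50, min = 1, max = 3.7))

# Fewer than 3 decades: log-scale the axis, but label it with
# human-readable values on the original scale
p_readable <- ggplot(dd, aes(x, y)) +
  geom_point() +
  scale_y_log10(breaks = c(10, 100, 1000, 5000))

# Wider ranges: plot the log10 values directly, and say so in the label
p_logged <- ggplot(dd, aes(x, log10(y))) +
  geom_point() +
  labs(y = "log10(y)")

# Natural logs as proportional difference: a 5% increase is a
# difference of about 0.049 on the natural-log scale
prop_diff <- log(1.05)
```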
Bivariate data
Banking
- Banking is a real thing
- Even though many examples are bogus
- Since the point is to make patterns visually clear, trial and error is usually as good as an algorithm
Sunspots
Code (with built-in data)
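The original code for this slide is not reproduced here. A minimal sketch of banking by trial and error, assuming the built-in sunspot.year series and ggplot2 (the aspect ratio of 0.1 is an arbitrary starting guess, not a computed value):

```r
library(ggplot2)

# sunspot.year is a yearly time series shipped with R (1700-1988)
ss <- data.frame(
  year = as.numeric(time(sunspot.year)),
  spots = as.numeric(sunspot.year)
)

# Default aspect ratio
p_default <- ggplot(ss, aes(year, spots)) + geom_line()

# Wide, short panel: flattens the steep line segments so the shape
# of each cycle is easier to judge by eye
p_banked <- p_default + theme(aspect.ratio = 0.1)
```

Try a few values of aspect.ratio and keep the one that makes the rise and fall of each cycle easiest to compare.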
Is smoking good for you?
Smoking data
Scatter plots
Scatter plot
Seeing the density better
Seeing the density worse
Maybe fixed
A loess trend line
Two loess trend lines
Many loess trend lines
Theory of loess
Robust methods
- Loess is local, but not robust
- Uses least squares, can respond strongly to outliers
- R has a very flexible function called rlm (in the MASS package) to do robust fitting
- Not local
- But can be combined with splines
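A minimal sketch of the contrast, assuming simulated data with a few gross outliers (the data, the choice of a natural spline via splines::ns, and df = 4 are all illustrative assumptions):

```r
library(MASS)     # rlm
library(splines)  # ns

set.seed(101)
dd <- data.frame(x = seq(0, 10, length.out = 100))
dd$y <- sin(dd$x / 2) + rnorm(100, sd = 0.2)
dd$y[20:22] <- 10  # three gross outliers (made up for illustration)

# Local but not robust: least squares within each neighbourhood
fit_loess <- loess(y ~ x, data = dd)

# Robust but not local: a spline basis supplies the flexibility
fit_rlm <- rlm(y ~ ns(x, df = 4), data = dd)

dd$loess <- predict(fit_loess)
dd$rlm <- predict(fit_rlm)

# rlm reports the weight it ended up giving each point;
# the outliers should be strongly down-weighted
outlier_weights <- fit_rlm$w[20:22]
```

Plotting dd$loess and dd$rlm against dd$x shows how far each curve is pulled toward the outliers.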
rlm fits
Density plots
- Contours
- use _density_2d() to fit a two-dimensional kernel to the density
- Hexes
- use geom_hex to plot densities using hexes
- this can also be done using rectangles for data with more discrete values
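The three options above can be sketched as follows, assuming ggplot2 and simulated data (dd is hypothetical; geom_density_2d needs the MASS package and geom_hex needs the hexbin package when the plots are drawn):

```r
library(ggplot2)
set.seed(101)

# Hypothetical correlated data, dense enough that a plain
# scatter plot would overplot
dd <- data.frame(x = rnorm(5000))
dd$y <- 0.5 * dd$x + rnorm(5000)

# Contours of a two-dimensional kernel density estimate
p_contour <- ggplot(dd, aes(x, y)) + geom_density_2d()

# Counts in hexagonal bins
p_hex <- ggplot(dd, aes(x, y)) + geom_hex(bins = 30)

# Counts in rectangular bins, useful for more discrete values
p_rect <- ggplot(dd, aes(x, y)) + geom_bin2d(bins = 30)
```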
Contours
Hexes
Color principles
Multiple dimensions
Three-dimensional data are a lot like two-dimensional data with densities: contour plots work well
Pairs plots: pairs, ggpairs
Pairs example
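The original example is not reproduced here. A minimal sketch, assuming the built-in iris data as a stand-in and treating the GGally package (which provides ggpairs) as optional:

```r
# Assumption: iris stands in for the lecture's example data
vars <- iris[, 1:4]

# Base R: scatter plot matrix, coloured by species
pairs(vars, col = iris$Species)

# ggpairs adds densities and correlations; it lives in the
# GGally package, so only run it if that package is installed
if (requireNamespace("GGally", quietly = TRUE)) {
  p_pairs <- GGally::ggpairs(iris, mapping = ggplot2::aes(colour = Species))
}
```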
Multiple factors
Use boxplots and violin plots
Make use of facet_wrap and facet_grid
Use different combinations (e.g., try plots with the same info, but different factors on the axes vs. in the colors or the facets)
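A minimal sketch of trying different combinations, assuming the mpg data that ship with ggplot2 as a stand-in (the choice of variables is arbitrary):

```r
library(ggplot2)

# Same information three ways: highway mileage by drive type,
# model year, and number of cylinders

# Drive type on the axis, year in the colors, cylinders in the facets
p_a <- ggplot(mpg, aes(drv, hwy, fill = factor(year))) +
  geom_boxplot() +
  facet_wrap(~cyl)

# Year on the axis, drive type in the colors, cylinders in the facets
p_b <- ggplot(mpg, aes(factor(year), hwy, fill = drv)) +
  geom_boxplot() +
  facet_wrap(~cyl)

# Violins instead of boxes, with a two-way grid of facets
p_c <- ggplot(mpg, aes(drv, hwy)) +
  geom_violin() +
  facet_grid(year ~ cyl)
```

Each arrangement makes a different comparison easy: whatever shares an axis within a panel is easiest to compare.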