Packages

library(tidyverse)
theme_set(theme_bw())
library(car)
library(broom)
library(broom.mixed)
library(magrittr)
## modeling
library(lme4)
library(MCMCglmm)
library(glmmTMB)
library(coda) ## Bayesian methods (trace plots etc.)
library(lattice) ## built-in
library(cowplot)
library(datasauRus)
library(nullabor) ## visual inference

Anscombe’s quartet

Anscombe’s quartet (2)
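
The quartet is built into R as datasets::anscombe; a minimal sketch (not from the slides) that reshapes it and draws the four panels:

anscombe_long <- (datasets::anscombe
    %>% pivot_longer(everything(),
                     names_to=c(".value","set"),
                     names_pattern="(.)(.)"))  ## split "x1" into value "x", set "1"
ggplot(anscombe_long,aes(x,y))+
    geom_point()+
    geom_smooth(method="lm",se=FALSE)+  ## (nearly) identical fitted lines in all four panels
    facet_wrap(~set)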

“Datasaurus”

Matejka and Fitzmaurice (2017)
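
The corresponding data ship with the datasauRus package (loaded above); a minimal sketch (not from the slides) showing that the summary statistics are nearly identical while the point clouds differ:

(datasauRus::datasaurus_dozen
    %>% group_by(dataset)
    %>% summarise(mean_x=mean(x), mean_y=mean(y),
                  sd_x=sd(x), sd_y=sd(y), cor_xy=cor(x,y)))
ggplot(datasauRus::datasaurus_dozen,aes(x,y))+
    geom_point(size=0.5)+
    facet_wrap(~dataset)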

Diagnostics: goals and ideas

  • detect model failure
  • display badness of fit
  • fast/convenient
  • residuals emphasize deviation, hide fitted pattern

Diagnostics: principles

  • diagnosis is exploration
  • avoid making decisions based on p-values
  • judge by eye (?¿?¿?¿?¿)
  • not “are my data (linear|normal|heteroscedastic)?”;
    rather, “how much do the violations change my conclusions?”

always do diagnostics after fitting the model

  • we are interested in conditional, not marginal, distributions
  • what does this mean? the model’s assumptions apply to the distribution of the response around the fitted values (equivalently, to the residuals), not to the raw response or the predictors

marginal distributions

quantile plots
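
The data frame dd is defined earlier in the notes; a hypothetical stand-in (skewed predictor, linear signal, Gaussian noise) so that this chunk is self-contained:

set.seed(101)
dd <- tibble(x=rgamma(200,shape=1.5),
             y=1+2*x+rnorm(200,sd=0.5))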

m0 <- lm(y~x,dd)
a0 <- (augment(m0) ## broom: 'tidy' predictions/resids/etc
  %>% select(x,y,.resid)
  %>% pivot_longer(everything(), names_to="type",values_to="value")
  %>% mutate(type=factor(type,levels=c("x","y",".resid")))
)

(ggplot(a0,aes(sample=value))
  + stat_qq()
  + facet_wrap(~type)
  + stat_qq_line(colour="red")
)

model diagnosis

look for mis-specification (in order!):

  • mean model (bias) >
  • variance model (heteroscedasticity) >
  • distributional model (e.g. non-normality)

influential points/groups (leverage/outliers/etc.)

upstream problems affect downstream diagnostics

bias

  • typically due to underfitting: missing patterns
  • e.g. nonlinearity
  • examine residuals rather than predicted values

bias: residuals vs fitted plot

m1 <- lm(price~carat,diamonds)
a1 <- augment(m1,data=diamonds) ## include original data
ggplot(a1,aes(.fitted,.resid)) +
    geom_point(alpha=0.1)+geom_smooth()

bias 2: add faceting/colouring

ggplot(a1,aes(.fitted,.resid,colour=cut)) +
    facet_wrap(~clarity) +
    geom_point(alpha=0.4)+geom_smooth(se=FALSE)

interactive graphics are useful for identifying individual points, e.g. ggmap::gglocator() (may need devtools::install_github("dkahle/ggmap"))

bias: solutions

  • fix the model
  • add covariates and interactions
  • transform predictors and/or responses (acepack::avas, Tibshirani (1987))
  • model expansion (see the sketch after this list)
    • polynomials
    • splines (regular or penalized)
    • nonlinear models
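
A hedged sketch (not from the slides) of model expansion for the diamonds fit above, using a polynomial and a natural spline in carat:

m1_poly <- update(m1, . ~ poly(carat,2))            ## quadratic in carat
m1_ns   <- update(m1, . ~ splines::ns(carat,df=5))  ## natural cubic spline
a1_ns <- augment(m1_ns,data=diamonds)
ggplot(a1_ns,aes(.fitted,.resid))+
    geom_point(alpha=0.1)+geom_smooth()             ## is the bias reduced?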

heteroscedasticity

  • linear models
    • loss of efficiency (the OLS fit is still unbiased, but no longer minimum-variance)
    • inferential problems: standard errors and tests are unreliable (Quinn and Keough (2002) p. 193)
  • nonlinear models
    • causes bias

heteroscedasticity diagnostics

  • scale-location plot
  • typically use \(\sqrt{|r_i|}\):
    • absolute value shows trend
    • square root decreases skewness
  • use standardized residuals
    (adjust variance for position)

m2 <- lm(dist ~ speed, data=cars)
ggplot(augment(m2),aes(.fitted,sqrt(abs(.std.resid))))+
    geom_point()+geom_smooth()

heteroscedasticity solutions

  • transformation (Tibshirani 1987)
  • explicitly model heteroscedasticity
    e.g. generalized least squares, GLMs
  • robust variance-covariance estimation
    (e.g. sandwich package: Zeileis (2006))
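
Hedged sketches (not from the slides) of the last two options, applied to the cars fit m2 (lmtest is an extra assumption here):

## model the variance explicitly: generalized least squares with a power-of-the-mean variance function
m2_gls <- nlme::gls(dist~speed, data=cars, weights=nlme::varPower())
## keep the OLS fit, but use a heteroscedasticity-consistent ("sandwich") covariance matrix
lmtest::coeftest(m2, vcov=sandwich::vcovHC(m2, type="HC3"))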

outliers

  • plots of leverage or “hat value” (potential influence) and Cook’s distance (influence)
  • influence plots

influence plots

ii <- car::influencePlot(m2)

outliers: solutions

  • exclusion (!!)
  • robust methods
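
For example (a hedged sketch, not from the slides), a robust refit of the cars model; MASS::rlm() downweights observations with large residuals:

m2_rob <- MASS::rlm(dist~speed, data=cars)  ## M-estimation (Huber weights by default)
summary(m2_rob)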

distributional assumptions

  • least important
  • quantile-quantile plots

histograms

quantile plots

  • ggplot: stat_qq(), stat_qq_line()
  • base R: plot.lm(.,which=2); qqnorm()
  • car::qqPlot (adds confidence envelope)

distributional solutions

  • transformation (avas, Box-Cox (MASS::boxcox), Yeo-Johnson etc. [?car::bcPower]); see the sketch after this list
  • GLMs
  • maximum likelihood estimation
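
A hedged sketch (not from the slides) of the transformation helpers, applied to the cars fit:

MASS::boxcox(m2)                           ## profile the Box-Cox lambda (response must be positive)
car::powerTransform(m2, family="yjPower")  ## Yeo-Johnson: also handles zero/negative responses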

correlation

rarely tested! can’t detect without some kind of structure in data

  • autocorrelation plots from residuals (sketch after this list)
  • grouped autocorrelation: use gls() on residuals
  • spatial autocorrelation: semivariance plot
  • or look at maps of residuals with size=abs(.resid), colour=sign(.resid) (or colour ramp)
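
A minimal sketch (not from the slides) of a residual autocorrelation check for the cars fit, pretending the rows are in time order:

acf(residuals(m2))         ## autocorrelation plot of the residuals
car::durbinWatsonTest(m2)  ## for reference only; the advice above is not to decide by p-value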

binary data

  • residuals for count data only \(\approx\) Normal for large counts
  • add smooths or average of grouped data

Fit:

library(lme4)
data(Contraception,package="mlmRev")
Contraception <- Contraception %>%
    mutate(ch=factor(livch != 0, labels = c("N", "Y")))
m3 <- glmer(use ~ age * ch + I(age^2) + urban + (1 | urban:district),
            data=Contraception, family=binomial)

plot

a3 <- augment(m3,data=Contraception,type.residuals="response")
gg_bin1 <- (ggplot(a3,aes(.fitted,.resid))+
            geom_point()+ geom_smooth(method="loess"))
print(gg_bin1)

grouping

get_mid <- function(x) {
    ## recover interval midpoints from cut() labels such as "(0.2,0.35]"
    cc <- as.character(x)
    lo <- as.numeric(gsub("[\\(\\[]([[:digit:].-]+).*","\\1",cc))  ## lower endpoint
    hi <- as.numeric(gsub(".*,([[:digit:].-]+)[])]","\\1",cc))     ## upper endpoint
    return((lo+hi)/2)
}
(a3
    %>% mutate(.fit_cut=cut_number(.fitted,20))
    %>% group_by(.fit_cut)
    %>% summarise(.resid=mean(.resid))
    %>% ungroup
    %>% mutate(.fitted=get_mid(.fit_cut))
) -> a3_sum

plot with grouping

gg_bin1+geom_point(data=a3_sum,colour="blue")

ggplot(a3,aes(.fitted,.resid,colour=livch,shape=urban,linetype=urban))+
            geom_point()+ geom_smooth(se=FALSE)+
    scale_colour_brewer(palette="Dark2")

keep trying …

ggplot(a3,aes(age,.resid,colour=urban))+
    geom_point()+
    geom_smooth(method="loess")+
    facet_wrap(~livch)

  • loess too bumpy?

ggplot(a3,aes(age,.resid,colour=urban))+
    geom_point()+
    geom_smooth(method="loess",
                method.args=list(family="symmetric"),span=1)+
    facet_wrap(~livch)

  • try method="gam" ?
ggplot(a3,aes(age,.resid,colour=urban))+
    geom_point()+
    geom_smooth(method="gam",formula =y ~ s(x, k=25)) +
    facet_wrap(~livch)

DHARMa

rr <- DHARMa::simulateResiduals(m3)
plot(rr)

likelihood profiles

use \(\sqrt{-2(\log L - \log L_0)}\), where \(L_0\) is the maximized likelihood (\(\sf V\)-shaped), or its signed square root (a straight line if the profile is quadratic/symmetric)


MCMC

  • trace plots - should look like white noise, with no trend …
lattice::xyplot(m4$Sol,aspect="fill",layout=c(2,3))  ## m4: an MCMCglmm fit from earlier; $Sol holds the posterior samples of the fixed ("location") effects

more MCMC diagnostics

  • comparing behaviour of different chains (Säilynoja, Bürkner, and Vehtari 2021); see the coda sketch below
  • checking whether a given sampler correctly reconstructs simulated data (Talts et al. 2020)
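
A hedged sketch using coda (loaded above); m4_chains is a hypothetical mcmc.list holding several independent runs of the same MCMCglmm model:

coda::gelman.diag(m4_chains)    ## potential scale reduction factor (R-hat) across chains
coda::effectiveSize(m4_chains)  ## effective sample size per parameter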

complex models

Wickham et al. (2010); Gelman (2004); Buja et al. (2009)

simdat <- (simulate(m2, nsim=8)  ## 8 simulated response vectors from the fitted model
    %>% data.frame(speed=cars$speed)
    %>% pivot_longer(-speed, names_to="sample", values_to="dist"))
ddsim <- (cars
    %>% select(dist,speed)
    %>% mutate(sample="true")
    %>% bind_rows(simdat))
ddsimplot <- ggplot(ddsim,aes(speed,dist))+geom_point()+
    facet_wrap(~sample)
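
Print the lineup and see whether the real data stand out (here the real panel is labelled "true"). As a hedged alternative, not from the slides, the nullabor package loaded above builds blinded lineups from permutation nulls:

print(ddsimplot)
ggplot(nullabor::lineup(nullabor::null_permute("dist"), true=cars, n=9),
       aes(speed,dist))+
    geom_point()+
    facet_wrap(~.sample)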

references

Buja, A., D. Cook, H. Hofmann, M. Lawrence, E.-K. Lee, D. F. Swayne, and H. Wickham. 2009. “Statistical Inference for Exploratory Data Analysis and Model Diagnostics.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 (1906): 4361–83. https://doi.org/10.1098/rsta.2009.0120.

Gelman, Andrew. 2004. “Exploratory Data Analysis for Complex Models.” Journal of Computational and Graphical Statistics 13 (4): 755–79. https://doi.org/10.1198/106186004X11435.

Matejka, Justin, and George Fitzmaurice. 2017. “The Datasaurus Dozen - Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics Through Simulated Annealing.” In ACM Sigchi Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3025453.3025912.

Quinn, Gerry P., and Michael J. Keough. 2002. Experimental Design and Data Analysis for Biologists. Cambridge, England: Cambridge University Press.

Säilynoja, Teemu, Paul-Christian Bürkner, and Aki Vehtari. 2021. “Graphical Test for Discrete Uniformity and Its Applications in Goodness of Fit Evaluation and Multiple Sample Comparison.” arXiv:2103.10522 [Stat], March. http://arxiv.org/abs/2103.10522.

Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2020. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv:1804.06788 [Stat], October. http://arxiv.org/abs/1804.06788.

Tibshirani, Rob. 1987. “Estimating Optimal Transformations for Regression.” Journal of the American Statistical Association 83: 394.

Wickham, H., D. Cook, H. Hofmann, and Andreas Buja. 2010. “Graphical Inference for Infovis.” IEEE Transactions on Visualization and Computer Graphics 16 (6): 973–79. https://doi.org/10.1109/TVCG.2010.161.

Zeileis, Achim. 2006. “Object-Oriented Computation of Sandwich Estimators.” Journal of Statistical Software 16 (9): 1–16. http://www.jstatsoft.org/v16/i09/.