September 2021

Course structure

Course goals

General introduction to data viz principles and tools

Course structure

  • lectures
  • in-class work
  • homework
  • student projects & presentations

Tools

Version control

  • Git: distributed version control system
  • GitHub: hosting service for Git repositories (acts as the central server)
    • alternatives: Bitbucket, GitLab, …
  • Git clients: software for working with Git on your computer
    • command-line (e.g. git add foo.rmd)
    • RStudio
    • others (GitHub Desktop etc.)

Basic Git workflow with RStudio

  • create repository on GitHub
  • copy repository to local machine
    • git clone
    • RStudio: File > New Project > Version Control > Git > paste the repository URL copied from the “Clone” button on GitHub

  • repeat:
    • pull (fetch and integrate changes from GH, git pull)
      • RStudio: Git panel > click blue down-arrow
    • do stuff (create, edit files, etc.)
    • stage (git add)
      • RStudio: Git panel > tick the “Staged” checkbox next to each changed file
    • commit (git commit)
      • RStudio: Git panel > click “Commit” icon >
        enter commit message > click “Commit” button (ignore the “Amend previous commit” checkbox!)
    • push (git push)
      • RStudio: Git panel > click green up-arrow

tidyverse

  • set of R packages: https://www.tidyverse.org/
  • advantages
    • expressiveness
    • speed
    • new hotness
  • disadvantages
    • minor incompatibilities with base R
    • rapid evolution
    • non-standard evaluation

tidyverse: big ideas

  • new verbs
  • piping
  • tibbles

tidyverse: new verbs

  • filter(x,condition): choose rows
    • equivalent to subset(x,condition) or x[condition,], but with non-standard evaluation
  • select(x,condition): choose columns
    • equivalent to subset(x,select=condition) or x[,condition]
    • helper functions such as starts_with(), matches()
  • mutate(x,var=...): change or add variables (equivalent to x$var = ... or transform(x,var=...))
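
The three verbs above can be compared side by side with their base R counterparts. This is a minimal sketch: the data frame d and its columns are made up for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

## toy data (hypothetical, for illustration only)
d <- data.frame(
    name   = c("a", "b", "c", "d"),
    score  = c(55, 72, 90, 64),
    s_time = c(10, 12, 9, 15)
)

## choose rows: tidyverse vs. base R
d_filt  <- filter(d, score > 60)
d_filt0 <- d[d$score > 60, ]

## choose columns, using a helper function
## (starts_with("s") matches score and s_time, but not name)
d_sel  <- select(d, starts_with("s"))
d_sel0 <- d[, c("score", "s_time")]

## add a variable: tidyverse vs. base R
d_mut  <- mutate(d, pass = score >= 60)
d_mut0 <- transform(d, pass = score >= 60)
```

Note that filter() and mutate() refer to score directly (non-standard evaluation), while the bracket version needs d$score.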

tidyverse: split-apply-combine

  • group_by(): adds grouping information
  • summarise(): collapses each group to a single row of summary values
  • e.g.
x <- group_by(x,course)
summarise(x,mean_score=mean(score),sd_score=sd(score))
  • equivalent to
d_split <- split(d,d$var)       ## split
d_proc <- lapply(d_split, ...)  ## apply
d_res <- do.call(rbind,d_proc)  ## combine
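
A runnable version of both approaches, as a sketch only: the data frame is made up, and the function passed to lapply() is one possible way to fill in the elided "apply" step, not necessarily what was intended above.

```r
library(dplyr)

## toy data (hypothetical, for illustration only)
d <- data.frame(
    course = c("stats", "stats", "bio", "bio"),
    score  = c(80, 90, 60, 70)
)

## tidyverse version
d_tidy <- summarise(group_by(d, course),
                    mean_score = mean(score),
                    sd_score = sd(score))

## base R version
d_split <- split(d, d$course)                ## split
d_proc <- lapply(d_split, function(g) {      ## apply (assumed summary function)
    data.frame(course = g$course[1],
               mean_score = mean(g$score),
               sd_score = sd(g$score))
})
d_res <- do.call(rbind, d_proc)              ## combine
```

Both versions return one row per course, with the groups in alphabetical order.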

tidyverse: piping

  • new %>% operator (orig. from magrittr package)
  • directs result of previous operation to next function, as first argument
  • e.g.
(d_input
    %>% select(col1,col2)
    %>% filter(cond1,cond2)
    %>% mutate(...)
) -> d_output
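
To see what the pipe buys, the same pipeline can be written as nested calls and as a chain. A sketch using the built-in mtcars data; the particular columns, conditions and the wt_kg variable are arbitrary choices for illustration.

```r
library(dplyr)

## nested calls: have to be read from the inside out
d_nested <- mutate(
    filter(select(mtcars, mpg, cyl, wt), cyl == 4, mpg > 30),
    wt_kg = wt * 1000 * 0.4536   ## wt is recorded in units of 1000 lb
)

## piped version: reads top to bottom, one step per line
(mtcars
    %>% select(mpg, cyl, wt)
    %>% filter(cyl == 4, mpg > 30)
    %>% mutate(wt_kg = wt * 1000 * 0.4536)
) -> d_piped
```

The surrounding parentheses are what allow each line to start with %>%; without them R would treat the first line as a complete expression.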

tidyverse: tibbles

  • extension of data frames
  • differences
    • printing
      • by default prints only the first 10 rows and the columns that fit on screen
      • labels columns by type
    • no rownames
    • never drops dimensions (tib[,"column1"] is still a tibble)
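
The dimension-dropping difference is easy to check directly. A small sketch, assuming the tibble package is installed; the data are made up.

```r
library(tibble)

df  <- data.frame(column1 = 1:3, column2 = c("a", "b", "c"))
tib <- as_tibble(df)

## single-column indexing: a data frame drops to a vector, a tibble does not
x_df  <- df[, "column1"]
x_tib <- tib[, "column1"]

is.data.frame(x_df)    ## FALSE: plain integer vector
is_tibble(x_tib)       ## TRUE: still a one-column tibble

## to pull the vector out of a tibble, use [[ ]] or $
v <- tib[["column1"]]
```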

tidyverse: reshaping (tidyr package)

  • pivot_longer(data,cols, names_to, values_to)
    • wide to long
    • older equivalents: reshape2::melt(); gather() in tidyr before v1.0
  • pivot_wider(data,names_from, values_from)
    • long to wide
    • older equivalents: reshape2::dcast(); spread() in tidyr before v1.0
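
A round trip between the two shapes, as a minimal sketch: the data frame and the column names test and score are made up for illustration, and tidyr (v1.0 or later) is assumed.

```r
library(tidyr)

## toy wide-format data (hypothetical): one row per student, one column per test
d_wide <- data.frame(
    student = c("a", "b"),
    test1   = c(70, 80),
    test2   = c(75, 85)
)

## wide to long: one row per student-test combination
d_long <- pivot_longer(d_wide, cols = c(test1, test2),
                       names_to = "test", values_to = "score")

## long to wide: back to one column per test
d_wide2 <- pivot_wider(d_long, names_from = test, values_from = score)
```

Both functions return tibbles, so d_wide2 holds the same values as d_wide but is a tibble rather than a plain data frame.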

types of data visualization

exploratory

  • find patterns in data, explore hypotheses
  • emphasize robust approaches
  • minimize (parametric) assumptions
  • John Tukey, William Cleveland (1993)

diagnostic

  • evaluate assumptions of a model
    • unbiasedness/goodness of fit
    • homoscedasticity
    • normality
  • easily spot deviations
  • identify outliers and influential points
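
Base R's plot method for lm objects gives a quick set of diagnostic plots covering these points. A sketch with the built-in mtcars data; the model itself is an arbitrary choice for illustration.

```r
## simple linear model on a built-in data set
fit <- lm(mpg ~ wt, data = mtcars)

## four standard diagnostic plots in a 2x2 layout:
##   1 = residuals vs. fitted (bias/goodness of fit)
##   2 = normal Q-Q plot (normality)
##   3 = scale-location (homoscedasticity)
##   5 = residuals vs. leverage (outliers, influential points)
op <- par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
par(op)

## Cook's distance as a numerical measure of influence
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)
```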

inferential

  • coefficient plots (e.g. dotwhisker package)
  • replacement for tables (Gelman et al. 2002)
  • also: tests of inference (Wickham et al. 2010)
  • Andrew Gelman
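
A coefficient plot can also be built by hand, which shows what packages such as dotwhisker automate. This is a sketch under assumptions: the model is arbitrary, ggplot2 (3.3.0 or later) is assumed, and the predictors are left unscaled, so the interval widths are not directly comparable across terms.

```r
library(ggplot2)

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

## point estimates and 95% confidence intervals
ci <- confint(fit)
d_coef <- data.frame(
    term     = names(coef(fit)),
    estimate = unname(coef(fit)),
    lower    = ci[, 1],
    upper    = ci[, 2]
)

## drop the intercept, which is rarely of interest in a coefficient plot
d_coef <- d_coef[d_coef$term != "(Intercept)", ]

gg_coef <- (ggplot(d_coef, aes(x = estimate, y = term))
    + geom_vline(xintercept = 0, linetype = 2)   ## reference line at zero
    + geom_pointrange(aes(xmin = lower, xmax = upper))
)
print(gg_coef)
```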

expository: data-viz

  • tell an accurate story
  • high information density
  • Cleveland, Edward Tufte

presentation: info-viz

  • grab attention/engage/sell/entertain
  • “puzzle” graphics

dashboards

  • present a quick overview of a data set
  • user control
  • business-oriented

dynamic

  • time dimension
  • engage
  • allow viewer to drill down
  • Dianne Cook

References