Oct 2021

library(factoextra)    # extract and visualize PCA/clustering results
library(pheatmap)      # annotated heatmaps
library(directlabels)  # direct labels for ggplot2 figures
library(GGally)        # scatterplot matrices (ggpairs)
library(vegan)         # ordination: dissimilarities, NMDS
library(andrews)       # Andrews curves

Data

We usually think of high-dimensional data as consisting of multiple measures on a group of samples:

  • Number of “reads” of different bacterial proteins from a set of soil samples
  • Decathlon scores from different competitors
  • Health measures from different children

Here “dimension” means the number of variables.

The variables may be divided into predictors and responses.

Types of measures

Many scientists traditionally think of high-dimensional data as having parallel, continuous measures:

  • read matrices from soil samples
  • these are the easiest to work with

These may be complemented by a smaller number of “metadata” variables, which may be more diverse in type (count, categorical, etc.):

  • environmental variables associated with soil samples

More and more datasets don’t follow this pattern:

  • the Canadian Longitudinal Study on Aging has a huge number of variables per person, with a wide mixture of types

Goals

  • typically looking for low-dimensional structure
    • clusters
    • surfaces/manifolds
  • exploratory? diagnostic? expository?

Approaches

  • Dimension reduction
  • glyphs (stars/radar charts/faces/Andrews plots; see the sketch after the figure below)
  • Comparing views
  • Linking and brushing
  • 3D (perspective or animation)

[figure: example radar chart]
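As a rough sketch of the glyph approach, assuming the decathlon2 data shipped with factoextra: base R’s stars() draws one star/radar glyph per row, and andrews() (from the package loaded above) draws each row as a Fourier curve.

data(decathlon2, package = "factoextra")
X <- scale(decathlon2[, 1:10])   # standardize the ten event scores
stars(X)                         # one star/radar glyph per athlete
andrews(as.data.frame(X))        # each athlete as an Andrews curve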

Duality

We study the rows (samples) using the columns (measures)

  • What do the observed proteins tell us about the functional relationships between different soil samples?
  • What does differential success in decathlon events tell us about the athletes?

But we can also do the opposite!

  • What do measurements across soil samples tell us about the functional relationships between proteins?
  • What does differential success of athletes tell us about the relationship between events?
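A minimal sketch of the duality, assuming the decathlon2 data from factoextra: the same matrix answers both questions, depending on which way we correlate.

data(decathlon2, package = "factoextra")
X <- as.matrix(scale(decathlon2[, 1:10]))
cor(X)      # relationships between events (columns)
cor(t(X))   # relationships between athletes (rows)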

Heatmaps

  • Value-based: better for seeing detail

  • Correlation-based: better for seeing groups
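A minimal sketch of both flavors with pheatmap, again assuming the decathlon2 data:

data(decathlon2, package = "factoextra")
X <- as.matrix(decathlon2[, 1:10])
pheatmap(X, scale = "column")   # value-based: per-cell detail, columns standardized
pheatmap(cor(X))                # correlation-based: groups of related events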

PCA

A beautiful decomposition based on the idea that data points are points in a Euclidean space

  • Need to think about scaling
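A minimal sketch with base R’s prcomp() and factoextra (decathlon2 assumed as above); scale. = TRUE standardizes each event so that no single unit of measurement dominates the components.

data(decathlon2, package = "factoextra")
p <- prcomp(decathlon2[, 1:10], scale. = TRUE)   # PCA on standardized events
fviz_pca_biplot(p)                               # athletes and events in one picture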

We can think of PCA as a decomposition (making observed points from idealized points):

  • And relax it by requiring non-negative combinations of non-negative components (NMF)
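A sketch of that relaxation using the NMF package (an assumption: it is not loaded above); all entries must be non-negative, which holds for raw times and distances.

library(NMF)                         # assumed extra dependency
data(decathlon2, package = "factoextra")
X <- as.matrix(decathlon2[, 1:10])   # non-negative event scores
res <- nmf(X, rank = 2)              # X ~ basis %*% coef, all non-negative
W <- basis(res)                      # sample weights on the components
H <- coef(res)                       # component profiles over the events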

Or we can think about it as minimizing distances:

  • And relax it with non-Euclidean distances (PCoA)
  • … or a rank-based approach (NMDS)
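A minimal sketch of both relaxations: base R’s cmdscale() performs PCoA on any dissimilarity, and vegan’s metaMDS() fits an NMDS that preserves only the rank order of the distances (decathlon2 assumed as above).

data(decathlon2, package = "factoextra")
d <- dist(scale(decathlon2[, 1:10]), method = "manhattan")   # a non-Euclidean distance
pcoa <- cmdscale(d, k = 2)          # PCoA: metric embedding of d
nmds <- metaMDS(d, trace = FALSE)   # NMDS: rank-based (vegan)
plot(nmds$points)                   # the two-dimensional NMDS map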

Accurate dataviz

To what extent can we make visual distances reflect data distances?

  • This is not the default in most applications
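One way to check, continuing the NMDS sketch above: vegan’s stressplot() draws a Shepard diagram comparing the data dissimilarities with the distances in the plot.

data(decathlon2, package = "factoextra")
d <- dist(scale(decathlon2[, 1:10]))
nmds <- metaMDS(d, trace = FALSE)
stressplot(nmds)   # data dissimilarity vs. distance in the ordination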