library(factoextra)
library(pheatmap)
library(directlabels)
library(GGally)
library(vegan)
library(andrews)
Data
We usually think of high-dimensional data as consisting of multiple measures on a group of samples:
- Number of “reads” of different bacterial proteins from a set of soil samples
- Decathlon scores from different competitors
- Health measures from different children
dimension means “number of variables”
may be divided into predictors and responses
Types of measures
Many scientists traditionally think of high-dimensional data as having parallel, continuous measures:
- read matrices from soil samples
- these are easiest
These may be complemented by a smaller number of “metadata” variables, which may be more diverse in type (count, categorical, etc.):
- environmental variables associated with soil samples
More and more datasets don’t follow this:
- Canadian longitudinal study on aging has a huge number of variables per person with a wide mixture of types
Goals
- typically looking for low-dimensional structure
- clusters
- surfaces/manifolds
- exploratory? diagnostic? expository?
Approaches
- Dimension reduction
- glyphs (stars/radar charts/faces/Andrews plots)
- Comparing views
- Linking and brushing
- 3D (perspective or animation)
radar chart
Duality
We study the rows (samples) using the columns (measures)
- What do the observed proteins tell us about the functional relationships between different soil samples?
- What does differential success in decathlon events tell us about the athletes?
But we can also do the opposite!
- What do measurements across soil samples tell us about the functional relationships between proteins?
- What does differential success of athletes tell us about the relationship between events?
Heatmaps
PCA
A beautiful decomposition based on the idea that data points are points in a Euclidean space
- Need to think about scaling
We can think about the PCA as a decomposition (making observed points from idealized points)
- And relax it by requiring non-negative combinations of non-negative components (NMF)
Or we can think about it as minimizing distances:
- And relax it with non-Euclidean distances (PCoA)
- … or a rank-based approach (NMDS)
Accurate dataviz
To what extent can we make visual distances reflect data distances?
- This is not the default in most applications