Oct 2021

library(factoextra)    # extract and visualize PCA/clustering results
library(pheatmap)      # annotated heatmaps
library(directlabels)  # direct labels for ggplot2 figures
library(GGally)        # scatterplot matrices (ggpairs)
library(vegan)         # ordination: dissimilarities, NMDS
library(andrews)       # Andrews curves

Data

We usually think of high-dimensional data as consisting of multiple measures on a group of samples:

  • Number of “reads” of different bacterial proteins from a set of soil samples
  • Decathlon scores from different competitors
  • Health measures from different children

Here “dimension” means the number of variables.

The variables may be divided into predictors and responses.

Types of measures

Many scientists traditionally think of high-dimensional data as having parallel, continuous measures:

  • read matrices from soil samples
  • these are the easiest to work with

These may be complemented by a smaller number of “metadata” variables, which may be more diverse in type (count, categorical, etc.):

  • environmental variables associated with soil samples

More and more datasets don’t follow this pattern:

  • the Canadian Longitudinal Study on Aging has a huge number of variables per person, with a wide mixture of types

Goals

  • typically looking for low-dimensional structure
    • clusters
    • surfaces/manifolds
  • exploratory? diagnostic? expository?

Approaches

  • Dimension reduction
  • glyphs (stars/radar charts/faces/Andrews plots; see the sketch after the figure below)
  • Comparing views
  • Linking and brushing
  • 3D (perspective or animation)

[figure: example radar chart]
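As a rough sketch of the glyph approach, assuming the decathlon2 data shipped with factoextra: base R’s stars() draws one star/radar glyph per row, and andrews() (from the package loaded above) draws each row as a Fourier curve.

data(decathlon2, package = "factoextra")
X <- scale(decathlon2[, 1:10])   # standardize the ten event scores
stars(X)                         # one star/radar glyph per athlete
andrews(as.data.frame(X))        # each athlete as an Andrews curve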

Duality

We study the rows (samples) using the columns (measures)

  • What do the observed proteins tell us about the functional relationships between different soil samples?
  • What does differential success in decathlon events tell us about the athletes?

But we can also do the opposite!

  • What do measurements across soil samples tell us about the functional relationships between proteins?
  • What does differential success of athletes tell us about the relationship between events?
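A minimal sketch of the duality, assuming the decathlon2 data from factoextra: the same matrix answers both questions, depending on which way we correlate.

data(decathlon2, package = "factoextra")
X <- as.matrix(scale(decathlon2[, 1:10]))
cor(X)      # relationships between events (columns)
cor(t(X))   # relationships between athletes (rows)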

Heatmaps

  • Value-based: better for seeing detail

  • Correlation-based: better for seeing groups
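A minimal sketch of both flavors with pheatmap, again assuming the decathlon2 data:

data(decathlon2, package = "factoextra")
X <- as.matrix(decathlon2[, 1:10])
pheatmap(X, scale = "column")   # value-based: per-cell detail, columns standardized
pheatmap(cor(X))                # correlation-based: groups of related events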

PCA

A beautiful decomposition based on the idea that data points are points in a Euclidean space

  • Need to think about scaling
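A minimal sketch with base R’s prcomp() and factoextra (decathlon2 assumed as above); scale. = TRUE standardizes each event so that no single unit of measurement dominates the components.

data(decathlon2, package = "factoextra")
p <- prcomp(decathlon2[, 1:10], scale. = TRUE)   # PCA on standardized events
fviz_pca_biplot(p)                               # athletes and events in one picture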

We can think of PCA as a decomposition (making observed points from idealized points):

  • And relax it by requiring non-negative combinations of non-negative components (NMF)
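A sketch of that relaxation using the NMF package (an assumption: it is not loaded above); all entries must be non-negative, which holds for raw times and distances.

library(NMF)                         # assumed extra dependency
data(decathlon2, package = "factoextra")
X <- as.matrix(decathlon2[, 1:10])   # non-negative event scores
res <- nmf(X, rank = 2)              # X ~ basis %*% coef, all non-negative
W <- basis(res)                      # sample weights on the components
H <- coef(res)                       # component profiles over the events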

Or we can think about it as minimizing distances:

  • And relax it with non-Euclidean distances (PCoA)
  • … or a rank-based approach (NMDS)
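A minimal sketch of both relaxations: base R’s cmdscale() performs PCoA on any dissimilarity, and vegan’s metaMDS() fits an NMDS that preserves only the rank order of the distances (decathlon2 assumed as above).

data(decathlon2, package = "factoextra")
d <- dist(scale(decathlon2[, 1:10]), method = "manhattan")   # a non-Euclidean distance
pcoa <- cmdscale(d, k = 2)          # PCoA: metric embedding of d
nmds <- metaMDS(d, trace = FALSE)   # NMDS: rank-based (vegan)
plot(nmds$points)                   # the two-dimensional NMDS map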

Accurate dataviz

To what extent can we make visual distances reflect data distances?

  • This is not the default in most applications
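One way to check, continuing the NMDS sketch above: vegan’s stressplot() draws a Shepard diagram comparing the data dissimilarities with the distances in the plot.

data(decathlon2, package = "factoextra")
d <- dist(scale(decathlon2[, 1:10]))
nmds <- metaMDS(d, trace = FALSE)
stressplot(nmds)   # data dissimilarity vs. distance in the ordination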