Intro

Goals

You should be able to

Representations

Numeric and character types are fairly straightforward, and you rarely have to worry about when and whether R represents things as integers or floating point. Very occasionally you will need to know that R is limited in its capacity to represent numbers of extremely large or small magnitude, and that floating-point computations are usually approximate rather than exact.
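A short sketch of what these limits look like in practice, using only base R. The printed values assume the usual IEEE 754 double-precision arithmetic, which is what R uses on essentially all current platforms.

```r
# Floating-point arithmetic is approximate: 0.1 + 0.2 is not stored exactly as 0.3
x <- 0.1 + 0.2
x == 0.3                     # FALSE
isTRUE(all.equal(x, 0.3))    # TRUE: all.equal() compares within a small tolerance

# Integers and doubles are different storage types
typeof(1L)                   # "integer"
typeof(1)                    # "double"

# Limits on magnitude and precision
.Machine$integer.max         # 2147483647, the largest integer R can store
.Machine$double.xmax         # largest finite double
.Machine$double.eps          # smallest positive x for which 1 + x != 1
2^1024                       # Inf: too large to represent
2^-1080                      # 0: too small to represent
```

When comparing computed numbers, all.equal() (or an explicit tolerance) is usually a safer choice than ==.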

You do need to know about factors, and to be aware when your variables are being treated as such.

Missing values

When you input data, you need to be aware of NA (“not available”). Functions such as read.table() and read.csv() have an argument called na.strings, which tells R which strings in your input file (a CSV file, for example) should be read as NA. You need to know that

Changing representations

R has a big suite of functions for creating, testing and changing representations. These have names like factor(), as.numeric() and is.character().
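As a minimal illustration of the three families (creating, testing, changing), here is a sketch with made-up values; the variable names are arbitrary.

```r
x <- c("1", "2", "three")
is.character(x)                        # TRUE: test the representation

# Change the representation; "three" cannot be parsed, so it becomes NA.
# as.numeric() would normally warn about this; the warning is silenced here.
y <- suppressWarnings(as.numeric(x))
is.numeric(y)                          # TRUE
is.na(y)                               # FALSE FALSE TRUE

f <- factor(c("low", "high", "low"))   # create a factor
is.factor(f)                           # TRUE
as.character(f)                        # back to "low" "high" "low"
```

In real work it is usually better to look at the warning from as.numeric() than to suppress it, since it points at values that did not convert.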

Factors

Factors are logically variables with a fixed number of categories. In R they are represented as integer variables that have a “levels” attribute. In other words, each possible value of the factor is given an integer, and the variable carries around the code that allows translation from this integer to its meaning.
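The integer-plus-levels representation can be inspected directly. The sketch below uses made-up values and also shows one well-known trap that follows from this representation.

```r
f <- factor(c("medium", "low", "high", "low"))
levels(f)        # "high" "low" "medium": sorted alphabetically by default
as.integer(f)    # 3 2 1 2: the underlying integer codes
typeof(f)        # "integer"
attributes(f)    # the "levels" attribute plus class "factor"

# Trap: a factor whose labels look like numbers
g <- factor(c("10", "20", "10"))
as.numeric(g)                  # 1 2 1: the codes, not the values
as.numeric(as.character(g))    # 10 20 10: convert via character first
```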

This system has advantages:

… and disadvantages:

As a general rule, convert variables to factors at the last possible time, once you have already fixed typos, combined different data sets, and so on. The tidyverse functions readr::read_csv(), readr::read_table() and readxl::read_xlsx() will not automatically convert strings to factors. In versions of R before 4.0.0, read.table() and read.csv() converted character data to factors by default; you could disable this with options(stringsAsFactors=FALSE) or by passing stringsAsFactors=FALSE to the read function. Since R 4.0.0 the default is stringsAsFactors=FALSE, so character data stay as character unless you ask otherwise.
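A minimal sketch of the "convert late" workflow in base R. The data are made up and read from an inline string so the example is self-contained; the column names are hypothetical.

```r
# Made-up CSV content containing a typo ("Nroth")
csv_text <- "id,site\n1,north\n2,Nroth\n3,south\n"

# stringsAsFactors = FALSE is spelled out so the behaviour does not depend on R version
d <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(d$site)                          # "character"

# Fix the typo while the column is still character ...
d$site[d$site == "Nroth"] <- "north"

# ... and only then convert to a factor
d$site <- factor(d$site)
levels(d$site)                         # "north" "south"
```

Had the column been converted to a factor first, "Nroth" would have become a third level that has to be merged or dropped afterwards.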

Examination

You should think creatively, and early on, about how to check your data. Is it internally consistent? Are there extreme outliers? Are there typos? Are there certain values that really mean something else?

An American Airlines memo about fuel reporting from the 1980s complained of multiple cases of:

You should think about what you can test, and what you can fix if it’s broken.
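One possible set of first checks, sketched in base R on a made-up data frame. The column names, the plausible age range, and the idea that -99 is a missing-value code are all assumptions made for the illustration.

```r
# Made-up data with a duplicated id, a case typo, an implausible age,
# and a sentinel value (-99) that really means "missing"
d <- data.frame(
  id  = c(1, 2, 3, 4, 4),
  sex = c("F", "M", "f", "M", "M"),
  age = c(34, 29, 290, -99, 41),
  stringsAsFactors = FALSE
)

str(d)                           # types and a first look at the values
summary(d$age)                   # the range exposes 290 and -99
table(d$sex, useNA = "ifany")    # the stray "f" shows up as its own category
anyDuplicated(d$id)              # 5: position of the first duplicated id (0 if none)

# Something you can fix: recode the sentinel as NA
d$age[d$age == -99] <- NA

# Something you can test: flag values outside an assumed plausible range
which(d$age < 0 | d$age > 120)   # 3
```

Checks like these can be kept in the script that reads the data, so they are rerun every time the data change.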

Graphical approaches are really useful for data cleaning; we will discuss this more later on.

What R functions do you know that are useful for examination? What are your strategies?

Manipulation

Even once your data are clean, there can still be a lot of work in getting them into a format you want to use. “Data manipulation”, “munging” and “wrangling” are all terms for this activity. The RStudio folks have a useful cheat sheet.

This is a very, very hard, very general problem; everyone’s data problems and requirements are slightly different. But there are recurring themes.

Specialized data

Some data are in specialized formats, and often already cleaned. We will mention only in passing:

Tidy(ing) data

Hadley Wickham has defined a concept of tidy data, and has recently introduced the tidyr package.
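In tidy data, each variable is a column and each observation is a row. The sketch below shows one way to get there from a "wide" table using tidyr::pivot_longer(); it assumes the tidyr package (version 1.0.0 or later) is installed, and the data and column names are made up.

```r
library(tidyr)

# Made-up wide data: one column per year, so "year" is hidden in the column names
wide <- data.frame(
  site  = c("A", "B"),
  y2019 = c(10, 20),
  y2020 = c(12, 18)
)

# Gather the year columns into a (year, count) pair of columns
long <- pivot_longer(
  wide,
  cols         = c(y2019, y2020),
  names_to     = "year",
  values_to    = "count",
  names_prefix = "y"       # strip the leading "y" from the former column names
)
long
```

The result has one row per site-year combination, with columns site, year and count. Note that year comes back as character and would need as.integer() if it is to be used as a number.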

Tools

The tidyverse

base R