You should be able to: describe how R represents numeric, character, and factor data; handle missing (NA) values when reading and summarizing data; and check, clean, and reshape data into the format you need.
Numeric and character types are fairly straightforward, and you rarely have to worry about when and whether R represents things as integers or floating point. Very occasionally you will need to know that R is limited in its capacity to represent numbers of extremely large or small magnitude, and that floating-point computations are always approximate.
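For instance (a minimal sketch, not from the original notes; the values are chosen only to show the limits):

    0.1 + 0.2 == 0.3           ## FALSE: both sides are floating-point approximations
    all.equal(0.1 + 0.2, 0.3)  ## TRUE: comparison with a numerical tolerance
    .Machine$double.eps        ## smallest distinguishable relative difference (~2.2e-16)
    2^1024                     ## Inf: magnitude too large to represent as a double
    .Machine$integer.max       ## 2147483647, the largest representable integer
    ## .Machine$integer.max + 1L   ## would warn and return NA (integer overflow)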
You do need to know about factors, and to be aware when your variables are being treated as such.
When you input data, you need to be aware of NA (“not available”). Your read function has an option called na.strings which tells R which strings in your CSV file, for example, should be treated as missing. You need to know that NA == x is always NA; use is.na() to test for NA values, na.omit() to drop them, and the optional na.rm argument in some functions (mean, sum, median, …).

R has a big suite of functions for creating, testing and changing representations. These have names like factor(), as.numeric() and is.character().
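A small sketch of these tools in action (the vector, file name and missing-value codes below are invented for illustration):

    x <- c(1, 5, NA, 7)
    NA == x                 ## NA NA NA NA: comparing with NA always gives NA
    is.na(x)                ## FALSE FALSE TRUE FALSE
    mean(x)                 ## NA
    mean(x, na.rm = TRUE)   ## 4.33...
    na.omit(x)              ## drops the missing value
    ## Reading a file that codes missing values as "." or "-999" (hypothetical file):
    ## dat <- read.csv("mydata.csv", na.strings = c("NA", ".", "-999"))
    ## Testing and changing representations:
    is.character("7")       ## TRUE
    as.numeric("7") + 1     ## 8
    as.numeric("seven")     ## NA, with a coercion warning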
Factors are logically variables with a fixed number of categories. In R they are represented as integer variables that have a “levels” attribute. In other words, each possible value of the factor is given an integer, and the variable carries around the code that allows translation from this integer to its meaning.
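A minimal illustration of that representation (invented values):

    f <- factor(c("low", "high", "medium", "high"))
    levels(f)        ## "high" "low" "medium": the levels attribute (alphabetical by default)
    as.integer(f)    ## 2 1 3 1: the underlying integer codes
    table(f)         ## counts per level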
This system has advantages (compact storage; explicit control over the set and ordering of categories) and disadvantages: in particular, as.numeric(f) returns the underlying integer codes, which is not the same as as.numeric(as.character(f)).
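For example (with invented values):

    f <- factor(c("10", "20", "20", "30"))
    as.numeric(f)                  ## 1 2 2 3: the integer codes, probably not what you want
    as.numeric(as.character(f))    ## 10 20 20 30: the values the labels represent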
As a general rule, convert variables to factors at the last possible moment, when you have already fixed typos, combined different data sets, etc. The tidyverse functions readr::read_csv(), readr::read_table() and readxl::read_xlsx() will not automatically convert strings to factors. Before R 4.0.0, read.table() and read.csv() converted character data to factors by default; in those versions you can set options(stringsAsFactors = FALSE), or the stringsAsFactors = FALSE argument, to disable this behaviour (since R 4.0.0 it is disabled by default).
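A sketch of that workflow; the file name "survey.csv" and the treatment column below are hypothetical:

    ## readr does not create factors on input:
    ## dat <- readr::read_csv("survey.csv")
    ## Base R (before 4.0.0): disable automatic conversion explicitly:
    ## dat <- read.csv("survey.csv", stringsAsFactors = FALSE)
    dat <- data.frame(treatment = c("control", "Control", "drug"))
    dat$treatment[dat$treatment == "Control"] <- "control"  ## fix the typo while still character
    dat$treatment <- factor(dat$treatment)                  ## convert at the last moment
    levels(dat$treatment)                                    ## "control" "drug"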
You should think creatively, and early on, about how to check your data. Is it internally consistent? Are there extreme outliers? Are there typos? Are there certain values that really mean something else?
An American Airlines memo about fuel reporting from the 1980s complained of multiple cases of:
You should think about what you can test, and what you can fix if it’s broken.
Graphical approaches are really useful for data cleaning; we will discuss this more later on.
What R functions do you know that are useful for examination? What are your strategies?
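Some common starting points, sketched on a made-up data frame:

    dat <- data.frame(age = c(34, 29, 290, 41), sex = c("F", "M", "M", "f"))
    str(dat)                ## types and a preview of each column
    summary(dat)            ## min/max/quartiles reveal the impossible age of 290
    table(dat$sex)          ## tabulation reveals the inconsistent coding "f" vs "F"
    range(dat$age)          ## quick check for extreme outliers
    anyNA(dat)              ## any missing values?
    which(duplicated(dat))  ## any duplicated rows?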
Even once your data are clean, there can still be a lot of work getting it into a format you want to use. “Data manipulation”, “munging”, or “wrangling” are the terms covering this activity. The RStudio folks have a useful cheat sheet.
This is a very, very hard, very general problem; everyone’s data problems and requirements are slightly different. But there are recurring themes.
Some data are in specialized formats, and often already cleaned. We will mention only in passing:
Hadley Wickham has defined a concept of tidy data, and has introduced the tidyr package.

- tidyr package (reshaping): gather(), spread() (more recently superseded by pivot_longer() and pivot_wider())
- dplyr package: mutate(), select(), filter(), group_by(), summarise(), arrange(), *_join()
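A sketch of how these verbs fit together (the data frame is invented; requires the dplyr and tidyr packages):

    library(dplyr)
    library(tidyr)
    dat <- data.frame(site  = c("A", "A", "B", "B"),
                      year1 = c(5, 3, 8, 2),
                      year2 = c(6, 4, NA, 1))
    dat %>%
        gather(key = "year", value = "count", year1, year2) %>%  ## reshape wide to long
        filter(!is.na(count)) %>%                                 ## keep rows of interest
        group_by(site) %>%                                        ## per-site summaries
        summarise(total = sum(count)) %>%
        arrange(desc(total))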
In base R:

- reshape(): wide-to-long and vice versa
- merge(): join data frames
- ave(): compute averages by group
- subset(), [-indexing: select observations and variables
- transform(): modify variables and create new ones
- aggregate(): split-apply-summarize
- split(), lapply(), do.call(rbind, ...): split-apply-combine
- sort(): sorting
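A base-R sketch of the split-apply-combine pattern on invented data:

    dat <- data.frame(site  = c("A", "A", "B", "B"),
                      count = c(5, 3, 8, 2))
    ## One-step summary by group:
    aggregate(count ~ site, data = dat, FUN = mean)
    ## The same result spelled out with split / lapply / do.call(rbind, ...):
    pieces  <- split(dat, dat$site)                    ## split into per-site data frames
    results <- lapply(pieces, function(d)              ## apply a summary to each piece
        data.frame(site = d$site[1], mean_count = mean(d$count)))
    do.call(rbind, results)                            ## combine back into one data frame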