Numeric and character types are fairly straightforward, and you rarely have to worry about when and whether R represents things as integers or floating point. Very occasionally you will need to know that R is limited in its capacity to represent numbers of extremely large or small magnitude, and that floating-point computations are always approximate.
You do need to know about factors, and to be aware when your variables are being treated as such.
When you input data, you need to be aware of NA (“not available”). Your read function has an option called na.strings (base R, read.*) or na (tidyverse, read_*) which you can use to communicate between R and your CSV files, for example to specify that “*” denotes a missing value in a spreadsheet. You need to know that
NA == x is always NA; use is.na() to test for NA values, na.omit() to drop them, and the optional na.rm argument in some functions (mean, sum, median, …).

R has a big suite of functions for creating, testing, and changing representations. These have names like factor(), as.numeric(), and is.character().
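A brief sketch of NA behavior and of the representation functions (the tiny CSV here is invented for illustration):

```r
## reading a small CSV in which "*" marks missing values
dd <- read.csv(text = "x,y\n1,2\n*,4", na.strings = "*")

x <- dd$x              # 1 NA
x == NA                # NA NA: any comparison with NA is NA
is.na(x)               # FALSE TRUE: the correct test
na.omit(x)             # 1
mean(x)                # NA
mean(x, na.rm = TRUE)  # 1

## representation functions
is.character(x)        # FALSE
as.numeric("3.5")      # 3.5
factor(c("low", "high"))
```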
Logically, factors are variables with a fixed set of categories. In R they are represented as integer variables that carry a levels() attribute: each possible value of the factor is assigned an integer, and the variable carries around the code that translates each integer back to its meaning.
This system has advantages:
… and disadvantages: for example, as.numeric(f) is not the same as as.numeric(as.character(f)).

As a general rule, convert variables to factors as late as possible, after you have already fixed typos, combined different data sets, etc.
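A minimal illustration of this pitfall (factor values invented):

```r
f <- factor(c("10", "20", "30"))

as.numeric(f)                # 1 2 3: the internal integer codes
as.numeric(as.character(f))  # 10 20 30: the original values
as.numeric(levels(f))[f]     # 10 20 30: equivalent, slightly more efficient
```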
You should think creatively, and early on, about how to check your data. Is it internally consistent? Are there extreme outliers? Are there typos? Are there certain values that really mean something else?
An American Airlines memo about fuel reporting from the 1980s complained of multiple cases of:
You should think about what you can test, and what you can fix if it’s broken.
Graphical approaches are really useful for data cleaning; we will discuss this more later on.
What R functions do you know that are useful for examination? What are your strategies?
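Some commonly used starting points, sketched on an invented data frame with a deliberate outlier:

```r
dd <- data.frame(species = c("a", "b", "b", "a"),
                 mass    = c(5.2, 4.8, 480, 5.1))  # 480 looks like a typo

str(dd)               # types and a preview of each column
summary(dd)           # min/max expose the outlier in mass
table(dd$species)     # category counts catch misspelled levels
range(dd$mass)        # 4.8 480
which(dd$mass > 100)  # 3: locate the suspicious row
```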
If you’ve decided on a check, embed stopifnot() (or functions from the assertthat package) in your code to throw an error if something unexpected happens.
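For example (the data frame and column name are invented):

```r
dd <- data.frame(mass = c(5.2, 4.8, 5.1))

## fail early if assumptions about the data are violated
stopifnot(!anyNA(dd$mass), all(dd$mass > 0))
```

If either condition is FALSE, the script stops with an error naming the failed condition, instead of silently producing nonsense downstream.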
Even once your data are clean, there can still be a lot of work getting them into the format you want to use. “Data manipulation”, “munging”, or “wrangling” are the terms covering this activity. The RStudio folks have a useful cheat sheet.
This is a very, very hard, very general problem; everyone’s data problems and requirements are slightly different. But there are recurring themes.
Some data are in specialized formats, and often already cleaned. We will mention only in passing:
Hadley Wickham has defined a concept of tidy data, and has
recently introduced the tidyr package.
Many analysis and graphics functions in R (especially those in the ggplot2 package) work best with long-format data. The tidyr package provides:

- pivot_longer(), pivot_wider() (reshaping)
- separate(), replace_na()

The dplyr package provides:

- mutate(), select(), filter(), group_by(), summarise(), arrange(), *_join()

See part 4 of the data carpentry for ecologists lesson, especially the pivoting pictures, e.g. sp_by_plot_wide %>% pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT").
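A self-contained sketch of the same pivot, with a small invented table standing in for sp_by_plot_wide (requires the tidyr package):

```r
library(tidyr)

## hypothetical wide table: one column of mean weights per plot
sp_wide <- data.frame(species_id = c("DM", "DO"),
                      plot1 = c(41.2, 48.7),
                      plot2 = c(43.1, 49.5))

sp_long <- pivot_longer(sp_wide, cols = -species_id,
                        names_to = "PLOT", values_to = "MEAN_WT")
## sp_long: 4 rows, columns species_id, PLOT, MEAN_WT

## and back again
pivot_wider(sp_long, names_from = "PLOT", values_from = "MEAN_WT")
```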
In base R:

- reshape(): wide-to-long and vice versa
- merge(): join data frames
- ave(): compute averages by group
- subset(), [-indexing: select observations and variables
- transform(): modify variables and create new ones
- aggregate(): split-apply-summarize
- split(), lapply(), do.call(rbind, ...): split-apply-combine
- sort()
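A base-R sketch of a few of these (data invented for illustration):

```r
dd <- data.frame(species = c("a", "a", "b", "b"),
                 plot    = c(1, 2, 1, 2),
                 mass    = c(5.2, 5.0, 4.8, 4.6))

aggregate(mass ~ species, data = dd, FUN = mean)   # a: 5.1, b: 4.7
subset(dd, mass > 4.7, select = c(species, mass))  # select obs and vars
dd2 <- transform(dd, log_mass = log(mass))         # add a derived variable

## long-to-wide: one row per species, one mass column per plot
reshape(dd, direction = "wide", idvar = "species",
        timevar = "plot", v.names = "mass")
```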