This is a joint example of (1) rearranging data and (2) graphing data.
Data are originally from here
(it might be possible to scrape this data set using the
XML
package …).
I also got data on population size in 2011 from Wikipedia .
Read data (from two separate files).
library(tidyverse)
## using col_types=cols() suppresses the messages about which type is which
dat <- read_csv("../docs/data/CA_homicide.csv", col_types=cols())
popdat <- read_csv("../docs/data/CA_popdat.csv", col_types=cols())
These data are in wide format:
head(dat)
## # A tibble: 6 × 6
## Place `2007` `2008` `2009` `2010` `2011`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Canada 1.8 1.83 1.81 1.62 1.73
## 2 Newfoundland and Labrador 0.59 0.99 0.2 0.78 0.78
## 3 Prince Edward Island 0 1.43 0 0 0.69
## 4 Nova Scotia 1.39 1.28 1.6 2.22 2.33
## 5 New Brunswick 1.07 0.4 1.6 1.2 1.06
## 6 Quebec 1.17 1.19 1.12 1.06 1.32
Note that read_csv()
has not modified
the column names to make them legal R variables names
(read.csv()
will do that by default: use
check.names=FALSE
to suppress this)
What if we want combine other information?
rdat <- tibble(Place=dat$Place,
Region=c("all",rep("Atlantic",4),
rep("East",2),
rep("West",4),
rep("North",3)))
head(rdat)
## # A tibble: 6 × 2
## Place Region
## <chr> <chr>
## 1 Canada all
## 2 Newfoundland and Labrador Atlantic
## 3 Prince Edward Island Atlantic
## 4 Nova Scotia Atlantic
## 5 New Brunswick Atlantic
## 6 Quebec East
Let’s start by converting the data to long form:
sdat <- (dat
%>% pivot_longer(names_to="year",values_to="homicides",-Place,
names_transform=list(year=as.numeric))
)
head(sdat)
## # A tibble: 6 × 3
## Place year homicides
## <chr> <dbl> <dbl>
## 1 Canada 2007 1.8
## 2 Canada 2008 1.83
## 3 Canada 2009 1.81
## 4 Canada 2010 1.62
## 5 Canada 2011 1.73
## 6 Newfoundland and Labrador 2007 0.59
(we use names_transform
to convert the years back to
numeric values)
Now combine all three data sets (full_join
will
automatically match all columns with identical names across the data
sets, but it’s better practice to specify the matching columns
explicitly).
sdat2 <- sdat %>%
full_join(rdat,by="Place") %>%
full_join(popdat,by="Place")
If we just used the original data set (without the added stuff), it’s fairly easy to get summary statistics by dropping the first row (so that we have a data frame that is all numeric) and computing means of rows and columns:
dmat <- as.matrix(dat[,-1])
rownames(dmat) <- dat$Place
rowMeans(dmat) ## means by place
## Canada Newfoundland and Labrador Prince Edward Island
## 1.758 0.668 0.424
## Nova Scotia New Brunswick Quebec
## 1.764 1.066 1.172
## Ontario Manitoba Saskatchewan
## 1.386 4.432 3.262
## Alberta British Columbia Yukon
## 2.622 2.218 4.806
## Northwest Territories Nunavut
## 5.038 18.584
colMeans(dmat) ## means by year
## 2007 2008 2009 2010 2011
## 3.812143 3.587857 3.588571 3.040000 3.542857
(Don’t forget the na.rm
argument, unnecessary in this
case, that can be provided to most R summary functions to get them to
ignore NA
values.)
If we want summary statistics from the full data set we can do
sdat2 %>%
group_by(Place) %>%
summarise(homicides=mean(homicides))
## fancier
sdat2 %>%
group_by(year) %>%
summarise(across(homicides,list(mean=mean,sd=sd)))
One more useful technique is reordering factors (representing categorical variables) in a sensible way. Right now the ‘places’ (provinces, territories, etc.) are ordered alphabetically, R’s default.
sdat3 <- sdat2 %>%
mutate(Place=forcats::fct_reorder(Place,Pop_2011))
## equivalent
sdat3 <- sdat2 %>%
mutate(across(Place,~forcats::fct_reorder(.,Pop_2011)))
This will be useful in the future, but is different from the
order the data frame is stored in, which we can modify
via arrange()
(use desc(Pop_2011)
to arrange
in descending order of population):
(sdat3
%>% arrange(desc(Pop_2011))
%>% head()
)
## # A tibble: 6 × 5
## Place year homicides Region Pop_2011
## <fct> <dbl> <dbl> <chr> <dbl>
## 1 Canada 2007 1.8 all 33476688
## 2 Canada 2008 1.83 all 33476688
## 3 Canada 2009 1.81 all 33476688
## 4 Canada 2010 1.62 all 33476688
## 5 Canada 2011 1.73 all 33476688
## 6 Ontario 2007 1.58 East 12851821
I can also summarise by combinations of variables:
sdat3 %>% group_by(year,Region) %>%
summarise(across(homicides,mean),.groups="drop")
## # A tibble: 25 × 3
## year Region homicides
## <dbl> <chr> <dbl>
## 1 2007 Atlantic 0.762
## 2 2007 East 1.38
## 3 2007 North 11.0
## 4 2007 West 3.16
## 5 2007 all 1.8
## 6 2008 Atlantic 1.02
## 7 2008 East 1.27
## 8 2008 North 9.53
## 9 2008 West 3.29
## 10 2008 all 1.83
## # ℹ 15 more rows
What if I want the mean and standard error? R doesn’t have a built-in “standard error of the mean” function so I define one when I need it:
sem <- function(x) { sd(x)/sqrt(length(x)) }
region_avgs <- sdat3 %>% group_by(year,Region) %>%
summarise(across(homicides,list(mean=~mean(.,na.rm=TRUE),
sem=sem)),
.groups="drop")
head(region_avgs)
## # A tibble: 6 × 4
## year Region homicides_mean homicides_sem
## <dbl> <chr> <dbl> <dbl>
## 1 2007 Atlantic 0.762 0.303
## 2 2007 East 1.38 0.205
## 3 2007 North 11.0 5.69
## 4 2007 West 3.16 0.677
## 5 2007 all 1.8 NA
## 6 2008 Atlantic 1.02 0.227
Question: why do I have NA values?
Drilling down to check some values:
sdat3 %>% filter(year==2007 & Region=="all")
## # A tibble: 1 × 5
## Place year homicides Region Pop_2011
## <fct> <dbl> <dbl> <chr> <dbl>
## 1 Canada 2007 1.8 all 33476688
Sometimes it’s useful to be able to go from long to wide format.
pivot_wider()
is the opposite of
pivot_longer()
: we specify a column in the current data set
to spread out into new columns (key
) and a column to use as
the vales for the table (value
)
(region_avgs
%>% select(-homicides_sem)
%>% pivot_wider(names_from=Region,values_from=homicides_mean)
)
## # A tibble: 5 × 6
## year Atlantic East North West all
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2007 0.762 1.38 11.0 3.16 1.8
## 2 2008 1.02 1.27 9.53 3.29 1.83
## 3 2009 0.85 1.24 9.71 3.36 1.81
## 4 2010 1.05 1.25 7.81 2.70 1.62
## 5 2011 1.22 1.26 9.29 3.15 1.73
Save the results:
saveRDS(sdat3,file="../docs/data/CA_homicide.rds")
For analysis in R, it is generally best to keep your data in long
format and pivot_wider()
it as necessary (e.g. when
creating a human-readable table for output).
Load data from previous computations (can also get it here
mdat <- readRDS("../docs/data/CA_homicide.rds")
One of the advantages of long format is that it allows us to use some
of R’s more powerful graphics tools such as the ggplot2
and
lattice
packages (and it’s what most statistics packages
expect):
library(ggplot2)
theme_set(theme_bw()) ## black-and-white theme
## set up basic plot:
p1 <- ggplot(mdat,aes(year,homicides,colour=Place))
Unlike base plots (which can only be saved/replayed through the
RStudio interface), ggplot
produces an R object which can
then be printed (=displayed in a graphics window), or saved (or exported
to a graphics file via ggsave()
):
print(p1+geom_line())
We could add both lines and points:
print(p1+geom_line() +geom_point())
Might be better on a log scale, with a sensible y-axis label:
p1L <- (p1
+ geom_line()
+ scale_y_log10()
+ labs(y="Homicides per 100,000 population")
)
print(p1L)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
(Zero values get squashed against the lower axis)
Maybe we don’t care about time at all:
b1 <- (ggplot(mdat,aes(x=Place,y=homicides,
colour=Region))
+ geom_boxplot(outlier.colour=NULL) ## set outlier points to same colour as boxes
+ scale_y_log10()
+ labs(y="Homicides per 100,000 population")
)
print(b1)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The x-axis tick labels overlap enough to be unreadable (unless we resize the plot to be ridiculously long and narrow).
We could rotate them 90 degrees to be vertical:
b1_vertlabels <- b1+theme(axis.text.x=element_text(angle=90))
print(b1_vertlabels)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
In general if you want to tweak a ggplot
plot, Google it
or search the ggplot theme
documentation or the ggplot
cheat sheet for more information …
Rotating the whole plot is less familiar, but arguably better. Here
I’m also (1) changing the colour palette and (2) changing the order of
the Place
variable, using %+%
to substitute a
different set of data into an existing plot:
mdat_sort <- mdat %>% mutate(across(Place,~forcats::fct_reorder(.,homicides)))
print(b1
%+% mdat_sort ## substitute sorted data
+ coord_flip() ## rotate entire plot
+ xlab("") ## x-label redundant
+ scale_colour_brewer(palette="Dark2") ## change palette
)
Maybe we want to make our line graph less busy:
print(p1L+facet_wrap(~Region))
We could also code population size by line width:
p2 <- (ggplot(mdat,
aes(year,homicides,colour=Region,size=Pop_2011,
group=Place))
+ geom_line(alpha=0.5)
+ scale_y_log10()
+ scale_size_continuous(trans="log10")
+ labs(y="Homicides per 100,000 population")
)
print(p2)
Using the directlabels package:
library(directlabels)
last.bumpup <- list("last.points","bumpup")
print(p1L
+ expand_limits(x=2014) ## add a little more space
+ geom_dl(aes(label=Place), method="last.bumpup")
+ theme(legend.position="none") ## don't need the legend any more
)
We’d have to work a little harder to avoid clipping the “Yukon” label …