This is a joint example of (1) rearranging data and (2) graphing data.

Rearranging data

Read data in wide format and play with it

Data are originally from here (it might be possible to scrape this data set using the XML package …).

I also got data on population size in 2011 from Wikipedia .

Read data (from two separate files).

library(tidyverse)
## using col_types=cols() suppresses the messages about which type is which
dat <- read_csv("../docs/data/CA_homicide.csv", col_types=cols())
popdat <- read_csv("../docs/data/CA_popdat.csv", col_types=cols())

These data are in wide format:

head(dat)
## # A tibble: 6 × 6
##   Place                     `2007` `2008` `2009` `2010` `2011`
##   <chr>                      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Canada                      1.8    1.83   1.81   1.62   1.73
## 2 Newfoundland and Labrador   0.59   0.99   0.2    0.78   0.78
## 3 Prince Edward Island        0      1.43   0      0      0.69
## 4 Nova Scotia                 1.39   1.28   1.6    2.22   2.33
## 5 New Brunswick               1.07   0.4    1.6    1.2    1.06
## 6 Quebec                      1.17   1.19   1.12   1.06   1.32

Note that read_csv() has not modified the column names to make them legal R variables names (read.csv() will do that by default: use check.names=FALSE to suppress this)

What if we want combine other information?

rdat <- tibble(Place=dat$Place,
      Region=c("all",rep("Atlantic",4),
               rep("East",2),
               rep("West",4),
               rep("North",3)))
head(rdat)
## # A tibble: 6 × 2
##   Place                     Region  
##   <chr>                     <chr>   
## 1 Canada                    all     
## 2 Newfoundland and Labrador Atlantic
## 3 Prince Edward Island      Atlantic
## 4 Nova Scotia               Atlantic
## 5 New Brunswick             Atlantic
## 6 Quebec                    East

Let’s start by converting the data to long form:

sdat <- (dat
    %>% pivot_longer(names_to="year",values_to="homicides",-Place,
                     names_transform=list(year=as.numeric))
)
head(sdat)
## # A tibble: 6 × 3
##   Place                      year homicides
##   <chr>                     <dbl>     <dbl>
## 1 Canada                     2007      1.8 
## 2 Canada                     2008      1.83
## 3 Canada                     2009      1.81
## 4 Canada                     2010      1.62
## 5 Canada                     2011      1.73
## 6 Newfoundland and Labrador  2007      0.59

(we use names_transform to convert the years back to numeric values)

Now combine all three data sets (full_join will automatically match all columns with identical names across the data sets, but it’s better practice to specify the matching columns explicitly).

sdat2 <- sdat %>%
    full_join(rdat,by="Place") %>%
    full_join(popdat,by="Place")

If we just used the original data set (without the added stuff), it’s fairly easy to get summary statistics by dropping the first row (so that we have a data frame that is all numeric) and computing means of rows and columns:

dmat <- as.matrix(dat[,-1])
rownames(dmat) <- dat$Place
rowMeans(dmat)  ## means by place
##                    Canada Newfoundland and Labrador      Prince Edward Island 
##                     1.758                     0.668                     0.424 
##               Nova Scotia             New Brunswick                    Quebec 
##                     1.764                     1.066                     1.172 
##                   Ontario                  Manitoba              Saskatchewan 
##                     1.386                     4.432                     3.262 
##                   Alberta          British Columbia                     Yukon 
##                     2.622                     2.218                     4.806 
##     Northwest Territories                   Nunavut 
##                     5.038                    18.584
colMeans(dmat)  ## means by year
##     2007     2008     2009     2010     2011 
## 3.812143 3.587857 3.588571 3.040000 3.542857

(Don’t forget the na.rm argument, unnecessary in this case, that can be provided to most R summary functions to get them to ignore NA values.)

If we want summary statistics from the full data set we can do

sdat2 %>%
    group_by(Place) %>%
    summarise(homicides=mean(homicides))
## fancier
sdat2 %>%
    group_by(year) %>%
    summarise(across(homicides,list(mean=mean,sd=sd)))

One more useful technique is reordering factors (representing categorical variables) in a sensible way. Right now the ‘places’ (provinces, territories, etc.) are ordered alphabetically, R’s default.

sdat3 <- sdat2 %>%
    mutate(Place=forcats::fct_reorder(Place,Pop_2011))
## equivalent
sdat3 <- sdat2 %>%
    mutate(across(Place,~forcats::fct_reorder(.,Pop_2011)))

This will be useful in the future, but is different from the order the data frame is stored in, which we can modify via arrange() (use desc(Pop_2011) to arrange in descending order of population):

(sdat3
    %>% arrange(desc(Pop_2011))
    %>% head()
)
## # A tibble: 6 × 5
##   Place    year homicides Region Pop_2011
##   <fct>   <dbl>     <dbl> <chr>     <dbl>
## 1 Canada   2007      1.8  all    33476688
## 2 Canada   2008      1.83 all    33476688
## 3 Canada   2009      1.81 all    33476688
## 4 Canada   2010      1.62 all    33476688
## 5 Canada   2011      1.73 all    33476688
## 6 Ontario  2007      1.58 East   12851821

I can also summarise by combinations of variables:

sdat3 %>% group_by(year,Region) %>%
    summarise(across(homicides,mean),.groups="drop")
## # A tibble: 25 × 3
##     year Region   homicides
##    <dbl> <chr>        <dbl>
##  1  2007 Atlantic     0.762
##  2  2007 East         1.38 
##  3  2007 North       11.0  
##  4  2007 West         3.16 
##  5  2007 all          1.8  
##  6  2008 Atlantic     1.02 
##  7  2008 East         1.27 
##  8  2008 North        9.53 
##  9  2008 West         3.29 
## 10  2008 all          1.83 
## # ℹ 15 more rows

What if I want the mean and standard error? R doesn’t have a built-in “standard error of the mean” function so I define one when I need it:

sem <- function(x) { sd(x)/sqrt(length(x)) }
region_avgs <- sdat3 %>% group_by(year,Region) %>%
    summarise(across(homicides,list(mean=~mean(.,na.rm=TRUE),
                                    sem=sem)),
              .groups="drop")
head(region_avgs)
## # A tibble: 6 × 4
##    year Region   homicides_mean homicides_sem
##   <dbl> <chr>             <dbl>         <dbl>
## 1  2007 Atlantic          0.762         0.303
## 2  2007 East              1.38          0.205
## 3  2007 North            11.0           5.69 
## 4  2007 West              3.16          0.677
## 5  2007 all               1.8          NA    
## 6  2008 Atlantic          1.02          0.227

Question: why do I have NA values?

Drilling down to check some values:

sdat3 %>% filter(year==2007 & Region=="all")
## # A tibble: 1 × 5
##   Place   year homicides Region Pop_2011
##   <fct>  <dbl>     <dbl> <chr>     <dbl>
## 1 Canada  2007       1.8 all    33476688

Sometimes it’s useful to be able to go from long to wide format. pivot_wider() is the opposite of pivot_longer(): we specify a column in the current data set to spread out into new columns (key) and a column to use as the vales for the table (value)

(region_avgs
    %>% select(-homicides_sem)
    %>% pivot_wider(names_from=Region,values_from=homicides_mean)
)
## # A tibble: 5 × 6
##    year Atlantic  East North  West   all
##   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2007    0.762  1.38 11.0   3.16  1.8 
## 2  2008    1.02   1.27  9.53  3.29  1.83
## 3  2009    0.85   1.24  9.71  3.36  1.81
## 4  2010    1.05   1.25  7.81  2.70  1.62
## 5  2011    1.22   1.26  9.29  3.15  1.73

Save the results:

saveRDS(sdat3,file="../docs/data/CA_homicide.rds")

For analysis in R, it is generally best to keep your data in long format and pivot_wider() it as necessary (e.g. when creating a human-readable table for output).

Pictures

Load data from previous computations (can also get it here

mdat <- readRDS("../docs/data/CA_homicide.rds")

One of the advantages of long format is that it allows us to use some of R’s more powerful graphics tools such as the ggplot2 and lattice packages (and it’s what most statistics packages expect):

library(ggplot2)
theme_set(theme_bw())  ## black-and-white theme
## set up basic plot:
p1 <- ggplot(mdat,aes(year,homicides,colour=Place))

Unlike base plots (which can only be saved/replayed through the RStudio interface), ggplot produces an R object which can then be printed (=displayed in a graphics window), or saved (or exported to a graphics file via ggsave()):

print(p1+geom_line())

We could add both lines and points:

print(p1+geom_line() +geom_point())

Might be better on a log scale, with a sensible y-axis label:

p1L <- (p1
    + geom_line()
    + scale_y_log10()
    + labs(y="Homicides per 100,000 population")
)
print(p1L)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.

(Zero values get squashed against the lower axis)

Maybe we don’t care about time at all:

b1 <- (ggplot(mdat,aes(x=Place,y=homicides,
                       colour=Region))
    + geom_boxplot(outlier.colour=NULL)  ## set outlier points to same colour as boxes
    + scale_y_log10()
    + labs(y="Homicides per 100,000 population")
)
print(b1)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The x-axis tick labels overlap enough to be unreadable (unless we resize the plot to be ridiculously long and narrow).

We could rotate them 90 degrees to be vertical:

b1_vertlabels <- b1+theme(axis.text.x=element_text(angle=90))
print(b1_vertlabels)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

In general if you want to tweak a ggplot plot, Google it or search the ggplot theme documentation or the ggplot cheat sheet for more information …

Rotating the whole plot is less familiar, but arguably better. Here I’m also (1) changing the colour palette and (2) changing the order of the Place variable, using %+% to substitute a different set of data into an existing plot:

mdat_sort <- mdat %>% mutate(across(Place,~forcats::fct_reorder(.,homicides)))
print(b1
      %+% mdat_sort  ## substitute sorted data
      + coord_flip()      ## rotate entire plot
      + xlab("")          ## x-label redundant
      + scale_colour_brewer(palette="Dark2") ## change palette
      )

Maybe we want to make our line graph less busy:

print(p1L+facet_wrap(~Region))

We could also code population size by line width:

p2 <- (ggplot(mdat,
             aes(year,homicides,colour=Region,size=Pop_2011,
                 group=Place))
    + geom_line(alpha=0.5)
    + scale_y_log10()
    + scale_size_continuous(trans="log10")
    + labs(y="Homicides per 100,000 population")
)
print(p2)

Using the directlabels package:

library(directlabels)
last.bumpup <- list("last.points","bumpup")
print(p1L
    + expand_limits(x=2014)  ## add a little more space
    + geom_dl(aes(label=Place), method="last.bumpup") 
    + theme(legend.position="none")  ## don't need the legend any more
)

We’d have to work a little harder to avoid clipping the “Yukon” label …