Pipelines

One of the main concepts of this course is pipelines. This means that the various steps of your project are all carefully recorded and systematized. The basic idea is that you should be able to delete any results of computer calculations at any time, and then quickly re-do them.

Ideally, your project should depend on:

… and nothing else

Advantages of pipelines

Automation

A corollary of the flexibility idea is that you should automate whenever possible. Cutting and pasting, or editing by hand, is not only tedious, error-prone and hard to replicate; it also traps you into depending on particular data sets. DRY (“don’t repeat yourself”) is a programming maxim.

Some examples:

R tools for managing projects

Spreadsheets

R can read and write in spreadsheet form. Usually the best way to do this is with read.csv and write.csv. A good model for working in R is to read in raw data from spreadsheets and manipulate it in R.

Your goal in many cases should be to take raw data from a spreadsheet and manipulate it entirely using scripts. The only time to put it back in a spreadsheet would be to share it with colleagues who want to play with your results.
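The read, manipulate, write workflow described here can be sketched as follows. This is only an illustration: the file names, the columns site and count, and the toy data are all made up, and the script creates its own "raw" file in a temporary directory so that it can run on its own.

```r
# Stand-in for a raw spreadsheet export (hypothetical file and columns).
raw_file <- file.path(tempdir(), "raw_counts.csv")
write.csv(
  data.frame(site = c("A", "A", "B", "B"), count = c(3, 5, 2, 8)),
  raw_file,
  row.names = FALSE
)

# Step 1: read the raw data. The raw file itself is never edited by hand.
raw <- read.csv(raw_file)

# Step 2: do all manipulation in the script (here, mean count per site).
site_means <- aggregate(count ~ site, data = raw, FUN = mean)

# Step 3: write the result to a separate file, e.g. to share with colleagues.
out_file <- file.path(tempdir(), "site_means.csv")
write.csv(site_means, out_file, row.names = FALSE)
```

Because the raw file is only ever read, the output file can be deleted at any time and regenerated by re-running the script.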

Metadata

Databases

Large or evolving data sets may need database tools. As of 2015,

This is a possible advanced topic

R

If you are following the pipeline philosophy, it will be useful to make scripts that do discrete tasks, and depend on each other’s outputs.

You can do this by using the R commands load and save: if you have an R script that you are happy with and that is producing some nice variables, you can write save(var1, var2, var3, file="happy.RData") at the end of that script, and then load("happy.RData") at the beginning of some other scripts.
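A minimal sketch of this hand-off, with both "scripts" shown in one block for convenience. The variable contents are invented for illustration, and the file is written to a temporary directory rather than the project folder.

```r
# Hypothetical location for the saved results.
rdata_file <- file.path(tempdir(), "happy.RData")

# --- end of the first script: save the variables you are happy with ---
var1 <- 1:10
var2 <- mean(var1)
save(var1, var2, file = rdata_file)

# --- beginning of a later script: restore them ---
rm(var1, var2)              # simulate starting from a clean session
loaded <- load(rdata_file)  # load() returns the names of the restored objects
print(loaded)

# The restored variables can now be used as if they were computed here.
summary_value <- var2 * 2
```

Note that load restores objects under their original names, so it will silently overwrite any existing objects with the same names in the workspace.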

How should you get more information about these commands?

Co-ordination tools

You should try to have a pipeline-driven project where as many steps as possible are clearly documented and easily replicable.

There are many, many techniques for this, and we will try not to get too deeply into them in the course (unless it’s a Requested Topic).

Some of the tools that we like include:

R Markdown

R Markdown is a format that lets you incorporate R code into your documents. You can develop and document your project at the same time, and your documentation can evolve directly into a document that can be submitted to a supervisory committee or a journal.

RStudio makes a great platform for composing R markdown files, and there’s really good documentation on this format at the RStudio web site. To recap some of the most important features of R Markdown:

Make

make is a system for controlling how files are made from other files. It is useful for making your project self-documenting: you write down the steps of your pipeline in a permanent fashion, in order to make them work.
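The core rule behind make (rebuild a file only when it is missing or older than the files it depends on) can be illustrated in R. This is not make itself, just a rough sketch of the idea; the helper names needs_rebuild and build_if_stale, and the raw.csv and clean.csv files, are hypothetical.

```r
# TRUE if the target is missing or older than any of its dependencies.
needs_rebuild <- function(target, deps) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(deps) > file.mtime(target))
}

# Run the recipe only when needed; report whether it ran.
build_if_stale <- function(target, deps, recipe) {
  if (needs_rebuild(target, deps)) {
    recipe()
    TRUE
  } else {
    FALSE
  }
}

# Hypothetical pipeline step: clean.csv is made from raw.csv.
work  <- tempdir()
raw   <- file.path(work, "raw.csv")
clean <- file.path(work, "clean.csv")
unlink(clean)  # start without a target
write.csv(data.frame(x = c(1, NA, 3)), raw, row.names = FALSE)

clean_step <- function() {
  d <- read.csv(raw)
  write.csv(d[!is.na(d$x), , drop = FALSE], clean, row.names = FALSE)
}

first_run  <- build_if_stale(clean, raw, clean_step)  # target missing, so it runs
second_run <- build_if_stale(clean, raw, clean_step)  # target up to date, so it is skipped
```

In a real project, a makefile states the same target, dependency and recipe relationships declaratively, and make works out the order of the steps for you.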

The GNU make documentation is very useful, but not necessarily easy to get into.