Data Science Opportunity

Posted 3 December 2021

Update 4 January 2022 – We are not currently accepting new applications for this position, but we may have similar opportunities soon. If you are interested in this kind of work, please email us.

The McMaster Theobio Lab is part of the CANMOD network, which has provided funding to accelerate our development of epidemiological tools, datasets, and infrastructure for supporting research and public health applications. We are looking for a full-time junior data scientist to join our CANMOD data team, which is leading this work. The position is 35 hours a week for 15 months, at an hourly rate of $27.00 to $41.74, depending on experience.

The data team wants to make it easier for infectious disease modellers to get straightforward and convenient access to historical and publicly available infectious disease data, so that modellers can spend more time modelling and less time locating, accessing, and preparing data. There is a tremendous amount of publicly available information that is locked up in typewritten and even handwritten documents. Epidemiologists who would like to make use of this information face enormous data access, digitization, preparation, and quality assurance barriers. Our group has been entering this information into digital data files, which we are curating so that the broader epidemiological community can directly make use of them. See CANMOD Digitization.

The ideal candidate will be excited by, and capable of working independently in, all of the following areas. Strength in one or two of them, with an ability to contribute to several others, would suggest a good fit.

Develop, maintain, and/or optimize open source data digitization pipelines that convert source documents into tidy datasets that are ready for epidemiological modelling and analysis
Oversee and implement data management best practices to ensure that digitized data are FAIR (findable, accessible, interoperable, and reusable)
Build quality assurance pipelines that can be used to detect and correct errors of data entry and processing, as well as detect cases for which the sources themselves likely contain errors
Develop data pipelines that connect our historical data to existing online streams of current data
Build and maintain an API for programmatic access to digitized data
Build R and/or Python packages – sitting on top of this API – designed to take digitized data as inputs to modelling and analysis projects (e.g. time-series visualization, missing-value imputation)
Develop and/or integrate with web applications – also sitting on top of the API – to provide data searching, visualization, and download capabilities without the need for coding
Build a containerized open tool chain that provides a computing environment that is set up to access curated epidemiological data and connects it with a suite of open source analysis tools

The ideal candidate will also have a strong foundation in R programming and experience with most of the following:

Unix/Linux command line
Version control with git and GitHub
Regular expressions
Docker
Python/Jupyter
JavaScript
C/C++
Statistical modelling
Epidemiological modelling
Data visualization

Please apply directly through McMaster HR (click on positions for Staff and search for Job ID 42287). We appreciate all replies to this advertisement, but only applicants under consideration will be contacted.