Introduce some principles of data management
Introduce some practical, modern tools
Show some examples
Data must be:
Organized
Cleaned
Manipulated
Analyzed
The first three often take the most time, largely because researchers undervalue them.
There’s never time to do it right … but there’s always time to do it over!
library(tidyverse)
villageTable <- read_csv("../data/village.csv")
## Rows: 180 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): area, grid, vname, vu
## dbl (7): village, fup, grouped1, grouped2, scattered1, scattered2, population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
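That message is worth taking seriously: declaring column types explicitly turns silent parsing changes into loud errors. A minimal sketch (the type choices are read off the output above, not checked against the raw file):

villageTable <- read_csv("../data/village.csv",
    col_types = cols(
        area = col_character(), grid = col_character(),
        vname = col_character(), vu = col_character(),
        .default = col_double()   # the remaining columns are numeric
    )
)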
summary(villageTable)
## area village grid fup
## Length:180 Min. : 1.0 Length:180 Min. : 1.000
## Class :character 1st Qu.:211.2 Class :character 1st Qu.: 3.000
## Mode :character Median :362.5 Mode :character Median : 6.000
## Mean :402.0 Mean : 6.217
## 3rd Qu.:565.2 3rd Qu.: 7.000
## Max. :967.0 Max. :92.000
## vname vu grouped1 grouped2
## Length:180 Length:180 Min. : 0.00 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.0000
## Mode :character Mode :character Median : 24.50 Median : 0.0000
## Mean : 37.06 Mean : 0.7833
## 3rd Qu.: 52.25 3rd Qu.: 1.0000
## Max. :550.00 Max. :10.0000
## scattered1 scattered2 population
## Min. : 0.00 Min. :0.00000 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:0.00000 1st Qu.: 124.0
## Median : 6.00 Median :0.00000 Median : 222.0
## Mean : 14.29 Mean :0.01111 Mean : 287.1
## 3rd Qu.: 22.00 3rd Qu.:0.00000 3rd Qu.: 375.0
## Max. :117.00 Max. :1.00000 Max. :2556.0
parasiteTable <- read_csv("../data/parademo.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 138197 Columns: 39
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (36): id, village, compound, survey, person, sch, _add, tribe, hght, sl...
## date (3): datecoll, datep, dateexam
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
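The warning above is worth following up: readr's problems() returns one row per parsing issue, so you can see which columns and rows are affected (a sketch):

parseProblems <- problems(parasiteTable)
print(parseProblems %>% count(col, expected))   # which columns, and what was expected there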
summary(parasiteTable)
## id village compound survey person
## Min. : 1 Min. : 51.0 Min. : 1 Min. : 1.00 Min. : 1.000
## 1st Qu.: 2671 1st Qu.:202.0 1st Qu.: 40 1st Qu.: 6.00 1st Qu.: 2.000
## Median : 6186 Median :220.0 Median :205 Median :10.00 Median : 5.000
## Mean : 5980 Mean :337.3 Mean :176 Mean :10.61 Mean : 6.745
## 3rd Qu.: 8993 3rd Qu.:552.0 3rd Qu.:249 3rd Qu.:15.00 3rd Qu.: 9.000
## Max. :12849 Max. :806.0 Max. :949 Max. :23.00 Max. :99.000
##
## sch _add tribe hght
## Min. :0.000 Min. :1.000 Min. :0.00000 Min. : 0.00
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:0.00000 1st Qu.: 0.00
## Median :1.000 Median :2.000 Median :0.00000 Median : 0.00
## Mean :1.189 Mean :1.906 Mean :0.08756 Mean : 15.33
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:0.00000 3rd Qu.: 0.00
## Max. :5.000 Max. :2.000 Max. :3.00000 Max. :985.00
##
## sleep dist baby alive
## Min. :0.0000 Min. :0.00e+00 Min. :0.000 Min. :0.000000
## 1st Qu.:1.0000 1st Qu.:0.00e+00 1st Qu.:2.000 1st Qu.:0.000000
## Median :1.0000 Median :0.00e+00 Median :2.000 Median :0.000000
## Mean :0.7863 Mean :7.96e-05 Mean :1.714 Mean :0.005442
## 3rd Qu.:1.0000 3rd Qu.:0.00e+00 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :2.0000 Max. :2.00e+00 Max. :2.000 Max. :2.000000
##
## datecoll coll sliden car1
## Min. :1970-10-26 Min. : 0 Min. : 0 Min. :0.000
## 1st Qu.:1971-12-01 1st Qu.: 1 1st Qu.: 14967 1st Qu.:1.000
## Median :1972-10-11 Median : 1 Median : 47182 Median :3.000
## Mean :1972-11-26 Mean : 138 Mean : 49504 Mean :2.263
## 3rd Qu.:1973-08-29 3rd Qu.: 1 3rd Qu.: 81473 3rd Qu.:3.000
## Max. :1976-12-10 Max. :19000000 Max. :117927 Max. :3.000
## NA's :7
## ncar1 car2 ncar2 filt
## Min. : 0.00 Min. :0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.: 0.000 1st Qu.:1.000
## Median : 0.00 Median :3.000 Median : 0.000 Median :3.000
## Mean : 85.08 Mean :2.298 Mean : 1.888 Mean :2.279
## 3rd Qu.: 0.00 3rd Qu.:3.000 3rd Qu.: 0.000 3rd Qu.:3.000
## Max. :30000.00 Max. :3.000 Max. :4200.000 Max. :3.000
##
## nfilt fever datep micro
## Min. :0.0e+00 Min. :0.000 Min. :1970-10-26 Min. : 0.000
## 1st Qu.:0.0e+00 1st Qu.:2.000 1st Qu.:1971-11-16 1st Qu.: 5.000
## Median :0.0e+00 Median :2.000 Median :1972-09-21 Median : 7.000
## Mean :7.3e-02 Mean :1.694 Mean :1972-11-13 Mean : 7.226
## 3rd Qu.:0.0e+00 3rd Qu.:2.000 3rd Qu.:1973-08-16 3rd Qu.:10.000
## Max. :1.0e+04 Max. :2.000 Max. :1976-01-31 Max. :90.000
## NA's :19703
## dateexam slide exam pfa
## Min. :1970-01-11 Min. : 0 Min. : 0.0 Min. : 0.00
## 1st Qu.:1971-09-29 1st Qu.: 14967 1st Qu.:200.0 1st Qu.: 0.00
## Median :1972-07-07 Median : 47182 Median :200.0 Median : 0.00
## Mean :1972-07-03 Mean : 49504 Mean :198.3 Mean : 16.05
## 3rd Qu.:1973-04-02 3rd Qu.: 81473 3rd Qu.:200.0 3rd Qu.: 3.00
## Max. :1976-05-06 Max. :117927 Max. :462.0 Max. :449.00
## NA's :38419
## pfg pm po ppfa
## Min. : 0.0000 Min. : 0.000 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.0000 Median : 0.000 Median : 0.0000 Median :0.0000
## Mean : 0.5428 Mean : 1.442 Mean : 0.1589 Mean :0.2982
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.:1.0000
## Max. :392.0000 Max. :400.000 Max. :396.0000 Max. :1.0000
##
## ppfg ppm ppo pf
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.: 0.00
## Median :0.00000 Median :0.00000 Median :0.000000 Median : 0.00
## Mean :0.08759 Mean :0.09235 Mean :0.009935 Mean : 16.38
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.: 3.00
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :449.00
##
## smove agec
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.00657 Mean :0.00419
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :7.00000
##
These are different functions, and they often call for different ways of formatting data
This is why we have computers
Convenient for users
Some amount of redundancy is good (to catch errors)
Often a “wide” format (related information on the same row)
Scanning for patterns and problems
Making tables for publication
Making data frames for statistical analysis
You don’t want any redundancy in your database (although you might for entry or analysis)
Removing redundancies is a great way to check data integrity
Non-redundant information can be updated conveniently
head(parasiteTable)
## # A tibble: 6 × 39 ## id village compound survey person sch `_add` tribe hght sleep dist ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 801 401 1 1 1 1 1 0 1 0 ## 2 1 801 401 2 1 1 2 0 0 1 0 ## 3 1 801 401 3 1 1 2 0 0 1 0 ## 4 1 801 401 4 1 1 2 0 0 1 0 ## 5 1 801 401 5 1 1 2 0 0 1 0 ## 6 1 801 401 6 1 1 2 0 0 1 0 ## # ℹ 28 more variables: baby <dbl>, alive <dbl>, datecoll <date>, coll <dbl>, ## # sliden <dbl>, car1 <dbl>, ncar1 <dbl>, car2 <dbl>, ncar2 <dbl>, filt <dbl>, ## # nfilt <dbl>, fever <dbl>, datep <date>, micro <dbl>, dateexam <date>, ## # slide <dbl>, exam <dbl>, pfa <dbl>, pfg <dbl>, pm <dbl>, po <dbl>, ## # ppfa <dbl>, ppfg <dbl>, ppm <dbl>, ppo <dbl>, pf <dbl>, smove <dbl>, ## # agec <dbl>
No redundancy
Break data into separate tables following the logic of the data
Clear “keys” to describe the information in each row
Break data into logical tables
print(sim)
## # A tibble: 1,001 × 4 ## time S I R ## <dbl> <dbl> <dbl> <dbl> ## 1 0 390000 100 389900 ## 2 0.1 389991 108 389901 ## 3 0.2 389981. 117. 389902. ## 4 0.3 389971. 126. 389903. ## 5 0.4 389959. 136. 389905. ## 6 0.5 389947. 147. 389906. ## 7 0.6 389934. 159. 389907. ## 8 0.7 389920. 171. 389909. ## 9 0.8 389904. 185. 389911. ## 10 0.9 389888. 200. 389912. ## # ℹ 991 more rows
gather has been replaced by pivot_longer

longsim <- pivot_longer(sim, cols=!time, names_to="class", values_to="people")
print(longsim)
## # A tibble: 3,003 × 3 ## time class people ## <dbl> <chr> <dbl> ## 1 0 S 390000 ## 2 0 I 100 ## 3 0 R 389900 ## 4 0.1 S 389991 ## 5 0.1 I 108 ## 6 0.1 R 389901 ## 7 0.2 S 389981. ## 8 0.2 I 117. ## 9 0.2 R 389902. ## 10 0.3 S 389971. ## # ℹ 2,993 more rows
spread has been replaced by pivot_wider

print(pivot_wider(longsim, names_from=class, values_from=people))
## # A tibble: 1,001 × 4 ## time S I R ## <dbl> <dbl> <dbl> <dbl> ## 1 0 390000 100 389900 ## 2 0.1 389991 108 389901 ## 3 0.2 389981. 117. 389902. ## 4 0.3 389971. 126. 389903. ## 5 0.4 389959. 136. 389905. ## 6 0.5 389947. 147. 389906. ## 7 0.6 389934. 159. 389907. ## 8 0.7 389920. 171. 389909. ## 9 0.8 389904. 185. 389911. ## 10 0.9 389888. 200. 389912. ## # ℹ 991 more rows
Put input data into a tidy, relational form
Clean, maintain and merge data in the database
Export files for analysis, presentation, etc. (a small sketch of this workflow follows)
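A minimal sketch of that workflow with the tables above, assuming village is the shared key (the output path is made up):

# Merge village-level information onto the individual-level observations
analysisTable <- (parasiteTable
    %>% left_join(villageTable, by="village")
)

# Export a flat file for analysis or presentation (hypothetical path)
write_csv(analysisTable, "../outputs/parasiteWithVillage.csv")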
villageTable <- (villageTable %>% mutate(village=as.factor(village)) )
summary(villageTable)
## area village grid fup
## Length:180 1 : 1 Length:180 Min. : 1.000
## Class :character 2 : 1 Class :character 1st Qu.: 3.000
## Mode :character 3 : 1 Mode :character Median : 6.000
## 4 : 1 Mean : 6.217
## 5 : 1 3rd Qu.: 7.000
## 6 : 1 Max. :92.000
## (Other):174
## vname vu grouped1 grouped2
## Length:180 Length:180 Min. : 0.00 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.0000
## Mode :character Mode :character Median : 24.50 Median : 0.0000
## Mean : 37.06 Mean : 0.7833
## 3rd Qu.: 52.25 3rd Qu.: 1.0000
## Max. :550.00 Max. :10.0000
##
## scattered1 scattered2 population
## Min. : 0.00 Min. :0.00000 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:0.00000 1st Qu.: 124.0
## Median : 6.00 Median :0.00000 Median : 222.0
## Mean : 14.29 Mean :0.01111 Mean : 287.1
## 3rd Qu.: 22.00 3rd Qu.:0.00000 3rd Qu.: 375.0
## Max. :117.00 Max. :1.00000 Max. :2556.0
##
print(villageTable %>% group_by(vname, vu) %>% summarize(count = n()) )
## `summarise()` has grouped output by 'vname'. You can override using the
## `.groups` argument.
## # A tibble: 180 × 3 ## # Groups: vname [178] ## vname vu count ## <chr> <chr> <int> ## 1 Adinis Fulanin Kar 1 ## 2 Aduwa Aju 1 ## 3 Ajura Aju 1 ## 4 Anju Dok 1 ## 5 Arindi Sug 1 ## 6 Asayaya Kit 1 ## 7 Ba'wa Dok 1 ## 8 Bakan Sabara Kar 1 ## 9 Bakan Sabara II Kar 1 ## 10 Balalashi Dok 1 ## # ℹ 170 more rows
print(villageTable %>% count(vname, vu, name="count") )
## # A tibble: 180 × 3 ## vname vu count ## <chr> <chr> <int> ## 1 Adinis Fulanin Kar 1 ## 2 Aduwa Aju 1 ## 3 Ajura Aju 1 ## 4 Anju Dok 1 ## 5 Arindi Sug 1 ## 6 Asayaya Kit 1 ## 7 Ba'wa Dok 1 ## 8 Bakan Sabara Kar 1 ## 9 Bakan Sabara II Kar 1 ## 10 Balalashi Dok 1 ## # ℹ 170 more rows
print(villageTable %>% count(vname, vu, name="count") %>% filter(count>1) )
## # A tibble: 0 × 3
## # ℹ 3 variables: vname <chr>, vu <chr>, count <int>
print(villageTable %>% count(vname, name="count") %>% filter(count>1) )
## # A tibble: 2 × 2
## vname count
## <chr> <int>
## 1 Daurawa 2
## 2 Sabon Gari 2
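To see the full records behind those repeated names, one option is dplyr's add_count() (a sketch):

print(villageTable
    %>% add_count(vname, name="count")
    %>% filter(count > 1)
)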
If we had data like the parasiteTable, we would want to break it into:
A table at the level of people
A table at the level of sampling events
A table at the level of measurements
This process leads to improved clarity and improved cleaning (a rough sketch follows below)
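A rough sketch of such a split, using distinct(); which columns belong at which level is a guess here, for illustration only:

# One row per person: attributes that should not change between surveys
personTable <- (parasiteTable
    %>% select(id, village, compound, tribe)
    %>% distinct()
)

# One row per sampling event: person crossed with survey, plus event-level fields
eventTable <- (parasiteTable
    %>% select(id, survey, datecoll, fever)
    %>% distinct()
)

# Measurement-level rows stay keyed by person (id) and survey
measureTable <- (parasiteTable
    %>% select(id, survey, pfa, pfg, pm, po)
)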
print(parasiteTable %>% count(id, village, compound, name="count") )
## # A tibble: 13,656 × 4 ## id village compound count ## <dbl> <dbl> <dbl> <int> ## 1 1 801 401 16 ## 2 2 801 401 16 ## 3 3 801 401 16 ## 4 4 801 401 16 ## 5 5 801 401 16 ## 6 6 801 401 16 ## 7 7 801 401 12 ## 8 8 801 401 11 ## 9 9 801 401 15 ## 10 9 801 417 1 ## # ℹ 13,646 more rows
print(parasiteTable %>% count(id, village, compound, name="count") %>% count(id, name="count") %>% filter(count>1) %>% arrange(desc(count)) )
## # A tibble: 756 × 2 ## id count ## <dbl> <int> ## 1 6571 4 ## 2 6572 4 ## 3 6573 4 ## 4 6574 4 ## 5 6575 4 ## 6 6576 4 ## 7 6577 4 ## 8 6578 4 ## 9 41 3 ## 10 1081 3 ## # ℹ 746 more rows
print(ggplot(villageTable, aes(x=population)) + geom_histogram() )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
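That message is a nudge to choose a binwidth deliberately rather than accept the default 30 bins; for example (the value 100 is arbitrary):

print(ggplot(villageTable, aes(x=population)) + geom_histogram(binwidth=100))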
Let’s say you find an obvious error in your data
## country continent count
## 1 Kenya Africa 23
## 2 Uganda Africa 31
## 3 Ungada Africa 1
## 4 Vietnam Asia 16
How should you fix it?
A patch table:
## country patchcountry
## 1 Ungada Uganda
Fix the problem as far “upstream” as possible
Ideally not in the count table (as I'm doing here, for illustration)
patchTab <- (ctab %>% left_join(cortab) )
## Joining with `by = join_by(country)`
print(patchTab)
## country continent count patchcountry
## 1 Kenya Africa 23 <NA>
## 2 Uganda Africa 31 <NA>
## 3 Ungada Africa 1 Uganda
## 4 Vietnam Asia 16 <NA>
NA means “Something I don’t know”. You don’t get the answers you expect if you compare things to NA:
"Jonathan" == NA
## [1] NA
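The reliable way to ask whether something is missing is is.na(), which is what the replacement step below relies on:

is.na(NA)          # TRUE
is.na("Jonathan")  # FALSE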
ctab <- (patchTab %>% mutate(country=ifelse(is.na(patchcountry), country, patchcountry)) %>% select(-patchcountry) )
print(ctab)
## country continent count
## 1 Kenya Africa 23
## 2 Uganda Africa 31
## 3 Uganda Africa 1
## 4 Vietnam Asia 16
ctab <- (patchTab %>% mutate(country=ifelse(is.na(patchcountry), country, as.character(patchcountry))) %>% select(-patchcountry) )
print(ctab)
## country continent count
## 1 Kenya Africa 23
## 2 Uganda Africa 31
## 3 Uganda Africa 1
## 4 Vietnam Asia 16
R likes to convert characters to factors
Factors can be useful, but they’re also often weird
Be aware of this as a possible problem
The more you use tools from the tidy family, the less factor weirdness you will encounter
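A classic example of the weirdness, in base R: as.numeric() on a factor returns the internal level codes, not the values you see printed:

f <- factor(c("5", "10", "20"))
as.numeric(f)                # 3 1 2: the level codes
as.numeric(as.character(f))  # 5 10 20: the values you probably wanted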
You might want to summarize a factor variable into categories that you define
In this case you could use a table like the one above (Ungada → Uganda), but with different logic
Print out all the codes for a particular variable, and see what you think of them
print(religTab %>% select(code) %>% distinct() )
## # A tibble: 62 × 1 ## code ## <chr> ## 1 Catholic ## 2 Orthodox ## 3 Roman catholic ## 4 Adventist ## 5 Anglican ## 6 Anglican Church ## 7 Apostolic Sect ## 8 Aventist ## 9 CCAP ## 10 Charismatic ## # ℹ 52 more rows
print(religTab %>% select(code) %>% distinct() %>% filter(grepl("^M", code)) )
## # A tibble: 9 × 1 ## code ## <chr> ## 1 Methodist ## 2 Moslem ## 3 Muslem ## 4 Mulsim ## 5 Muslim ## 6 Muslum ## 7 Musualmane ## 8 Mana ## 9 Mungu
You should start by editing the output from your previous step
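One concrete way to do that (a sketch; rawTab and the file names are made up) is to write the distinct codes out, add a category column by hand, and read the lookup table back in:

# Export the codes observed in the raw data (rawTab is a stand-in name)
rawTab %>% select(code) %>% distinct() %>% write_csv("religionCodes.csv")

# After adding a `category` column by hand, read the lookup table back in
religTab <- read_csv("religionCodesCategorized.csv")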
print(religTab)
## # A tibble: 62 × 2 ## category code ## <chr> <chr> ## 1 Catholic/Orthodox Catholic ## 2 Catholic/Orthodox Orthodox ## 3 Catholic/Orthodox Roman catholic ## 4 Christian Adventist ## 5 Christian Anglican ## 6 Christian Anglican Church ## 7 Christian Apostolic Sect ## 8 Christian Aventist ## 9 Christian CCAP ## 10 Christian Charismatic ## # ℹ 52 more rows
Start the same way as before.
## person religion
## 1 Ntombikayise Mulsim
## 2 Gao Zhong Anglican
## 3 Mafalda Angler
## 4 Bill Roman catholic
patchTab <- (churchTab %>% left_join(religTab, by=c("religion"="code")) ) %>% print()
## person religion category
## 1 Ntombikayise Mulsim Muslim
## 2 Gao Zhong Anglican Christian
## 3 Mafalda Angler <NA>
## 4 Bill Roman catholic Catholic/Orthodox
But use different logic for replacement
catTab <- (patchTab %>% mutate(religion=category) %>% select(-category) ) %>% print()
## person religion
## 1 Ntombikayise Muslim
## 2 Gao Zhong Christian
## 3 Mafalda <NA>
## 4 Bill Catholic/Orthodox
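After a recode like this, it's worth checking for codes that found no match in the lookup table (Mafalda's "Angler" above), for example:

# Rows whose original code was not in religTab end up as NA
print(catTab %>% filter(is.na(religion)))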