Introduce some principles of data management
Introduce some practical, modern tools
Show some examples
Data must be:
Organized
Cleaned
Manipulated
Analyzed
The first three often take the most time, largely because researchers undervalue them.
There’s never time to do it right … but there’s always time to do it over!
library(tidyverse)
villageTable <- read_csv("../data/village.csv")
## Rows: 180 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): area, grid, vname, vu
## dbl (7): village, fup, grouped1, grouped2, scattered1, scattered2, population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
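That message is worth taking seriously: declaring column types explicitly turns silent parsing changes into loud errors. A minimal sketch (the type choices are read off the output above, not checked against the raw file):

villageTable <- read_csv("../data/village.csv",
    col_types = cols(
        area = col_character(), grid = col_character(),
        vname = col_character(), vu = col_character(),
        .default = col_double()   # the remaining columns are numeric
    )
)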
summary(villageTable)
## area village grid fup
## Length:180 Min. : 1.0 Length:180 Min. : 1.000
## Class :character 1st Qu.:211.2 Class :character 1st Qu.: 3.000
## Mode :character Median :362.5 Mode :character Median : 6.000
## Mean :402.0 Mean : 6.217
## 3rd Qu.:565.2 3rd Qu.: 7.000
## Max. :967.0 Max. :92.000
## vname vu grouped1 grouped2
## Length:180 Length:180 Min. : 0.00 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.0000
## Mode :character Mode :character Median : 24.50 Median : 0.0000
## Mean : 37.06 Mean : 0.7833
## 3rd Qu.: 52.25 3rd Qu.: 1.0000
## Max. :550.00 Max. :10.0000
## scattered1 scattered2 population
## Min. : 0.00 Min. :0.00000 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:0.00000 1st Qu.: 124.0
## Median : 6.00 Median :0.00000 Median : 222.0
## Mean : 14.29 Mean :0.01111 Mean : 287.1
## 3rd Qu.: 22.00 3rd Qu.:0.00000 3rd Qu.: 375.0
## Max. :117.00 Max. :1.00000 Max. :2556.0
parasiteTable <- read_csv("../data/parademo.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 138197 Columns: 39
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (36): id, village, compound, survey, person, sch, _add, tribe, hght, sl...
## date (3): datecoll, datep, dateexam
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
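The warning above is worth following up: readr's problems() returns one row per parsing issue, so you can see which columns and rows are affected (a sketch):

parseProblems <- problems(parasiteTable)
print(parseProblems %>% count(col, expected))   # which columns, and what was expected there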
summary(parasiteTable)
## id village compound survey person
## Min. : 1 Min. : 51.0 Min. : 1 Min. : 1.00 Min. : 1.000
## 1st Qu.: 2671 1st Qu.:202.0 1st Qu.: 40 1st Qu.: 6.00 1st Qu.: 2.000
## Median : 6186 Median :220.0 Median :205 Median :10.00 Median : 5.000
## Mean : 5980 Mean :337.3 Mean :176 Mean :10.61 Mean : 6.745
## 3rd Qu.: 8993 3rd Qu.:552.0 3rd Qu.:249 3rd Qu.:15.00 3rd Qu.: 9.000
## Max. :12849 Max. :806.0 Max. :949 Max. :23.00 Max. :99.000
##
## sch _add tribe hght
## Min. :0.000 Min. :1.000 Min. :0.00000 Min. : 0.00
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:0.00000 1st Qu.: 0.00
## Median :1.000 Median :2.000 Median :0.00000 Median : 0.00
## Mean :1.189 Mean :1.906 Mean :0.08756 Mean : 15.33
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:0.00000 3rd Qu.: 0.00
## Max. :5.000 Max. :2.000 Max. :3.00000 Max. :985.00
##
## sleep dist baby alive
## Min. :0.0000 Min. :0.00e+00 Min. :0.000 Min. :0.000000
## 1st Qu.:1.0000 1st Qu.:0.00e+00 1st Qu.:2.000 1st Qu.:0.000000
## Median :1.0000 Median :0.00e+00 Median :2.000 Median :0.000000
## Mean :0.7863 Mean :7.96e-05 Mean :1.714 Mean :0.005442
## 3rd Qu.:1.0000 3rd Qu.:0.00e+00 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :2.0000 Max. :2.00e+00 Max. :2.000 Max. :2.000000
##
## datecoll coll sliden car1
## Min. :1970-10-26 Min. : 0 Min. : 0 Min. :0.000
## 1st Qu.:1971-12-01 1st Qu.: 1 1st Qu.: 14967 1st Qu.:1.000
## Median :1972-10-11 Median : 1 Median : 47182 Median :3.000
## Mean :1972-11-26 Mean : 138 Mean : 49504 Mean :2.263
## 3rd Qu.:1973-08-29 3rd Qu.: 1 3rd Qu.: 81473 3rd Qu.:3.000
## Max. :1976-12-10 Max. :19000000 Max. :117927 Max. :3.000
## NA's :7
## ncar1 car2 ncar2 filt
## Min. : 0.00 Min. :0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.: 0.000 1st Qu.:1.000
## Median : 0.00 Median :3.000 Median : 0.000 Median :3.000
## Mean : 85.08 Mean :2.298 Mean : 1.888 Mean :2.279
## 3rd Qu.: 0.00 3rd Qu.:3.000 3rd Qu.: 0.000 3rd Qu.:3.000
## Max. :30000.00 Max. :3.000 Max. :4200.000 Max. :3.000
##
## nfilt fever datep micro
## Min. :0.0e+00 Min. :0.000 Min. :1970-10-26 Min. : 0.000
## 1st Qu.:0.0e+00 1st Qu.:2.000 1st Qu.:1971-11-16 1st Qu.: 5.000
## Median :0.0e+00 Median :2.000 Median :1972-09-21 Median : 7.000
## Mean :7.3e-02 Mean :1.694 Mean :1972-11-13 Mean : 7.226
## 3rd Qu.:0.0e+00 3rd Qu.:2.000 3rd Qu.:1973-08-16 3rd Qu.:10.000
## Max. :1.0e+04 Max. :2.000 Max. :1976-01-31 Max. :90.000
## NA's :19703
## dateexam slide exam pfa
## Min. :1970-01-11 Min. : 0 Min. : 0.0 Min. : 0.00
## 1st Qu.:1971-09-29 1st Qu.: 14967 1st Qu.:200.0 1st Qu.: 0.00
## Median :1972-07-07 Median : 47182 Median :200.0 Median : 0.00
## Mean :1972-07-03 Mean : 49504 Mean :198.3 Mean : 16.05
## 3rd Qu.:1973-04-02 3rd Qu.: 81473 3rd Qu.:200.0 3rd Qu.: 3.00
## Max. :1976-05-06 Max. :117927 Max. :462.0 Max. :449.00
## NA's :38419
## pfg pm po ppfa
## Min. : 0.0000 Min. : 0.000 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.0000 Median : 0.000 Median : 0.0000 Median :0.0000
## Mean : 0.5428 Mean : 1.442 Mean : 0.1589 Mean :0.2982
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.:1.0000
## Max. :392.0000 Max. :400.000 Max. :396.0000 Max. :1.0000
##
## ppfg ppm ppo pf
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.: 0.00
## Median :0.00000 Median :0.00000 Median :0.000000 Median : 0.00
## Mean :0.08759 Mean :0.09235 Mean :0.009935 Mean : 16.38
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.: 3.00
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :449.00
##
## smove agec
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.00657 Mean :0.00419
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :7.00000
##
These are different functions, and they often call for different ways of formatting data
This is why we have computers
Convenient for users
Some amount of redundancy is good (to catch errors)
Often a “wide” format (related information on the same row)
Scanning for patterns and problems
Making tables for publication
Making data frames for statistical analysis
You don’t want any redundancy in your database (although you might for entry or analysis)
Removing redundancies is a great way to check data integrity
Non-redundant information can be updated conveniently
head(parasiteTable)
## # A tibble: 6 × 39 ## id village compound survey person sch `_add` tribe hght sleep dist ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 801 401 1 1 1 1 1 0 1 0 ## 2 1 801 401 2 1 1 2 0 0 1 0 ## 3 1 801 401 3 1 1 2 0 0 1 0 ## 4 1 801 401 4 1 1 2 0 0 1 0 ## 5 1 801 401 5 1 1 2 0 0 1 0 ## 6 1 801 401 6 1 1 2 0 0 1 0 ## # ℹ 28 more variables: baby <dbl>, alive <dbl>, datecoll <date>, coll <dbl>, ## # sliden <dbl>, car1 <dbl>, ncar1 <dbl>, car2 <dbl>, ncar2 <dbl>, filt <dbl>, ## # nfilt <dbl>, fever <dbl>, datep <date>, micro <dbl>, dateexam <date>, ## # slide <dbl>, exam <dbl>, pfa <dbl>, pfg <dbl>, pm <dbl>, po <dbl>, ## # ppfa <dbl>, ppfg <dbl>, ppm <dbl>, ppo <dbl>, pf <dbl>, smove <dbl>, ## # agec <dbl>
No redundancy
Break data into separate tables following the logic of the data
Clear “keys” to describe the information in each row
Break data into logical tables
print(sim)
## # A tibble: 1,001 × 4 ## time S I R ## <dbl> <dbl> <dbl> <dbl> ## 1 0 390000 100 389900 ## 2 0.1 389991 108 389901 ## 3 0.2 389981. 117. 389902. ## 4 0.3 389971. 126. 389903. ## 5 0.4 389959. 136. 389905. ## 6 0.5 389947. 147. 389906. ## 7 0.6 389934. 159. 389907. ## 8 0.7 389920. 171. 389909. ## 9 0.8 389904. 185. 389911. ## 10 0.9 389888. 200. 389912. ## # ℹ 991 more rows
gather has been replaced by pivot_longer

longsim <- pivot_longer(sim, cols=!time, names_to="class", values_to="people")
print(longsim)
## # A tibble: 3,003 × 3 ## time class people ## <dbl> <chr> <dbl> ## 1 0 S 390000 ## 2 0 I 100 ## 3 0 R 389900 ## 4 0.1 S 389991 ## 5 0.1 I 108 ## 6 0.1 R 389901 ## 7 0.2 S 389981. ## 8 0.2 I 117. ## 9 0.2 R 389902. ## 10 0.3 S 389971. ## # ℹ 2,993 more rows
spread has been replaced by pivot_wider

print(pivot_wider(longsim, names_from=class, values_from=people))
## # A tibble: 1,001 × 4 ## time S I R ## <dbl> <dbl> <dbl> <dbl> ## 1 0 390000 100 389900 ## 2 0.1 389991 108 389901 ## 3 0.2 389981. 117. 389902. ## 4 0.3 389971. 126. 389903. ## 5 0.4 389959. 136. 389905. ## 6 0.5 389947. 147. 389906. ## 7 0.6 389934. 159. 389907. ## 8 0.7 389920. 171. 389909. ## 9 0.8 389904. 185. 389911. ## 10 0.9 389888. 200. 389912. ## # ℹ 991 more rows
Put input data into a tidy, relational form
Clean, maintain and merge data in the database
Export files for analysis, presentation, etc. (a small sketch of this workflow follows)
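A minimal sketch of that workflow with the tables above, assuming village is the shared key (the output path is made up):

# Merge village-level information onto the individual-level observations
analysisTable <- (parasiteTable
    %>% left_join(villageTable, by="village")
)

# Export a flat file for analysis or presentation (hypothetical path)
write_csv(analysisTable, "../outputs/parasiteWithVillage.csv")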
villageTable <- (villageTable %>% mutate(village=as.factor(village)) )
summary(villageTable)
## area village grid fup
## Length:180 1 : 1 Length:180 Min. : 1.000
## Class :character 2 : 1 Class :character 1st Qu.: 3.000
## Mode :character 3 : 1 Mode :character Median : 6.000
## 4 : 1 Mean : 6.217
## 5 : 1 3rd Qu.: 7.000
## 6 : 1 Max. :92.000
## (Other):174
## vname vu grouped1 grouped2
## Length:180 Length:180 Min. : 0.00 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.0000
## Mode :character Mode :character Median : 24.50 Median : 0.0000
## Mean : 37.06 Mean : 0.7833
## 3rd Qu.: 52.25 3rd Qu.: 1.0000
## Max. :550.00 Max. :10.0000
##
## scattered1 scattered2 population
## Min. : 0.00 Min. :0.00000 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:0.00000 1st Qu.: 124.0
## Median : 6.00 Median :0.00000 Median : 222.0
## Mean : 14.29 Mean :0.01111 Mean : 287.1
## 3rd Qu.: 22.00 3rd Qu.:0.00000 3rd Qu.: 375.0
## Max. :117.00 Max. :1.00000 Max. :2556.0
##
print(villageTable %>% group_by(vname, vu) %>% summarize(count = n()) )
## `summarise()` has grouped output by 'vname'. You can override using the
## `.groups` argument.
## # A tibble: 180 × 3 ## # Groups: vname [178] ## vname vu count ## <chr> <chr> <int> ## 1 Adinis Fulanin Kar 1 ## 2 Aduwa Aju 1 ## 3 Ajura Aju 1 ## 4 Anju Dok 1 ## 5 Arindi Sug 1 ## 6 Asayaya Kit 1 ## 7 Ba'wa Dok 1 ## 8 Bakan Sabara Kar 1 ## 9 Bakan Sabara II Kar 1 ## 10 Balalashi Dok 1 ## # ℹ 170 more rows
print(villageTable %>% count(vname, vu, name="count") )
## # A tibble: 180 × 3 ## vname vu count ## <chr> <chr> <int> ## 1 Adinis Fulanin Kar 1 ## 2 Aduwa Aju 1 ## 3 Ajura Aju 1 ## 4 Anju Dok 1 ## 5 Arindi Sug 1 ## 6 Asayaya Kit 1 ## 7 Ba'wa Dok 1 ## 8 Bakan Sabara Kar 1 ## 9 Bakan Sabara II Kar 1 ## 10 Balalashi Dok 1 ## # ℹ 170 more rows
print(villageTable %>% count(vname, vu, name="count") %>% filter(count>1) )
## # A tibble: 0 × 3
## # ℹ 3 variables: vname <chr>, vu <chr>, count <int>
print(villageTable %>% count(vname, name="count") %>% filter(count>1) )
## # A tibble: 2 × 2
## vname count
## <chr> <int>
## 1 Daurawa 2
## 2 Sabon Gari 2
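To see the full records behind those repeated names, one option is dplyr's add_count() (a sketch):

print(villageTable
    %>% add_count(vname, name="count")
    %>% filter(count > 1)
)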
If we had data like the parasiteTable, we would want to break it into:
A table at the level of people
A table at the level of sampling events
A table at the level of measurements
This process leads to improved clarity and improved cleaning (a rough sketch follows below)
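A rough sketch of such a split, using distinct(); which columns belong at which level is a guess here, for illustration only:

# One row per person: attributes that should not change between surveys
personTable <- (parasiteTable
    %>% select(id, village, compound, tribe)
    %>% distinct()
)

# One row per sampling event: person crossed with survey, plus event-level fields
eventTable <- (parasiteTable
    %>% select(id, survey, datecoll, fever)
    %>% distinct()
)

# Measurement-level rows stay keyed by person (id) and survey
measureTable <- (parasiteTable
    %>% select(id, survey, pfa, pfg, pm, po)
)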
print(parasiteTable %>% count(id, village, compound, name="count") )
## # A tibble: 13,656 × 4 ## id village compound count ## <dbl> <dbl> <dbl> <int> ## 1 1 801 401 16 ## 2 2 801 401 16 ## 3 3 801 401 16 ## 4 4 801 401 16 ## 5 5 801 401 16 ## 6 6 801 401 16 ## 7 7 801 401 12 ## 8 8 801 401 11 ## 9 9 801 401 15 ## 10 9 801 417 1 ## # ℹ 13,646 more rows
print(parasiteTable %>% count(id, village, compound, name="count") %>% count(id, name="count") %>% filter(count>1) %>% arrange(desc(count)) )
## # A tibble: 756 × 2 ## id count ## <dbl> <int> ## 1 6571 4 ## 2 6572 4 ## 3 6573 4 ## 4 6574 4 ## 5 6575 4 ## 6 6576 4 ## 7 6577 4 ## 8 6578 4 ## 9 41 3 ## 10 1081 3 ## # ℹ 746 more rows
print(ggplot(villageTable, aes(x=population)) + geom_histogram() )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
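That message is a nudge to choose a binwidth deliberately rather than accept the default 30 bins; for example (the value 100 is arbitrary):

print(ggplot(villageTable, aes(x=population)) + geom_histogram(binwidth=100))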
Let’s say you find an obvious error in your data
## country continent count
## 1 Kenya Africa 23
## 2 Uganda Africa 31
## 3 Ungada Africa 1
## 4 Vietnam Asia 16
How should you fix it?
A patch table:
## country patchcountry
## 1 Ungada Uganda
Fix the problem as far “upstream” as possible
Ideally not in the count table (as I'm doing here, for illustration)
patchTab <- (ctab %>% left_join(cortab) )
## Joining with `by = join_by(country)`
print(patchTab)
## country continent count patchcountry
## 1 Kenya Africa 23 <NA>
## 2 Uganda Africa 31 <NA>
## 3 Ungada Africa 1 Uganda
## 4 Vietnam Asia 16 <NA>
NA means “Something I don’t know”. You don’t get the answers you expect if you compare things to NA:
"Jonathan" == NA
## [1] NA
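The reliable way to ask whether something is missing is is.na(), which is what the replacement step below relies on:

is.na(NA)          # TRUE
is.na("Jonathan")  # FALSE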
ctab <- (patchTab %>% mutate(country=ifelse(is.na(patchcountry), country, patchcountry)) %>% select(-patchcountry) )
print(ctab)
## country continent count
## 1 Kenya Africa 23
## 2 Uganda Africa 31
## 3 Uganda Africa 1
## 4 Vietnam Asia 16
ctab <- (patchTab %>% mutate(country=ifelse(is.na(patchcountry), country, as.character(patchcountry))) %>% select(-patchcountry) )
print(ctab)
## country continent count
## 1 Kenya Africa 23
## 2 Uganda Africa 31
## 3 Uganda Africa 1
## 4 Vietnam Asia 16
R likes to convert characters to factors
Factors can be useful, but they’re also often weird
Be aware of this as a possible problem
The more you use tools from the tidy family, the less factor weirdness you will encounter
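A classic example of the weirdness, in base R: as.numeric() on a factor returns the internal level codes, not the values you see printed:

f <- factor(c("5", "10", "20"))
as.numeric(f)                # 3 1 2: the level codes
as.numeric(as.character(f))  # 5 10 20: the values you probably wanted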
You might want to summarize a factor variable into categories that you define
In this case you could use a table like the one above (Ungada → Uganda), but with different logic
Print out all the codes for a particular variable, and see what you think of them
print(religTab %>% select(code) %>% distinct() )
## # A tibble: 62 × 1 ## code ## <chr> ## 1 Catholic ## 2 Orthodox ## 3 Roman catholic ## 4 Adventist ## 5 Anglican ## 6 Anglican Church ## 7 Apostolic Sect ## 8 Aventist ## 9 CCAP ## 10 Charismatic ## # ℹ 52 more rows
print(religTab %>% select(code) %>% distinct() %>% filter(grepl("^M", code)) )
## # A tibble: 9 × 1 ## code ## <chr> ## 1 Methodist ## 2 Moslem ## 3 Muslem ## 4 Mulsim ## 5 Muslim ## 6 Muslum ## 7 Musualmane ## 8 Mana ## 9 Mungu
You should start by editing the output from your previous step
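One concrete way to do that (a sketch; rawTab and the file names are made up) is to write the distinct codes out, add a category column by hand, and read the lookup table back in:

# Export the codes observed in the raw data (rawTab is a stand-in name)
rawTab %>% select(code) %>% distinct() %>% write_csv("religionCodes.csv")

# After adding a `category` column by hand, read the lookup table back in
religTab <- read_csv("religionCodesCategorized.csv")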
print(religTab)
## # A tibble: 62 × 2 ## category code ## <chr> <chr> ## 1 Catholic/Orthodox Catholic ## 2 Catholic/Orthodox Orthodox ## 3 Catholic/Orthodox Roman catholic ## 4 Christian Adventist ## 5 Christian Anglican ## 6 Christian Anglican Church ## 7 Christian Apostolic Sect ## 8 Christian Aventist ## 9 Christian CCAP ## 10 Christian Charismatic ## # ℹ 52 more rows
Start the same way as before.
## person religion
## 1 Ntombikayise Mulsim
## 2 Gao Zhong Anglican
## 3 Mafalda Angler
## 4 Bill Roman catholic
patchTab <- (churchTab %>% left_join(religTab, by=c("religion"="code")) ) %>% print()
## person religion category
## 1 Ntombikayise Mulsim Muslim
## 2 Gao Zhong Anglican Christian
## 3 Mafalda Angler <NA>
## 4 Bill Roman catholic Catholic/Orthodox
But use different logic for replacement
catTab <- (patchTab %>% mutate(religion=category) %>% select(-category) ) %>% print()
## person religion
## 1 Ntombikayise Muslim
## 2 Gao Zhong Christian
## 3 Mafalda <NA>
## 4 Bill Catholic/Orthodox
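After a recode like this, it's worth checking for codes that found no match in the lookup table (Mafalda's "Angler" above), for example:

# Rows whose original code was not in religTab end up as NA
print(catTab %>% filter(is.na(religion)))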