EDA automation with doctr

Caio Lente

2017-03-06

About

doctr is an R package that helps you check the consistency and the quality of data.

In other words, the goal of the package is to automate as much as possible the task of verifying that everything is OK with a dataset. Like a real doctor, it has functions for examining, diagnosing and assessing the progress of its “patients”.

Since doctr was created with the Tidy Tools Manifesto in mind, it works perfectly alongside the tidyverse.

Exploring datasets with doctr

One of doctr’s main functions is examine(), which computes summary statistics for every column of a table, varying the summarization strategy depending on the type of the variable.

After running examine(), we can use the report_*() family of functions to get the different types of reports back: report_num() is used for numeric variables, report_txt() for text variables, and report_fct() for factor variables.

Let’s see how this works with an example dataset: ggplot2::mpg. For the sake of this example, I’m going to transform the class column into a factor.

library(doctr)
mpg <- ggplot2::mpg
# Converting class to factor
mpg$class <- as.factor(mpg$class)

Now we have 3 main types of variables represented in this table: numeric, text and factor. When we run examine(), the function treats each column differently depending on which of these groups it fits into; if it can’t classify the column, examine() always defaults to text.

# Creating the EDA
eda <- examine(mpg)
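
The type-based dispatch described above can be illustrated with a minimal base R sketch. This is not doctr's actual implementation: classify_column() and the toy data frame are hypothetical names made up for this example, and the only behavior taken from the text is that anything that is neither numeric nor factor falls back to text.

```r
# Hypothetical helper (not part of doctr): assigns each column to one of
# the three groups, falling back to "text" when nothing else matches
classify_column <- function(x) {
  if (is.factor(x)) {
    "factor"
  } else if (is.numeric(x)) {
    "numeric"
  } else {
    "text"
  }
}

# Toy table with one column of each kind, plus a logical column that,
# in this sketch, ends up in the fallback group
toy <- data.frame(
  displ = c(1.8, 2.0, 3.1),
  model = c("a4", "a4", "a6"),
  class = factor(c("compact", "compact", "midsize")),
  flag  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# One label per column, named after the column
types <- vapply(toy, classify_column, character(1))
print(types)
```

Checking is.factor() before is.numeric() is only a stylistic choice here, since is.numeric() already returns FALSE for factors.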

With the eda object we can get all 3 exploratory analyses.

# Getting report of numeric variables
report_num(eda)
## # A tibble: 5 × 26
##    name   len    min   max   `1%`   `5%`  `10%`  `20%`  `30%`  `40%`
##   <chr> <int>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 displ   234    1.6     7    1.6    1.8    2.0    2.2    2.5    2.8
## 2  year   234 1999.0  2008 1999.0 1999.0 1999.0 1999.0 1999.0 1999.0
## 3   cyl   234    4.0     8    4.0    4.0    4.0    4.0    4.0    6.0
## 4   cty   234    9.0    35    9.0   11.0   11.0   13.0   14.0   15.0
## 5   hwy   234   12.0    44   12.0   15.0   16.3   17.0   19.0   22.0
## # ... with 16 more variables: `50%` <dbl>, `60%` <dbl>, `70%` <dbl>,
## #   `80%` <dbl>, `90%` <dbl>, `95%` <dbl>, `99%` <dbl>, mean <dbl>,
## #   sd <dbl>, na <dbl>, val <dbl>, neg <dbl>, zero <dbl>, pos <dbl>,
## #   unq <int>, mdp <dbl>
# Getting report of text variables
report_txt(eda)
## # A tibble: 5 × 25
##           name   len   min   max  `1%`  `5%` `10%` `20%` `30%` `40%` `50%`
##          <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 manufacturer   234     4    10     4   4.0     4     4     5     5     6
## 2        model   234     2    22     2   4.3     5     6     7    10    11
## 3        trans   234     8    10     8   8.0     8     8     8     8     8
## 4          drv   234     1     1     1   1.0     1     1     1     1     1
## 5           fl   234     1     1     1   1.0     1     1     1     1     1
## # ... with 14 more variables: `60%` <dbl>, `70%` <dbl>, `80%` <dbl>,
## #   `90%` <dbl>, `95%` <dbl>, `99%` <dbl>, mean <dbl>, sd <dbl>, na <dbl>,
## #   val <dbl>, unq <int>, asc <dbl>, ltr <dbl>, num <dbl>
# Getting report of factor variables
report_fct(eda)
## # A tibble: 7 × 4
##    name       data   cnt        frq
##   <chr>     <fctr> <int>      <dbl>
## 1 class    2seater     5 0.02136752
## 2 class    compact    47 0.20085470
## 3 class    midsize    41 0.17521368
## 4 class    minivan    11 0.04700855
## 5 class     pickup    33 0.14102564
## 6 class subcompact    35 0.14957265
## 7 class        suv    62 0.26495726

The tables produced are very wide, so I won’t show them here in full. The column names in the reports are codes for each summary statistic; here’s what each of them means and in which reports it comes up:

| column         | numeric | text | factor | description                                                  |
|----------------|---------|------|--------|--------------------------------------------------------------|
| name           | x       | x    | x      | name of the variable                                         |
| min, max       | x       | x    |        | minimum and maximum value/length                             |
| 1%, …, 99%     | x       | x    |        | value/length percentiles                                     |
| mean           | x       | x    |        | mean value/length                                            |
| sd             | x       | x    |        | value/length standard deviation                              |
| na, val        | x       | x    |        | percentage of missing and non-missing entries                |
| neg, zero, pos | x       |      |        | percentage of negative, zero and positive values             |
| unq            | x       | x    |        | count of unique values/texts                                 |
| mdp            | x       |      |        | maximum number of decimal places                             |
| asc            |         | x    |        | equals 1 if the text is identified as ASCII                  |
| ltr, num       |         | x    |        | percentage of text that is identified as letters and numbers |
| data           |         |      | x      | each factor level                                            |
| cnt, frq       |         |      | x      | count and frequency of each level                            |
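
To make some of these codes more concrete, here is a rough base R sketch of how a few of them could be computed by hand. It is not how doctr computes them: summarise_numeric() is a hypothetical helper, the toy vectors are made up, and expressing the percentages as fractions of all entries (missing ones included) is an assumption of this sketch.

```r
# Hypothetical helper (not part of doctr): a few of the numeric codes
summarise_numeric <- function(x) {
  n  <- length(x)
  ok <- x[!is.na(x)]  # non-missing entries
  c(
    min  = min(ok),
    max  = max(ok),
    quantile(ok, 0.5),            # contributes an element named "50%"
    mean = mean(ok),
    sd   = sd(ok),
    na   = sum(is.na(x)) / n,     # assumption: fraction of all entries
    val  = length(ok) / n,
    neg  = sum(ok < 0) / n,
    zero = sum(ok == 0) / n,
    pos  = sum(ok > 0) / n,
    unq  = length(unique(ok))
  )
}

x <- c(-2, 0, 0, 3, 5, NA, 3, 1)
stats <- summarise_numeric(x)
print(stats)

# For text columns the same statistics apply to string lengths
text_len <- nchar(c("audi", "a4 quattro"))

# Factor codes: count (cnt) and frequency (frq) of each level
class <- factor(c("compact", "suv", "suv", "midsize", "suv", "compact"))
counts <- table(class)
fct_report <- data.frame(
  name = "class",
  data = names(counts),
  cnt  = as.integer(counts),
  frq  = as.integer(counts) / length(class)
)
print(fct_report)
```

The frq column always sums to 1 within a variable, which is a quick sanity check on any factor report.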

Grouping

As with a group_by() statement, it is also possible to split the table before getting the EDA. We do this with the group argument of examine() and then collect the results with the same argument of the report_*() family.

# Creating the EDA (grouped by the class variable)
eda <- examine(mpg, group = "class")

For examine(), group receives the name or index of a column. When collecting the reports, group receives the level of the grouped variable from which we want the results.

# Getting report of numeric variables for compact cars
report_num(eda, group = "compact")
## # A tibble: 5 × 26
##    name   len    min    max    `1%`   `5%`  `10%`   `20%` `30%` `40%`
##   <chr> <int>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl>
## 1 displ    47    1.8    3.3    1.80    1.8    1.8    1.92     2     2
## 2  year    47 1999.0 2008.0 1999.00 1999.0 1999.0 1999.00  1999  1999
## 3   cyl    47    4.0    6.0    4.00    4.0    4.0    4.00     4     4
## 4   cty    47   15.0   33.0   15.00   16.0   16.6   18.00    18    19
## 5   hwy    47   23.0   44.0   23.46   24.3   25.0   25.20    26    27
## # ... with 16 more variables: `50%` <dbl>, `60%` <dbl>, `70%` <dbl>,
## #   `80%` <dbl>, `90%` <dbl>, `95%` <dbl>, `99%` <dbl>, mean <dbl>,
## #   sd <dbl>, na <dbl>, val <dbl>, neg <dbl>, zero <dbl>, pos <dbl>,
## #   unq <int>, mdp <dbl>
# Getting report of text variables for SUVs
report_txt(eda, group = "suv")
## # A tibble: 5 × 25
##           name   len   min   max  `1%`  `5%` `10%` `20%` `30%` `40%` `50%`
##          <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 manufacturer    62     4    10     4     4     4     4     5     6     6
## 2        model    62    11    22    11    11    11    11    12    12    13
## 3        trans    62     8    10     8     8     8     8     8     8     8
## 4          drv    62     1     1     1     1     1     1     1     1     1
## 5           fl    62     1     1     1     1     1     1     1     1     1
## # ... with 14 more variables: `60%` <dbl>, `70%` <dbl>, `80%` <dbl>,
## #   `90%` <dbl>, `95%` <dbl>, `99%` <dbl>, mean <dbl>, sd <dbl>, na <dbl>,
## #   val <dbl>, unq <int>, asc <dbl>, ltr <dbl>, num <dbl>
# Getting report of factor variables for midsize cars
report_fct(eda, group = "midsize")
## # A tibble: 1 × 4
##    name    data   cnt   frq
##   <chr>  <fctr> <int> <dbl>
## 1 class midsize    41     1
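
Conceptually, a grouped EDA amounts to splitting the table by the levels of one column and summarizing each piece separately. The base R sketch below illustrates that idea only; it makes no claim about how examine() implements grouping internally, and the toy data frame and variable names are made up for the example.

```r
# Toy table: a grouping factor and one numeric column
cars <- data.frame(
  class = factor(c("compact", "compact", "suv", "suv", "suv")),
  hwy   = c(29, 31, 20, 18, 22)
)

# split() returns a named list with one data frame per level of class
by_class <- split(cars, cars$class)

# Summarize each piece on its own; here just the mean of hwy
group_means <- vapply(by_class, function(d) mean(d$hwy), numeric(1))

# Picking one level afterwards is analogous to passing group = "compact"
# to a report_*() function
compact_mean <- group_means[["compact"]]
print(group_means)
```

Within a single piece the grouping column holds only one distinct value, which is consistent with the factor report for midsize cars above having a single row with frq equal to 1.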