Dataset diagnostics with doctr

Caio Lente

2017-03-06

About

doctr is an R package that helps you check the consistency and the quality of data.

In other words, the goal of the package is to automate, as far as possible, the task of checking that everything is OK with a dataset. Like a real doctor, it has functions for examining, diagnosing and assessing the progress of its “patients”.

Since doctr was created with the Tidy Tools Manifesto in mind, it works well alongside the tidyverse.

Creating dataset diagnostics with doctr

One of doctr’s main functions is diagnose(), which runs tests (nicknamed “exams”) on a table to check whether its variables meet certain standards and fit certain assumptions.

After running diagnose(), we can use the issues() function to get a report about the results of the exams.

Let’s see how this works with an example dataset: ggplot2::mpg.

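# Loading doctr, plus ggplot2 (for mpg) and dplyr (for mutate() and %>%)
library(doctr)
library(ggplot2)
library(dplyr)

# Running exams on table
diagnostics <- diagnose(mpg)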

Now the diagnostics object contains all the errors found while diagnosing the mpg dataset. By using issues() we can get human-readable reports on these errors.

# Getting summary of diagnostics
issues(diagnostics)
## No issues found in 'manufacturer'
## No issues found in 'model'
## No issues found in 'displ'
## No issues found in 'year'
## No issues found in 'cyl'
## No issues found in 'trans'
## No issues found in 'drv'
## No issues found in 'cty'
## No issues found in 'hwy'
## No issues found in 'fl'
## No issues found in 'class'

Since mpg is already very well-formed, no issues were found. I’m going to artificially break the table so we can see what issues look like (I’m also turning verbose on so the function shows exactly what the issues were).

# Manually breaking mpg
mpg2 <- mpg %>%
  mutate(year = as.Date(year, origin = "1970-01-01"))

# Getting summary of diagnostics
mpg2 %>% diagnose() %>% issues(verbose = TRUE)
## No issues found in 'manufacturer'
## No issues found in 'model'
## No issues found in 'displ'
## Issues found in 'year'
##     Data isn't of type character
## No issues found in 'cyl'
## No issues found in 'trans'
## No issues found in 'drv'
## No issues found in 'cty'
## No issues found in 'hwy'
## No issues found in 'fl'
## No issues found in 'class'

As we can see, diagnose() was able to parse year, but it alerted us that it isn’t a character variable.
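To see why year stops looking like an ordinary numeric column, it can help to check what the conversion does to the values. The sketch below uses only base R on the two model years that appear in mpg (1999 and 2008); the variable names are illustrative and not part of doctr.

```r
# The two model years present in mpg, stored as integers
year <- c(1999L, 2008L)

# Same conversion as in the mutate() call: each integer is read as a
# number of days since the origin, not as a calendar year
year_as_date <- as.Date(year, origin = "1970-01-01")

class(year_as_date)
## [1] "Date"
format(year_as_date)
## [1] "1975-06-23" "1975-07-02"
```

The column is now of class Date and its values are dates in 1975, so it no longer resembles the integer column that the exams for mpg were based on.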

Creating custom exams

By default, diagnose() uses a function called guess_exams() to generate the exams it is going to run on a given table. This function grabs a sample of the table and tries to assign each of its variables to a type, going from the most to the least restrictive one. The types that appear in this document include count, quantity and character.

If you run guess_exams() yourself, you can customize the exams it generates and then pass them as an argument to diagnose() so that it uses your custom exams.

Let’s see how this works in practice.

exams <- guess_exams(mpg)

cols          funs       max_na  min_val  max_val  max_dec_places  min_unq  max_unq  least_frec_cls
manufacturer  character
model         character
displ         quantity
year          count
cyl           count
trans         character
drv           character
cty           count
hwy           count
fl            character
class         character

Each column in exams can be filled with a parameter that diagnose() will use to find problems in mpg. The table below lists what each parameter means and which variable types it applies to (for more information on types, run vignette("doctr_examine")):

parameter         numeric  text  factor  description
funs              x        x     x       which type should be used for base exams (percentage, money, etc.)
max_na            x        x     x       maximum % of NAs
min_val, max_val  x                      minimum and maximum values
max_dec_places    x                      maximum number of decimal places
min_unq, max_unq           x     x       minimum and maximum number of unique classes
least_frec_cls             x     x       minimum share of the total a class can represent (e.g. 0.2 for 20%)

Let’s customize these exams and use them with diagnose().

# Setting some arbitrary maximum and minimum values (row 8 is cty, row 9 is hwy)
exams$max_val[8] <- 30
exams$min_val[9] <- 15

# Setting least frequent class
exams$least_frec_cls[10] <- 0.2

# Setting maximum unique classes
exams$max_unq[1] <- 10

# Use custom exams to diagnose table
mpg %>% diagnose(exams) %>% issues()
## Issues found in 'manufacturer'
## No issues found in 'model'
## No issues found in 'displ'
## No issues found in 'year'
## No issues found in 'cyl'
## No issues found in 'trans'
## No issues found in 'drv'
## Issues found in 'cty'
## Issues found in 'hwy'
## Issues found in 'fl'
## No issues found in 'class'
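The row numbers used above (1, 8, 9 and 10) depend on the order of the variables in the table, which makes them easy to get wrong. A minimal sketch of looking rows up by name instead is shown below. It assumes the exams table behaves like a data frame whose first column is called cols, as in the printed output above. The stand-in data frame and the set_exam() helper are hypothetical and not part of doctr.

```r
# Stand-in for the table returned by guess_exams(mpg); only the columns
# needed here are reproduced (assumed layout, not doctr's actual object)
exams <- data.frame(
  cols = c("manufacturer", "model", "displ", "year", "cyl", "trans",
           "drv", "cty", "hwy", "fl", "class"),
  max_val = NA_real_,
  min_val = NA_real_,
  stringsAsFactors = FALSE
)

# Hypothetical helper: set one parameter for one variable, chosen by name
set_exam <- function(exams, col, param, value) {
  row <- which(exams$cols == col)
  # Fail early if the variable or the parameter does not exist
  stopifnot(length(row) == 1, param %in% names(exams))
  exams[[param]][row] <- value
  exams
}

exams <- set_exam(exams, "cty", "max_val", 30)
exams <- set_exam(exams, "hwy", "min_val", 15)

# Show only the rows that were modified
exams[exams$cols %in% c("cty", "hwy"), ]
```

With a real exams object, the same idea reduces to something like exams$max_val[exams$cols == "cty"] <- 30, provided the assumption about the cols column holds.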

Using the i parameter of issues() paired with verbose, we can pass the name or index of a column in order to get only the issues associated with it.

# Use custom exams to diagnose table
diagnostics <- diagnose(mpg, exams)

# Get results for 1st column
issues(diagnostics, i = 1, verbose = TRUE)
## Issues found in 'manufacturer'
##     There are more than 10 unique classes

# Get results for fl column
issues(diagnostics, i = "fl", verbose = TRUE)
## Issues found in 'fl'
##     There are 3 classes that represent less than 20% of the total
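The messages above can be cross-checked against the raw data without doctr. The sketch below only assumes that ggplot2 is installed, so that mpg is available, and uses base R for the checks themselves. It is an independent sanity check, not a description of how diagnose() works internally.

```r
# mpg ships with ggplot2; only base R functions are used for the checks
mpg <- ggplot2::mpg

# 'manufacturer': are there more than 10 unique classes?
length(unique(mpg$manufacturer)) > 10

# 'fl': how many classes represent less than 20% of the rows?
fl_share <- prop.table(table(mpg$fl))
sum(fl_share < 0.2)

# 'cty' and 'hwy': is any value above 30 or below 15, respectively?
any(mpg$cty > 30)
any(mpg$hwy < 15)
```

If the reports above are right, the first expression should be TRUE and the second should give 3. The last two correspond to the issues flagged for cty and hwy in the non-verbose summary.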