doctr is an R package that helps you check the consistency and quality of data.
In other words, the goal of the package is to automate, as much as possible, the task of verifying that everything is OK with a dataset. Like a real doctor, it has functions for examining, diagnosing and assessing the progress of its “patients”.
Since doctr was created with the Tidy Tools Manifesto in mind, it works well alongside the tidyverse.
One of doctr’s main functions is compare(), which compares the profiles of two tables and checks whether they can be considered similar enough. This is very useful when we’re tracking how a table evolves over time, e.g. when we receive some data gathered in January and then some data gathered in February.
After running compare(), we can use the issues() function to get a report about the results of the comparison.
Let’s see how this works with an example dataset: ggplot2::mpg. Since we don’t have multiple versions of mpg, I’m going to use the full dataset as the “January” version and a random sample as the “February” version.
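# Loading the packages used in this example
library(dplyr)    # sample_n() and %>%
library(ggplot2)  # mpg dataset
library(doctr)
# Creating artificial versions of the dataset
mpg_jan <- mpg
mpg_feb <- sample_n(mpg, 100)
# Comparing mpg_jan and mpg_feb
comparison <- compare(mpg_jan, mpg_feb)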
Now the comparison object contains all the errors found while comparing the two datasets. By using issues() we can get human-readable reports on these errors.
# Getting summary of comparison
issues(comparison)
## No issues found in 'manufacturer'
## Issues found in 'model'
## Issues found in 'displ'
## Issues found in 'year'
## Issues found in 'cyl'
## No issues found in 'trans'
## Issues found in 'drv'
## Issues found in 'cty'
## Issues found in 'hwy'
## No issues found in 'fl'
## No issues found in 'class'
By passing the name or index of a column to the i parameter of issues(), together with verbose = TRUE, we get only the issues associated with that column.
# Get results for 3rd column
issues(comparison, i = 3, verbose = TRUE)
## Issues found in 'displ'
## New value for '5%' is too high
## New value for '20%' is too high
## New value for '40%' is too high
## New value for '50%' is too high
## New value for '99%' is too low
# Get results for hwy column
issues(comparison, i = "hwy", verbose = TRUE)
## Issues found in 'hwy'
## New value for '50%' is too low
## New value for 'mean' is too low
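To review the details for several columns at once, one option is to loop over column names and call issues() on each of them. The following is a minimal sketch that only relies on the i and verbose parameters shown above. The variable cols_to_check is an illustrative name, and the sketch assumes that issues() prints its report as a side effect, as the output above suggests.

```r
library(dplyr)    # sample_n()
library(ggplot2)  # mpg dataset
library(doctr)

# Same setup as in the example above
mpg_jan <- mpg
mpg_feb <- sample_n(mpg, 100)
comparison <- compare(mpg_jan, mpg_feb)

# Columns whose details we want to inspect (illustrative selection)
cols_to_check <- c("displ", "cty", "hwy")

# Request the verbose report for each selected column, one after another
for (col in cols_to_check) {
  issues(comparison, i = col, verbose = TRUE)
}
```

Because mpg_feb is a random sample, the columns flagged and the statistics reported will vary from run to run.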
Many different issues can arise during a comparison. Each one is identified by the code of a summary statistic computed by examine() (for more information see vignette("doctr_examine")), along with whether the new value was considered too low or too high. Here’s what each of these codes means and for which types of variables it comes up:
code | numeric | text | factor | description |
---|---|---|---|---|
min, max | x | x |  | minimum and maximum value/length |
1%, …, 99% | x | x |  | value/length percentiles |
mean | x | x |  | mean value/length |
sd | x | x |  | value/length standard deviation |
na, val | x | x |  | percentage of missing and non-missing entries |
neg, zero, pos | x |  |  | percentage of negative, zero and positive values |
unq | x | x |  | count of unique values/texts |
mdp | x |  |  | maximum number of decimal places |
asc |  | x |  | equals 1 if the text is identified as ASCII |
ltr, num |  | x |  | percentage of text that is identified as letters and numbers |
data |  |  | x | each factor level |
cnt, frq |  |  | x | count and frequency of each level |
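To get a feel for what the numeric codes refer to, the corresponding statistics can be approximated by hand in base R. The sketch below is not how doctr builds its profiles internally: the helper profile_num() is hypothetical, the choice of percentiles is an assumption, and na is computed as a fraction rather than a percentage. It only illustrates what codes such as 5%, mean or unq stand for.

```r
library(ggplot2)  # provides the mpg dataset

# Hypothetical helper: a hand-rolled numeric profile, NOT doctr's implementation
profile_num <- function(x) {
  c(
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE),
    # quantile() names its results "5%", "50%" and "95%"
    quantile(x, probs = c(0.05, 0.5, 0.95), na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    na   = mean(is.na(x)),     # fraction of missing entries
    unq  = length(unique(x))   # count of unique values
  )
}

set.seed(42)  # arbitrary seed so the sample is reproducible
jan <- mpg$displ               # "January": the full column
feb <- sample(mpg$displ, 100)  # "February": a random sample of 100 values

# Side-by-side view of the two profiles
round(cbind(january = profile_num(jan), february = profile_num(feb)), 2)
```

A message like "New value for '5%' is too high" then refers to a row of this kind: the statistic in the newer table has moved further from the older one than the comparison tolerates.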
It is also possible to make the comparison more or less sensitive by passing a different ci (confidence interval) value. With ci = 0.5, for example, the same comparison also flags manufacturer and trans:
mpg_jan %>% compare(mpg_feb, ci = 0.5) %>% issues()
## Issues found in 'manufacturer'
## Issues found in 'model'
## Issues found in 'displ'
## Issues found in 'year'
## Issues found in 'cyl'
## Issues found in 'trans'
## Issues found in 'drv'
## Issues found in 'cty'
## Issues found in 'hwy'
## No issues found in 'fl'
## No issues found in 'class'
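To see how the ci setting changes the outcome, the same pair of tables can be compared under several values in a loop. This is a minimal sketch using only compare() and issues() as shown above. The ci values other than 0.5 are arbitrary choices for illustration, and it is an assumption that compare() accepts any value in this range.

```r
library(dplyr)    # sample_n()
library(ggplot2)  # mpg dataset
library(doctr)

# Same setup as in the example above
mpg_jan <- mpg
mpg_feb <- sample_n(mpg, 100)

# Illustrative ci values; only 0.5 appears in the example above
ci_values <- c(0.1, 0.3, 0.5)

for (ci in ci_values) {
  message("--- ci = ", ci, " ---")
  # Re-run the comparison with this ci and print the summary report
  issues(compare(mpg_jan, mpg_feb, ci = ci))
}
```

Keeping mpg_feb fixed across the iterations ensures that any difference between the reports comes from ci alone and not from a new random sample.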