The skimr package is designed to provide summary statistics about variables. In base R the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors. Skimr is opinionated in its defaults but easy to modify.
For comparison purposes here are examples of the similar functions.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
summary(iris$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
fivenum(iris$Sepal.Length)
## [1] 4.3 5.1 5.8 6.4 7.9
summary(iris$Species)
## setosa versicolor virginica
## 50 50 50
The core function of skimr is skim(). skim() is an S3 generic function; the skimr package includes support for data frames and grouped data frames. Like summary() for data frames, skim() presents results for all of the columns, and the statistics depend on the class of the variable.
However, unlike summary.data.frame(), the printed results (those displayed in the console or in a knitted markdown file) are shown horizontally, with one row per variable, and are split into a separate tibble for each class of variable. The actual results are stored in a skim_df object that is also a tibble. In summary.data.frame() the statistics are stored in a table with one column for each variable, and the standard table printing is used to display the results.
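To see the storage difference concretely, the following minimal sketch inspects the object returned by summary() using only base R. It assumes nothing beyond the built-in iris data; the name base_summary is only for illustration.

```r
# Inspect how summary.data.frame() stores its results (base R only).
base_summary <- summary(iris)

class(base_summary)   # a "table" object, not a data frame
typeof(base_summary)  # the cells are pre-formatted character strings
dim(base_summary)     # one column per variable in iris

# Each cell holds a label and a number together as text, so extracting
# a statistic means parsing a string rather than reading a numeric column.
base_summary[4, 1]    # the cell holding the mean of Sepal.Length
```

This is the practical reason the long, numeric skim_df is easier to reuse in later calculations.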
library(skimr)
skim(iris)
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## Variable type: factor
## variable missing complete n n_unique top_counts
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0
## ordered
## FALSE
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 median p75 p100
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4
## hist
## ▇▁▁▂▅▅▃▁
## ▇▁▁▅▃▃▂▂
## ▂▇▅▇▆▅▂▂
## ▁▂▅▇▃▂▁▁
This distinction is important because the skim_df object is easy to use for additional manipulation if desired and is pipeable. For example, all of the results for a particular statistic or for one variable could be selected, or an alternative printing method could be developed.
The skim_df object always contains 6 columns:
variable: the name of the original variable.
type: the class of the variable.
stat: the statistic calculated, with the name becoming the column name when the object is printed.
level: used when a summary function returns multiple values when skimming. This happens, for example, for counts of levels for factor variables or when passing multiple values to the probs argument of the quantile() function.
value: the actual calculated value of the statistic, which should be used for further calculations. This is always numeric.
formatted: a formatted version of value that attempts to use a reasonable number of digits (decimal aligned) and to put items such as dates into human-readable formats. It is a character variable.
s <- skim(iris)
head(s, 15)
## # A tibble: 15 x 6
## variable type stat level value formatted
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Sepal.Length numeric missing .all 0 0
## 2 Sepal.Length numeric complete .all 150 150
## 3 Sepal.Length numeric n .all 150 150
## 4 Sepal.Length numeric mean .all 5.84 5.84
## 5 Sepal.Length numeric sd .all 0.828 0.83
## 6 Sepal.Length numeric p0 .all 4.30 4.3
## 7 Sepal.Length numeric p25 .all 5.10 5.1
## 8 Sepal.Length numeric median .all 5.80 5.8
## 9 Sepal.Length numeric p75 .all 6.40 6.4
## 10 Sepal.Length numeric p100 .all 7.90 7.9
## 11 Sepal.Length numeric hist .all NA ▂▇▅▇▆▅▂▂
## 12 Sepal.Width numeric missing .all 0 0
## 13 Sepal.Width numeric complete .all 150 150
## 14 Sepal.Width numeric n .all 150 150
## 15 Sepal.Width numeric mean .all 3.06 3.06
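Because the results are stored in long form, ordinary subsetting is enough to pull out one statistic for every variable, or every statistic for one variable. The sketch below does not call skimr; it builds a small stand-in data frame in base R with the same six columns so the idea can be tried in isolation. The helper long_stats() and the object s_like are hypothetical names, and only the mean and sd are computed.

```r
# Hypothetical helper: build a long, six-column data frame shaped like
# the skim_df layout (variable, type, stat, level, value, formatted).
long_stats <- function(df) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  rows <- lapply(numeric_cols, function(col) {
    values <- c(mean = mean(df[[col]]), sd = sd(df[[col]]))
    data.frame(
      variable  = col,
      type      = "numeric",
      stat      = names(values),
      level     = ".all",
      value     = unname(values),
      formatted = format(round(unname(values), 2)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

s_like <- long_stats(iris)

# All results for one statistic: one row per variable.
means <- s_like[s_like$stat == "mean", c("variable", "value")]

# All results for one variable: one row per statistic.
sepal_length <- s_like[s_like$variable == "Sepal.Length", ]
```

The same subsetting expressions should carry over to a real skim_df of the shape shown above, since they rely only on the stat, variable and value columns.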
skim() also supports grouped data. For grouped data, one additional column for each grouping variable is added to the skim object.
mtcars %>%
  dplyr::group_by(gear) %>%
  skim()
## Skim summary statistics
## n obs: 32
## n variables: 11
## group variables: gear
##
## Variable type: numeric
## gear variable missing complete n mean sd p0 p25 median
## 3 am 0 15 15 0 0 0 0 0
## 3 carb 0 15 15 2.67 1.18 1 2 3
## 3 cyl 0 15 15 7.47 1.19 4 8 8
## 3 disp 0 15 15 326.3 94.85 120.1 275.8 318
## 3 drat 0 15 15 3.13 0.27 2.76 3.04 3.08
## 3 hp 0 15 15 176.13 47.69 97 150 180
## 3 mpg 0 15 15 16.11 3.37 10.4 14.5 15.5
## 3 qsec 0 15 15 17.69 1.35 15.41 17.04 17.42
## 3 vs 0 15 15 0.2 0.41 0 0 0
## 3 wt 0 15 15 3.89 0.83 2.46 3.45 3.73
## 4 am 0 12 12 0.67 0.49 0 0 1
## 4 carb 0 12 12 2.33 1.3 1 1 2
## 4 cyl 0 12 12 4.67 0.98 4 4 4
## 4 disp 0 12 12 123.02 38.91 71.1 78.92 130.9
## 4 drat 0 12 12 4.04 0.31 3.69 3.9 3.92
## 4 hp 0 12 12 89.5 25.89 52 65.75 94
## 4 mpg 0 12 12 24.53 5.28 17.8 21 22.8
## 4 qsec 0 12 12 18.96 1.61 16.46 18.46 18.75
## 4 vs 0 12 12 0.83 0.39 0 1 1
## 4 wt 0 12 12 2.62 0.63 1.61 2.13 2.7
## 5 am 0 5 5 1 0 1 1 1
## 5 carb 0 5 5 4.4 2.61 2 2 4
## 5 cyl 0 5 5 6 2 4 4 6
## 5 disp 0 5 5 202.48 115.49 95.1 120.3 145
## 5 drat 0 5 5 3.92 0.39 3.54 3.62 3.77
## 5 hp 0 5 5 195.6 102.83 91 113 175
## 5 mpg 0 5 5 21.38 6.66 15 15.8 19.7
## 5 qsec 0 5 5 15.64 1.13 14.5 14.6 15.5
## 5 vs 0 5 5 0.2 0.45 0 0 0
## 5 wt 0 5 5 2.63 0.82 1.51 2.14 2.77
## p75 p100 hist
## 0 0 ▁▁▁▇▁▁▁▁
## 4 4 ▅▁▆▁▁▅▁▇
## 8 8 ▁▁▁▁▁▁▁▇
## 380 472 ▂▁▂▇▃▆▂▆
## 3.18 3.73 ▃▃▇▆▁▁▁▃
## 210 245 ▅▁▃▁▇▂▂▅
## 18.4 21.5 ▃▁▃▇▃▃▂▃
## 17.99 20.22 ▃▁▆▇▆▁▂▃
## 0 1 ▇▁▁▁▁▁▁▂
## 3.96 5.42 ▁▁▇▅▁▁▁▃
## 1 1 ▃▁▁▁▁▁▁▇
## 4 4 ▇▁▇▁▁▁▁▇
## 6 6 ▇▁▁▁▁▁▁▃
## 160 167.6 ▇▁▁▂▂▂▂▇
## 4.09 4.93 ▁▇▃▁▁▁▁▁
## 110 123 ▂▇▁▁▃▁▆▃
## 28.08 33.9 ▅▇▅▂▂▁▂▅
## 19.58 22.9 ▃▁▇▆▃▁▁▂
## 1 1 ▂▁▁▁▁▁▁▇
## 3.16 3.44 ▇▃▃▃▃▇▇▇
## 1 1 ▁▁▁▇▁▁▁▁
## 6 8 ▇▁▃▁▁▃▁▃
## 8 8 ▇▁▁▃▁▁▁▇
## 301 351 ▇▃▁▁▁▁▃▃
## 4.22 4.43 ▇▁▃▁▁▁▃▃
## 264 335 ▇▁▃▁▁▃▁▃
## 26 30.4 ▇▁▃▁▁▃▁▃
## 16.7 16.9 ▇▁▁▃▁▁▁▇
## 0 1 ▇▁▁▁▁▁▁▂
## 3.17 3.57 ▇▁▇▁▇▁▇▇
Individual columns from a data frame may be selected using tidyverse style selectors.
skim(iris, Sepal.Length, Species)
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## Variable type: factor
## variable missing complete n n_unique top_counts
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0
## ordered
## FALSE
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 median p75 p100
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9
## hist
## ▂▇▅▇▆▅▂▂
skim(iris, starts_with("Sepal"))
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 median p75 p100
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4
## hist
## ▂▇▅▇▆▅▂▂
## ▁▂▅▇▃▂▁▁
If an individual column is of an unsupported class, it is treated as a character variable with a warning.
The skim() function for a data frame returns a long, six-column data frame. This long data frame is printed horizontally as a separate summary for each data type found in the data frame, but the object itself is not transformed during the print.
Three other functions are available that may prove useful as part of skim workflows.
The skim_tee() function produces the same printed version as skim() but returns the unmodified data frame. This allows for continued piping of the original data.
The skim_to_list() function returns a list of the wide data frames for each data type. The data frames contain the formatted values, meaning that they are character data and most useful for display. In general, users will want to store the results in an object for further handling.
The skim_to_wide() function returns a single data frame with each variable in a row. Variables that do not report a given statistic are assigned NA for that statistic. Formatted values are returned and all data are character.
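As a rough illustration of the first of these, the sketch below uses skim_tee() in the middle of a pipeline. It assumes skimr and magrittr are installed; magrittr is attached explicitly only to guarantee that the %>% operator is available, and the object names passed_through and setosa are purely illustrative.

```r
library(skimr)
library(magrittr)  # provides %>%

# skim_tee() prints the skim summary as a side effect and then hands the
# original data frame on, so later steps still see the raw data.
passed_through <- skim_tee(iris)

# The returned object is the input data, not a skim_df.
identical(passed_through, iris)

# The same idea inside a pipeline: summarise first, then keep working.
setosa <- iris %>%
  skim_tee() %>%
  subset(Species == "setosa")
nrow(setosa)
```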
The skim() function also handles individual vectors that are not part of a data frame. For example, the lynx data set is of class ts.
skim(datasets::lynx)
## Skim summary statistics
##
## Variable type: ts
## variable missing complete n start end frequency deltat mean
## datasets::lynx 0 114 114 1821 1934 1 1 1538.02
## sd min max median line_graph
## 1585.84 39 6991 771 ⡈⢄⡠⢁⣀⠒⣀⠔
If you attempt to use skim() on a class that does not have support, it will coerce it to character (with a warning) and report the number of NAs, the number complete (non-missing), the number of rows, the number empty (i.e. “”), the minimum length of non-empty strings, the maximum length of non-empty strings, and the number of unique values.
lynx <- datasets::lynx
class(lynx) <- "unkown_class"
skim(lynx)
## Warning: No summary functions for vectors of class: unkown_class.
## Coercing to character
## Skim summary statistics
##
## Variable type: character
## variable missing complete n min max empty n_unique
## lynx 0 114 114 2 4 0 110
Skimr is opinionated in its choice of defaults, but users can easily add to, replace, or remove the statistics for a class.
To add a statistic, use a named list for each class in the format below.
classname = list(mad_name = mad)
skim_with(numeric = list(mad_name = mad))
skim(datasets::chickwts)
## Skim summary statistics
## n obs: 71
## n variables: 2
##
## Variable type: factor
## variable missing complete n n_unique top_counts
## feed 0 71 71 6 soy: 14, cas: 12, lin: 12, sun: 12
## ordered
## FALSE
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 median p75 p100
## weight 0 71 71 261.31 78.07 108 204.5 258 323.5 423
## hist mad_name
## ▃▅▅▇▃▇▂▂ 91.92
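Any function that takes a vector and returns a single value can be supplied in the same way. The sketch below, in base R only, builds a named list of the kind passed to skim_with() above and applies each function by hand to show what would be computed. The names custom_stats, iqr_na and cv are hypothetical, and the manual vapply() call is only a stand-in, not how skim() is implemented.

```r
# Hypothetical custom statistics: each takes a numeric vector and
# returns one number.
iqr_na <- function(x) IQR(x, na.rm = TRUE)
cv     <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Named list in the format shown above; the names become column names.
custom_stats <- list(mad_name = mad, iqr = iqr_na, cv = cv)

# Apply each function by hand to the weight column of chickwts.
weights <- datasets::chickwts$weight
results <- vapply(custom_stats, function(f) f(weights), numeric(1))
round(results, 2)

# Under the API described above, the same list would be registered with
# skim_with(numeric = custom_stats)
```

Wrapping a function so that it handles missing values itself, as iqr_na does here, keeps the statistic from returning NA for columns that contain missing data.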
The skim_with_defaults() function resets the list to the defaults. By default skim_with() appends the new statistics, but setting append = FALSE replaces the defaults.
skim_with_defaults()
skim_with(numeric = list(mad_name = mad), append = FALSE)
skim(datasets::chickwts)
## Skim summary statistics
## n obs: 71
## n variables: 2
##
## Variable type: factor
## variable missing complete n n_unique top_counts
## feed 0 71 71 6 soy: 14, cas: 12, lin: 12, sun: 12
## ordered
## FALSE
##
## Variable type: numeric
## variable mad_name
## weight 91.92
skim_with_defaults() # Reset to defaults
You can also use skim_with() to remove specific statistics by setting them to NULL.
skim_with(numeric = list(hist = NULL))
skim(datasets::chickwts)
## Skim summary statistics
## n obs: 71
## n variables: 2
##
## Variable type: factor
## variable missing complete n n_unique top_counts
## feed 0 71 71 6 soy: 14, cas: 12, lin: 12, sun: 12
## ordered
## FALSE
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 median p75 p100
## weight 0 71 71 261.31 78.07 108 204.5 258 323.5 423
skim_with_defaults() # Reset to defaults
Skimr does opinionated formatting of the statistics displayed when printing. These values are stored in the formatted column of the skim_df object and are always character. Skim attempts to use a reasonable number of decimal places for calculated values based on the data type (integer or numeric) and number of stored decimals. For statistics such as max() and min() the actual stored values are displayed. Decimals in a column are aligned. Date formats are used for date statistics.
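The distinction between a stored value and its formatted counterpart can be mimicked with base R. The sketch below is not skimr's formatting code; it only illustrates, with the hypothetical names raw_values and formatted, how rounding to a fixed number of decimals yields aligned character strings while the numeric values stay available for calculation.

```r
# Raw numeric statistics, as they might sit in the value column.
raw_values <- c(mean = 5.843333, sd = 0.8280661, n = 150)

# format() on a vector pads every element to a common width, so with a
# fixed number of decimals the decimal points line up when printed.
formatted <- format(round(raw_values, 2), nsmall = 2)

is.character(formatted)          # formatted values are text
print(formatted, quote = FALSE)

# Calculations should use the numeric values, not the text.
raw_values[["sd"]] / raw_values[["mean"]]
```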
Users can override the formats using the skim_format() function. Using show_formats() will display the current options in use for each data type. Using skim_format_defaults() will reset the formats to their default settings.
The skim_df object is a long data frame with one row for each combination of variable and statistic (and optionally for group). The horizontal display is created by default using print.skim_df(). This can be called explicitly by applying the print() function to a skim_df object, which allows options to be passed in. In addition, kable() and pander() are supported. These both provide more control over the rendered results, particularly when used in conjunction with knitr. The options for these functions are documented in more detail in the knitr package for kable() and the pander package for pander(). Using either of these may require use of document or chunk options and fonts.
kable(): for use in a markdown file, use a chunk option of results='asis'.
pander(): for use in a markdown file, use a chunk option of results='asis'. To prevent pander() from applying asis by default, use panderOptions() to set knitr.auto.asis to FALSE.
This topic is addressed in more detail in the Using Fonts vignette.
skim(iris) %>% kable()
Skim summary statistics
n obs: 150
n variables: 5
Variable type: factor
variable | missing | complete | n | n_unique | top_counts | ordered |
---|---|---|---|---|---|---|
Species | 0 | 150 | 150 | 3 | set: 50, ver: 50, vir: 50, NA: 0 | FALSE |
Variable type: numeric
variable | missing | complete | n | mean | sd | p0 | p25 | median | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
Petal.Length | 0 | 150 | 150 | 3.76 | 1.77 | 1 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▁▂▅▅▃▁ |
Petal.Width | 0 | 150 | 150 | 1.2 | 0.76 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 | ▇▁▁▅▃▃▂▂ |
Sepal.Length | 0 | 150 | 150 | 5.84 | 0.83 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 | ▂▇▅▇▆▅▂▂ |
Sepal.Width | 0 | 150 | 150 | 3.06 | 0.44 | 2 | 2.8 | 3 | 3.3 | 4.4 | ▁▂▅▇▃▂▁▁ |
library(pander)
panderOptions('knitr.auto.asis', FALSE)
skim(iris) %>% pander()
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ------------------------------------------------
## variable missing complete n n_unique
## ---------- --------- ---------- ----- ----------
## Species 0 150 150 3
## ------------------------------------------------
##
## Table: Table continues below
##
##
## ------------------------------------------
## top_counts ordered
## -------------------------------- ---------
## set: 50, ver: 50, vir: 50, NA: FALSE
## 0
## ------------------------------------------
##
##
## ----------------------------------------------------------------------------
## variable missing complete n mean sd p0 p25 median
## -------------- --------- ---------- ----- ------ ------ ----- ----- --------
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35
##
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3
##
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8
##
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3
## ----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------
## p75 p100 hist
## ----- ------ ----------
## 5.1 6.9 ▇▁▁▂▅▅▃▁
##
## 1.8 2.5 ▇▁▁▅▃▃▂▂
##
## 6.4 7.9 ▂▇▅▇▆▅▂▂
##
## 3.3 4.4 ▁▂▅▇▃▂▁▁
## -----------------------
The details of rendering are dependent on the operating system R is running on, the locale of the installation, and the fonts installed. Rendering may also differ based on whether it occurs in the console or when knitting to specific types of documents such as HTML and PDF. The most commonly reported problems involve rendering the spark graphs (inline histogram). This section will summarize known issues.
Currently pander() does not support inline_histograms on Windows. Also, Windows does not support spark line graphs.
In order to render the spark graphs (histograms or line graphs) in HTML or PDF, you may need to change to a font that supports block or braille characters (depending on which you need). Please review the separate vignette and associated template for details on this.