collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are to facilitate complex data transformation, exploration and computing tasks in R, and to help make R code fast, flexible, parsimonious and programmer friendly.
This vignette focuses on the integration of collapse and the popular dplyr package by Hadley Wickham. In particular it will demonstrate how using collapse’s fast functions and some fast alternatives for dplyr verbs can substantially facilitate and speed up basic data manipulation, grouped and weighted aggregations and transformations, and panel data computations (i.e. between- and within-transformations, panel-lags, differences and growth rates) in a dplyr (piped) workflow.
Notes:
This vignette is targeted at dplyr / tidyverse users. collapse is a standalone package that can also be used efficiently without pipes or dplyr verbs.
The ‘Introduction to collapse’ vignette provides a thorough introduction to the package, and built-in structured documentation is available under help("collapse-documentation") after installing the package. In addition, help("collapse-package") provides a compact set of examples for a quick start.
Documentation and vignettes can also be viewed online.
A key feature of collapse is its broad set of Fast Statistical Functions (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fNobs, fNdistinct), which are able to substantially speed up column-wise, grouped and weighted computations on vectors, matrices or data frames. The functions are S3 generic, with a default (vector), matrix and data frame method, as well as a grouped_df method for grouped tibbles used by dplyr. The grouped tibble method has the following arguments:
FUN.grouped_df(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
use.g.names = FALSE, keep.group_vars = TRUE, [keep.w = TRUE,] ...)
where w is a weight variable, and TRA can be used to transform x using the computed statistics and one of 10 available transformations ("replace_fill", "replace", "-", "-+", "/", "%", "+", "*", "%%", "-%%", discussed in section 2). na.rm efficiently removes missing values and is TRUE by default. use.g.names generates new row-names from the unique combinations of groups (default: disabled), whereas keep.group_vars (default: enabled) keeps the grouping columns, as is custom in the native data %>% group_by(...) %>% summarize(...) workflow in dplyr. Finally, keep.w regulates whether a weighting variable used is also aggregated and saved in a column. For fsum, fmean, fmedian, fnth, fvar, fsd and fmode this will compute the sum of the weights in each group, whereas fprod returns the product of the weights.
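For orientation, the sketch below illustrates the w and TRA arguments on mtcars (my own toy example, not from the original text; wt serves purely as an illustrative weight):

library(collapse)
library(dplyr)
# Weighted group means; the grouping column and aggregated weights are kept by default
mtcars %>% group_by(cyl) %>% fmean(w = wt)
# The same statistics swept out of the data instead (weighted centering)
mtcars %>% group_by(cyl) %>% fmean(w = wt, TRA = "-")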
With that in mind, let’s consider some straightforward applications.
Consider the Groningen Growth and Development Center 10-Sector Database included in collapse and introduced in the main vignette:
library(collapse)
head(GGDC10S)
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 1 BWA SSA Sub-saharan Africa VA 1960 NA NA NA NA
# 2 BWA SSA Sub-saharan Africa VA 1961 NA NA NA NA
# 3 BWA SSA Sub-saharan Africa VA 1962 NA NA NA NA
# 4 BWA SSA Sub-saharan Africa VA 1963 NA NA NA NA
# 5 BWA SSA Sub-saharan Africa VA 1964 16.30154 3.494075 0.7365696 0.1043936
# 6 BWA SSA Sub-saharan Africa VA 1965 15.72700 2.495768 1.0181992 0.1350976
# CON WRT TRA FIRE GOV OTH SUM
# 1 NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA
# 5 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229
# 6 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710
# Summarize the Data:
# descr(GGDC10S, cols = is.categorical)
# aperm(qsu(GGDC10S, ~Variable, cols = is.numeric))
Simple column-wise computations using the fast functions and pipe operators are performed as follows:
library(dplyr)
GGDC10S %>% fNobs # Number of Observations
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 5027 5027 5027 5027 5027 4364 4355 4355 4354
# CON WRT TRA FIRE GOV OTH SUM
# 4355 4355 4355 4355 3482 4248 4364
GGDC10S %>% fNdistinct # Number of distinct values
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 43 6 6 2 67 4353 4224 4353 4237
# CON WRT TRA FIRE GOV OTH SUM
# 4339 4344 4334 4349 3470 4238 4364
GGDC10S %>% select_at(6:16) %>% fmedian # Median
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# 4394.5194 173.2234 3718.0981 167.9500 1473.4470 3773.6430 1174.8000 960.1251 3928.5127
# OTH SUM
# 1433.1722 23186.1936
GGDC10S %>% select_at(6:16) %>% fmean # Mean
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# 2526696.5 1867908.9 5538491.4 335679.5 1801597.6 3392909.5 1473269.7 1657114.8 1712300.3
# OTH SUM
# 1684527.3 21566436.8
GGDC10S %>% fmode # Mode
# Country Regioncode Region Variable Year
# "USA" "ASI" "Asia" "EMP" "2010"
# AGR MIN MAN PU CON
# "171.315882316326" "0" "4645.12507642586" "0" "1.34623115930777"
# WRT TRA FIRE GOV OTH
# "21.8380052682527" "8.97743416914571" "40.0701608636442" "0" "3626.84423577048"
# SUM
# "37.4822945751317"
GGDC10S %>% fmode(drop = FALSE) # Keep data structure intact
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA
# 1 USA ASI Asia EMP 2010 171.3159 0 4645.125 0 1.346231 21.83801 8.977434
# FIRE GOV OTH SUM
# 1 40.07016 0 3626.844 37.48229
Moving on to grouped statistics, we can compute the average value added and employment by sector and country using:
GGDC10S %>%
group_by(Variable, Country) %>%
select_at(6:16) %>% fmean
# # A tibble: 85 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 1.02e2 7.42e2 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35e0 1.23e2 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 3.65e2 3.52e3 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09e0 2.53e1 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 2.94e1 2.96e2 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1.61e3 2.09e4 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
# 7 EMP COL 3091. 145. 1175. 3.39e1 5.24e2 2.07e3 4.70e2 649. NA 1.73e3 9.89e3
# 8 EMP CRI 231. 1.70 136. 1.43e1 5.76e1 1.57e2 4.24e1 54.9 128. 6.51e1 8.87e2
# 9 EMP DEW 2490. 407. 8473. 2.26e2 2.09e3 4.44e3 1.48e3 1689. 3945. 9.99e2 2.62e4
# 10 EMP DNK 236. 8.03 507. 1.38e1 1.71e2 4.55e2 1.61e2 181. 549. 1.11e2 2.39e3
# # ... with 75 more rows
Similarly, we can aggregate using any of the other functions above.
It is important not to use dplyr’s summarize together with these functions, since that would eliminate their speed gain. These functions are fast because they are executed only once and carry out the grouped computations in C++, whereas summarize will apply the function to each group in the grouped tibble.
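To make the contrast concrete, the two calls below produce the same aggregates, but only the first realizes the speed gain (a sketch of the two call patterns, not part of the original examples):

# One call: grouped means computed in a single pass in C++
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean
# Avoid: fmean dispatched separately on every group and column via summarize
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% summarise_all(fmean)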
To better explain this point it is perhaps good to shed some light on what is happening behind the scenes of dplyr and collapse. Fundamentally both packages follow different computing paradigms:
dplyr is an efficient implementation of the Split-Apply-Combine computing paradigm. Data is split into groups, these data-chunks are then passed to a function carrying out the computation, and finally recombined to produce the aggregated data.frame. This modus operandi is evident in the grouping mechanism of dplyr. When a data.frame is passed through group_by, a ‘groups’ attribute is attached:
GGDC10S %>% group_by(Variable, Country) %>% attr("groups")
# # A tibble: 85 x 3
# Variable Country .rows
# * <chr> <chr> <list<int>>
# 1 EMP ARG [62]
# 2 EMP BOL [61]
# 3 EMP BRA [62]
# 4 EMP BWA [52]
# 5 EMP CHL [63]
# 6 EMP CHN [62]
# 7 EMP COL [61]
# 8 EMP CRI [62]
# 9 EMP DEW [61]
# 10 EMP DNK [64]
# # ... with 75 more rows
This object is a data.frame giving the unique groups and, in the third (last) column, vectors containing the indices of the rows belonging to each group. A command like summarize uses this information to split the data.frame into groups, which are then passed sequentially to the function used and later recombined. These steps are also done in C++, which makes dplyr quite efficient.
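In base R terms, this paradigm corresponds roughly to the following sketch for a single column (my own illustration):

# Split AGR by groups, apply median to each chunk, recombine into a vector
sapply(split(GGDC10S$AGR, list(GGDC10S$Variable, GGDC10S$Country), drop = TRUE),
       median, na.rm = TRUE) %>% head(3)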
Now collapse is based around one-pass grouped computations at the C++ level, using its own grouped statistical functions. In other words, the data is not split and recombined at all; rather, the entire computation is performed in a single C++ loop running through the data and completing the computations for each group simultaneously. This modus operandi is also evident in collapse grouping objects. The method GRP.grouped_df takes a dplyr grouping object from a grouped tibble and efficiently converts it to a collapse grouping object:
GGDC10S %>% group_by(Variable, Country) %>% GRP %>% str
# List of 8
# $ N.groups : int 85
# $ group.id : int [1:5027] 46 46 46 46 46 46 46 46 46 46 ...
# $ group.sizes: int [1:85] 62 61 62 52 63 62 61 62 61 64 ...
# $ groups :List of 2
# ..$ Variable: chr [1:85] "EMP" "EMP" "EMP" "EMP" ...
# .. ..- attr(*, "label")= chr "Variable"
# .. ..- attr(*, "format.stata")= chr "%9s"
# ..$ Country : chr [1:85] "ARG" "BOL" "BRA" "BWA" ...
# .. ..- attr(*, "label")= chr "Country"
# .. ..- attr(*, "format.stata")= chr "%9s"
# $ group.vars : chr [1:2] "Variable" "Country"
# $ ordered : logi [1:2] TRUE TRUE
# $ order : NULL
# $ call : language GRP.grouped_df(X = .)
# - attr(*, "class")= chr "GRP"
This object is a list where the first three elements give the number of groups, the group-id to which each row belongs, and a vector of group sizes. A function like fsum uses this information to (for each column) create a result vector of size ‘N.groups’ and then run through the column, using the ‘group.id’ vector to add the i’th data point to the ’group.id[i]’th element of the result vector. When the loop is finished, the grouped computation is also finished.
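The following plain-R sketch (my own, for illustration only) mimics what this single loop does for one column; the real implementation is of course in compiled code and handles missing values and all-missing groups according to the na.rm argument:

# One result slot per group; a single pass adds each data point to its group's slot
single_pass_sum <- function(x, group.id, N.groups) {
  out <- numeric(N.groups)
  for (i in seq_along(x))
    if (!is.na(x[i])) out[group.id[i]] <- out[group.id[i]] + x[i]
  out
}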
It is obvious that collapse is faster than dplyr, since its method of computing involves fewer steps and it does not need to call statistical functions multiple times. See the benchmark section.
collapse fast functions do not reach their maximal performance on a grouped tibble created with group_by, because of the additional cost of converting the grouping object incurred by GRP.grouped_df. This cost is already minimized through the use of C++, but we can do even better by replacing group_by with collapse::fgroup_by. fgroup_by works like group_by, but does the grouping with collapse::GRP (up to 10x faster than group_by) and simply attaches a collapse grouping object to the grouped_df. The speed gain is thus 2-fold: faster grouping and no conversion cost when calling collapse functions.
Another improvement comes from replacing the dplyr verb select with collapse::fselect and, for selection using column names, indices or functions, using collapse::get_vars instead of select_at or select_if. Next to get_vars, collapse also introduces the predicates num_vars, cat_vars, char_vars, fact_vars, logi_vars and Date_vars to efficiently select columns by type.
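For instance, a quick sketch of the type-based selectors:

GGDC10S %>% num_vars %>% head(2)    # numeric columns only
GGDC10S %>% char_vars %>% head(2)   # character columns only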
GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fmedian
# # A tibble: 85 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1325. 47.4 1988. 1.05e2 7.82e2 1.85e3 5.80e2 464. 1739. 866. 9.74e3
# 2 EMP BOL 943. 53.5 167. 4.46e0 6.60e1 1.32e2 9.70e1 15.3 NA 384. 1.84e3
# 3 EMP BRA 17481. 225. 7208. 3.76e2 4.05e3 6.45e3 1.58e3 4355. 4450. 4479. 5.19e4
# 4 EMP BWA 175. 12.2 13.1 3.71e0 1.90e1 2.11e1 6.75e0 10.4 53.8 31.2 3.61e2
# 5 EMP CHL 690. 93.9 607. 2.58e1 2.30e2 4.84e2 2.05e2 106. NA 900. 3.31e3
# 6 EMP CHN 293915 8150. 61761. 1.14e3 1.06e4 1.70e4 9.56e3 4328. 19468. 9954. 4.45e5
# 7 EMP COL 3006. 84.0 1033. 3.71e1 4.19e2 1.55e3 3.91e2 655. NA 1430. 8.63e3
# 8 EMP CRI 216. 1.49 114. 7.92e0 5.50e1 8.98e1 2.55e1 19.6 122. 60.6 7.19e2
# 9 EMP DEW 2178 320. 8459. 2.47e2 2.10e3 4.45e3 1.53e3 1656 3700 900 2.65e4
# 10 EMP DNK 187. 3.75 508. 1.36e1 1.65e2 4.61e2 1.61e2 169. 642. 104. 2.42e3
# # ... with 75 more rows
library(microbenchmark)
microbenchmark(collapse = GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fmedian,
hybrid = GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmedian,
dplyr = GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% summarise_all(median, na.rm = TRUE))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# collapse 938.012 987.7685 1037.213 1031.501 1052.251 1262.88 100 a
# hybrid 11944.255 12572.3480 13045.616 12755.979 13085.533 31442.58 100 b
# dplyr 60455.797 61915.9180 64354.441 62579.488 66871.717 81970.72 100 c
Benchmarks on the different components of this code and with larger data are provided under ‘Benchmarks’. Note that a grouped tibble created with fgroup_by can no longer be used for grouped computations with dplyr verbs like mutate or summarize. To avoid errors with these functions and with print.grouped_df, [.grouped_df etc., the classes assigned after fgroup_by are reshuffled, so that the data.frame is treated by the dplyr ecosystem like a normal tibble:
class(group_by(GGDC10S, Variable, Country))
# [1] "grouped_df" "tbl_df" "tbl" "data.frame"
class(fgroup_by(GGDC10S, Variable, Country))
# [1] "tbl_df" "tbl" "grouped_df" "data.frame"
Also note that fselect and get_vars are not full drop-in replacements for select, because they do not have a grouped_df method:
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% head(3)
# # A tibble: 3 x 13
# # Groups: Variable, Country [1]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% head(3)
# # A tibble: 3 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA NA
Since by default keep.group_vars = TRUE in the Fast Statistical Functions, the end result is nevertheless the same:
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean %>% head(3)
# # A tibble: 3 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043. 992. 10542.
# 2 EMP BOL 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA 396. 2221.
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307. 5710. 54273.
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% fmean %>% head(3)
# # A tibble: 3 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043. 992. 10542.
# 2 EMP BOL 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA 396. 2221.
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307. 5710. 54273.
Another useful verb introduced by collapse is fgroup_vars, which can be used to efficiently obtain the grouping columns or grouping variables from a grouped tibble:
# fgroup_vars fully supports grouped tibbles created with group_by or fgroup_by:
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 x 2
# Variable Country
# <chr> <chr>
# 1 VA BWA
# 2 VA BWA
# 3 VA BWA
GGDC10S %>% fgroup_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 x 2
# Variable Country
# <chr> <chr>
# 1 VA BWA
# 2 VA BWA
# 3 VA BWA
# The other possibilities:
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("unique") %>% head(3)
# # A tibble: 3 x 2
# Variable Country
# <chr> <chr>
# 1 EMP ARG
# 2 EMP BOL
# 3 EMP BRA
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("names")
# [1] "Variable" "Country"
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("indices")
# [1] 4 1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_indices")
# Variable Country
# 4 1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("logical")
# [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_logical")
# Country Regioncode Region Variable Year AGR MIN MAN PU
# TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# CON WRT TRA FIRE GOV OTH SUM
# FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Another collapse verb to mention here is fsubset, a faster alternative to dplyr::filter which also provides an option to flexibly subset columns after the subset argument:
# Two equivalent calls, the first is substantially faster
GGDC10S %>% fsubset(Variable == "VA" & Year > 1990, Country, Year, AGR:GOV) %>% head(3)
# Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# 1 BWA 1991 303.1157 2646.950 472.6488 160.6079 580.0876 806.7509 232.7884 432.6965 1073.263
# 2 BWA 1992 333.4364 2690.939 537.4274 178.4532 678.7320 725.2577 285.1403 517.2141 1234.012
# 3 BWA 1993 404.5488 2624.928 567.3420 219.2183 634.2797 771.8253 349.7458 673.2540 1487.193
GGDC10S %>% filter(Variable == "VA" & Year > 1990) %>% select(Country, Year, AGR:GOV) %>% head(3)
# Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# 1 BWA 1991 303.1157 2646.950 472.6488 160.6079 580.0876 806.7509 232.7884 432.6965 1073.263
# 2 BWA 1992 333.4364 2690.939 537.4274 178.4532 678.7320 725.2577 285.1403 517.2141 1234.012
# 3 BWA 1993 404.5488 2624.928 567.3420 219.2183 634.2797 771.8253 349.7458 673.2540 1487.193
collapse also offers roworder, frename, colorder, ftransform/TRA and fungroup as fast replacements for dplyr::arrange, dplyr::rename, dplyr::relocate, dplyr::mutate and dplyr::ungroup.
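Minimal sketches of two of these replacements (my own examples, assuming the default behavior of each verb):

GGDC10S %>% roworder(Variable, -Year) %>% head(2)    # like arrange(Variable, desc(Year))
GGDC10S %>% colorder(Variable, Country) %>% head(2)  # like relocate(Variable, Country)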
One can also aggregate with multiple functions at the same time. For such operations it is often necessary to use curly braces { to prevent first-argument injection, so that %>% cbind(FUN1(.), FUN2(.)) does not evaluate as %>% cbind(., FUN1(.), FUN2(.)):
GGDC10S %>%
fgroup_by(Variable, Country) %>%
get_vars(6:16) %>% {
cbind(fmedian(.),
add_stub(fmean(., keep.group_vars = FALSE), "mean_"))
} %>% head(3)
# Variable Country AGR MIN MAN PU CON WRT TRA
# 1 EMP ARG 1324.5255 47.35255 1987.5912 104.738825 782.40283 1854.612 579.93982
# 2 EMP BOL 943.1612 53.53538 167.1502 4.457895 65.97904 132.225 96.96828
# 3 EMP BRA 17480.9810 225.43693 7207.7915 375.851832 4054.66103 6454.523 1580.81120
# FIRE GOV OTH SUM mean_AGR mean_MIN mean_MAN mean_PU mean_CON
# 1 464.39920 1738.836 866.1119 9743.223 1419.8013 52.08903 1931.7602 101.720936 742.4044
# 2 15.34259 NA 384.0678 1842.055 964.2103 56.03295 235.0332 5.346433 122.7827
# 3 4354.86210 4449.942 4478.6927 51881.110 17191.3529 206.02389 6991.3710 364.573404 3524.7384
# mean_WRT mean_TRA mean_FIRE mean_GOV mean_OTH mean_SUM
# 1 1982.1775 648.5119 627.79291 2043.471 992.4475 10542.177
# 2 281.5164 115.4728 44.56442 NA 395.5650 2220.524
# 3 8509.4612 2054.3731 4413.54448 5307.280 5710.2665 54272.985
The function add_stub used above is a collapse function adding a prefix (default) or suffix to variable names. The collapse predicate add_vars provides a more efficient alternative to cbind.data.frame. The idea here is ‘adding’ variables to the data.frame in the first argument, i.e. the attributes of the first argument are preserved, so the expression below still gives a tibble instead of a data.frame:
GGDC10S %>%
fgroup_by(Variable, Country) %>% {
add_vars(get_vars(., "Reg", regex = TRUE) %>% ffirst, # Regular expression matching column names
num_vars(.) %>% fmean(keep.group_vars = FALSE) %>% add_stub("mean_"), # num_vars selects all numeric variables
fselect(., PU:TRA) %>% fmedian(keep.group_vars = FALSE) %>% add_stub("median_"),
fselect(., PU:CON) %>% fmin(keep.group_vars = FALSE) %>% add_stub("min_"))
} %>% head(3)
# # A tibble: 3 x 22
# Variable Country Regioncode Region mean_Year mean_AGR mean_MIN mean_MAN mean_PU mean_CON mean_WRT
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1980. 1420. 52.1 1932. 102. 742. 1982.
# 2 EMP BOL LAM Latin~ 1980 964. 56.0 235. 5.35 123. 282.
# 3 EMP BRA LAM Latin~ 1980. 17191. 206. 6991. 365. 3525. 8509.
# # ... with 11 more variables: mean_TRA <dbl>, mean_FIRE <dbl>, mean_GOV <dbl>, mean_OTH <dbl>,
# # mean_SUM <dbl>, median_PU <dbl>, median_CON <dbl>, median_WRT <dbl>, median_TRA <dbl>,
# # min_PU <dbl>, min_CON <dbl>
Another nice feature of add_vars is that it can also very efficiently reorder columns, i.e. bind columns in a different order than they are passed. This can be done by simply specifying the positions the added columns should have in the final data frame; add_vars then shifts the columns of the first argument to the right to fill in the gaps.
GGDC10S %>%
fsubset(Variable == "VA", Country, AGR, SUM) %>%
fgroup_by(Country) %>% {
add_vars(fgroup_vars(.,"unique"),
fmean(., keep.group_vars = FALSE) %>% add_stub("mean_"),
fsd(., keep.group_vars = FALSE) %>% add_stub("sd_"),
pos = c(2,4,3,5))
} %>% head(3)
# Country mean_AGR sd_AGR mean_SUM sd_SUM
# 1 ARG 14951.292 33061.413 152533.84 301316.25
# 2 BOL 3299.718 4456.331 22619.18 33172.98
# 3 BRA 76870.146 59441.696 1200562.67 976963.14
A much more compact solution to multi-function and multi-type aggregation is offered by the function collapg:
# This aggregates numeric columns using the mean (fmean) and categorical columns with the mode (fmode)
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg %>% head(3)
# # A tibble: 3 x 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1980. 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043.
# 2 EMP BOL LAM Latin~ 1980 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA
# 3 EMP BRA LAM Latin~ 1980. 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307.
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>
By default it aggregates numeric columns using fmean and categorical columns using fmode, and preserves the order of all columns. Changing these defaults is very easy:
# This aggregates numeric columns using the median and categorical columns using the last value
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg(fmedian, flast) %>% head(3)
# # A tibble: 3 x 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1980. 1325. 47.4 1988. 105. 782. 1855. 580. 464.
# 2 EMP BOL LAM Latin~ 1980 943. 53.5 167. 4.46 66.0 132. 97.0 15.3
# 3 EMP BRA LAM Latin~ 1980. 17481. 225. 7208. 376. 4055. 6455. 1581. 4355.
# # ... with 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>
One can apply multiple functions to both numeric and/or categorical data:
GGDC10S %>% fgroup_by(Variable, Country) %>%
collapg(list(fmean, fmedian), list(first, fmode, flast)) %>% head(3)
# # A tibble: 3 x 32
# Variable Country first.Regioncode fmode.Regioncode flast.Regioncode first.Region fmode.Region
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 EMP ARG LAM LAM LAM Latin Ameri~ Latin Ameri~
# 2 EMP BOL LAM LAM LAM Latin Ameri~ Latin Ameri~
# 3 EMP BRA LAM LAM LAM Latin Ameri~ Latin Ameri~
# # ... with 25 more variables: flast.Region <chr>, fmean.Year <dbl>, fmedian.Year <dbl>,
# # fmean.AGR <dbl>, fmedian.AGR <dbl>, fmean.MIN <dbl>, fmedian.MIN <dbl>, fmean.MAN <dbl>,
# # fmedian.MAN <dbl>, fmean.PU <dbl>, fmedian.PU <dbl>, fmean.CON <dbl>, fmedian.CON <dbl>,
# # fmean.WRT <dbl>, fmedian.WRT <dbl>, fmean.TRA <dbl>, fmedian.TRA <dbl>, fmean.FIRE <dbl>,
# # fmedian.FIRE <dbl>, fmean.GOV <dbl>, fmedian.GOV <dbl>, fmean.OTH <dbl>, fmedian.OTH <dbl>,
# # fmean.SUM <dbl>, fmedian.SUM <dbl>
Applying multiple functions to only numeric (or only categorical) data allows the result to be returned in a long format:
GGDC10S %>% fgroup_by(Variable, Country) %>%
collapg(list(fmean, fmedian), cols = is.numeric, return = "long") %>% head(3)
# # A tibble: 3 x 15
# Function Variable Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 fmean EMP ARG 1980. 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043. 992.
# 2 fmean EMP BOL 1980 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA 396.
# 3 fmean EMP BRA 1980. 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307. 5710.
# # ... with 1 more variable: SUM <dbl>
Finally, collapg also makes it very easy to apply aggregator functions to certain columns only:
GGDC10S %>% fgroup_by(Variable, Country) %>%
collapg(custom = list(fmean = 6:8, fmedian = 10:12)) %>% head(3)
# # A tibble: 3 x 8
# Variable Country fmean.AGR fmean.MIN fmean.MAN fmedian.CON fmedian.WRT fmedian.TRA
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 782. 1855. 580.
# 2 EMP BOL 964. 56.0 235. 66.0 132. 97.0
# 3 EMP BRA 17191. 206. 6991. 4055. 6455. 1581.
To understand more about collapg, look it up in the documentation (?collapg).
Weighted aggregations are possible with the functions fsum, fprod, fmean, fmedian, fnth, fmode, fvar and fsd. The implementation is such that by default (option keep.w = TRUE) these functions also aggregate the weights, so that further weighted computations can be performed on the aggregated data. fprod saves the product of the weights, whereas the other functions save the sum of the weights in a column next to the grouping variables. If na.rm = TRUE (the default), rows with missing weights are omitted from the computation.
# This computes a frequency-weighted grouped standard-deviation, taking the total EMP / VA as weight
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(AGR:SUM) %>% fsd(SUM) %>% head(3)
# # A tibble: 3 x 13
# Variable Country sum.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 653615. 225. 22.2 176. 20.5 285. 856. 195. 493. 1123. 506.
# 2 EMP BOL 135452. 99.7 17.1 168. 4.87 123. 324. 98.1 69.8 NA 258.
# 3 EMP BRA 3364925. 1587. 73.8 2952. 93.8 1861. 6285. 1306. 3003. 3621. 4257.
# This computes a weighted grouped mode, taking the total EMP / VA as weight
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(AGR:SUM) %>% fmode(SUM) %>% head(3)
# # A tibble: 3 x 13
# Variable Country sum.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 653615. 1162. 127. 2164. 152. 1415. 3768. 1060. 1748. 4336. 1999.
# 2 EMP BOL 135452. 819. 37.6 604. 10.8 433. 893. 333. 321. NA 1057.
# 3 EMP BRA 3364925. 16451. 313. 11841. 388. 8154. 21860. 5169. 12011. 12149. 14235.
The weighted variance / standard deviation is currently only implemented with frequency weights.
Weighted aggregations may also be performed with collapg. By default fsum is used to compute the sum of the weights, but it is also possible to aggregate the weights with other functions:
# This aggregates numeric columns using the weighted mean (the default) and categorical columns using the weighted mode (the default).
# Weights (column SUM) are aggregated using both the sum and the maximum.
GGDC10S %>% group_by(Variable, Country) %>%
collapg(w = SUM, wFUN = list(fsum, fmax)) %>% head(3)
# # A tibble: 3 x 17
# Variable Country fsum.SUM fmax.SUM Regioncode Region Year AGR MIN MAN PU CON WRT
# <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 653615. 17929. LAM Latin~ 1985. 1361. 56.5 1935. 105. 811. 2217.
# 2 EMP BOL 135452. 4508. LAM Latin~ 1987. 977. 57.9 296. 7.07 167. 400.
# 3 EMP BRA 3364925. 102572. LAM Latin~ 1989. 17746. 238. 8466. 389. 4436. 11376.
# # ... with 4 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>
collapse also provides some fast transformations that significantly extend the scope and speed of manipulations that can be performed with dplyr::mutate.
The function ftransform can be used to manipulate columns in the same way as mutate:
GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
ftransform(AGR_perc = AGR / SUM * 100, # Computing % of VA in Agriculture
AGR_mean = fmean(AGR), # Average Agricultural VA
AGR = NULL, SUM = NULL) %>% # Deleting columns AGR and SUM
head
# Country Year AGR_perc AGR_mean
# 1 BWA 1960 NA 5137561
# 2 BWA 1961 NA 5137561
# 3 BWA 1962 NA 5137561
# 4 BWA 1963 NA 5137561
# 5 BWA 1964 43.49132 5137561
# 6 BWA 1965 39.96990 5137561
The modification brought by ftransformv enables transformations of groups of columns, like dplyr::mutate_at and dplyr::mutate_if:
# This replaces variables mpg, carb and wt by their log (.c turns expressions into character vectors)
mtcars %>% ftransformv(.c(mpg, carb, wt), log) %>% head
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 3.044522 6 160 110 3.90 0.9631743 16.46 0 1 4 1.3862944
# Mazda RX4 Wag 3.044522 6 160 110 3.90 1.0560527 17.02 0 1 4 1.3862944
# Datsun 710 3.126761 4 108 93 3.85 0.8415672 18.61 1 1 4 0.0000000
# Hornet 4 Drive 3.063391 6 258 110 3.08 1.1678274 19.44 1 0 3 0.0000000
# Hornet Sportabout 2.928524 8 360 175 3.15 1.2354715 17.02 0 0 3 0.6931472
# Valiant 2.895912 6 225 105 2.76 1.2412686 20.22 1 0 3 0.0000000
# Logging numeric variables
iris %>% ftransformv(is.numeric, log) %>% head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 1.629241 1.252763 0.3364722 -1.6094379 setosa
# 2 1.589235 1.098612 0.3364722 -1.6094379 setosa
# 3 1.547563 1.163151 0.2623643 -1.6094379 setosa
# 4 1.526056 1.131402 0.4054651 -1.6094379 setosa
# 5 1.609438 1.280934 0.3364722 -1.6094379 setosa
# 6 1.686399 1.360977 0.5306283 -0.9162907 setosa
Instead of column = value type arguments, it is also possible to pass a single list of transformed variables to ftransform, which will be regarded in the same way as an evaluated list of column = value arguments. This can be used for more complex transformations:
# Logging values and replacing generated Inf values
mtcars %>% ftransform(fselect(., mpg, cyl, vs:gear) %>% lapply(log) %>% replace_Inf) %>% head
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 3.044522 1.791759 160 110 3.90 2.620 16.46 NA 0 1.386294 4
# Mazda RX4 Wag 3.044522 1.791759 160 110 3.90 2.875 17.02 NA 0 1.386294 4
# Datsun 710 3.126761 1.386294 108 93 3.85 2.320 18.61 0 0 1.386294 1
# Hornet 4 Drive 3.063391 1.791759 258 110 3.08 3.215 19.44 0 NA 1.098612 1
# Hornet Sportabout 2.928524 2.079442 360 175 3.15 3.440 17.02 NA NA 1.098612 2
# Valiant 2.895912 1.791759 225 105 2.76 3.460 20.22 0 NA 1.098612 1
If only the computed columns need to be returned, fcompute provides an efficient alternative:
GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
fcompute(AGR_perc = AGR / SUM * 100,
AGR_mean = fmean(AGR)) %>% head
# AGR_perc AGR_mean
# 1 NA 5137561
# 2 NA 5137561
# 3 NA 5137561
# 4 NA 5137561
# 5 43.49132 5137561
# 6 39.96990 5137561
ftransform and fcompute are an order of magnitude faster than mutate, but they do not support grouped computations using arbitrary functions. We will see that this is hardly a limitation, as collapse provides very efficient and elegant alternative programming mechanisms…
All statistical (scalar-valued) functions in the collapse package (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fNobs, fNdistinct) have a TRA argument which can be used to efficiently transform data, by either (column-wise) replacing data values with computed statistics or sweeping the statistics out of the data. Operations can be specified using either an integer or a quoted operator / string (a quick check of this equivalence follows the list below). The 10 operations supported by TRA are:
1 - “replace_fill” : replace and overwrite missing values (same as mutate)
2 - “replace” : replace but preserve missing values
3 - “-” : subtract (center)
4 - “-+” : subtract group-statistics but add average of group statistics
5 - “/” : divide (scale)
6 - “%” : compute percentages (divide and multiply by 100)
7 - “+” : add
8 - “*” : multiply
9 - “%%” : modulus
10 - “-%%” : subtract modulus
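Since the integer codes follow the numbering of this list, integer and string specifications should be interchangeable; a quick check (my own sketch):

all.equal(fmean(mtcars, TRA = 3L), fmean(mtcars, TRA = "-"))  # 3 corresponds to "-"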
Simple transformations are again straightforward to specify:
# This subtracts the median value from all data points i.e. centers on the median
GGDC10S %>% num_vars %>% fmedian(TRA = "-") %>% head
# Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# 1 -22 NA NA NA NA NA NA NA NA NA
# 2 -21 NA NA NA NA NA NA NA NA NA
# 3 -20 NA NA NA NA NA NA NA NA NA
# 4 -19 NA NA NA NA NA NA NA NA NA
# 5 -18 -4378.218 -169.7294 -3717.362 -167.8456 -1472.787 -3767.399 -1173.141 -959.0059 -3923.690
# 6 -17 -4378.792 -170.7277 -3717.080 -167.8149 -1472.101 -3766.578 -1172.861 -958.8783 -3922.817
# OTH SUM
# 1 NA NA
# 2 NA NA
# 3 NA NA
# 4 NA NA
# 5 -1430.831 -23148.71
# 6 -1430.494 -23146.85
# This replaces all data points with the mode
GGDC10S %>% char_vars %>% fmode(TRA = "replace") %>% head
# Country Regioncode Region Variable
# 1 USA ASI Asia EMP
# 2 USA ASI Asia EMP
# 3 USA ASI Asia EMP
# 4 USA ASI Asia EMP
# 5 USA ASI Asia EMP
# 6 USA ASI Asia EMP
Similarly for grouped transformations:
# Replacing data with the 1st quartile (25%)
GGDC10S %>%
fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% fnth(0.25, TRA = "replace_fill") %>% head(3)
# # A tibble: 3 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA 61.3 21.7 23.1 6.31 23.2 26.7 8.98 11.3 27.0 10.1 220.
# 2 VA BWA 61.3 21.7 23.1 6.31 23.2 26.7 8.98 11.3 27.0 10.1 220.
# 3 VA BWA 61.3 21.7 23.1 6.31 23.2 26.7 8.98 11.3 27.0 10.1 220.
# Scaling sectoral data by Variable and Country
GGDC10S %>%
fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% fsd(TRA = "/") %>% head
# # A tibble: 6 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 0.0270 5.56e-4 5.23e-4 3.88e-4 5.11e-4 0.00194 0.00154 5.23e-4 0.00134
# 6 VA BWA 0.0260 3.97e-4 7.23e-4 5.03e-4 1.04e-3 0.00220 0.00180 5.83e-4 0.00158
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>
The benchmarks below will demonstrate that these internal sweeping and replacement operations, performed entirely in C++, compute significantly faster than using dplyr::mutate, especially as the number of groups grows large. The S3 generic nature of the Fast Statistical Functions further allows us to perform grouped mutations on the fly (together with ftransform or fcompute), without the need to first create a grouped tibble:
# AGR_gmed = TRUE if AGR is greater than its median value, grouped by Variable and Country
# Note: This calls fmedian.default
settransform(GGDC10S, AGR_gmed = AGR > fmedian(AGR, list(Variable, Country), TRA = "replace"))
tail(GGDC10S, 3)
# Country Regioncode Region Variable Year AGR MIN MAN
# 5025 EGY MENA Middle East and North Africa EMP 2010 5205.529 28.99641 2435.549
# 5026 EGY MENA Middle East and North Africa EMP 2011 5185.919 27.56394 2373.814
# 5027 EGY MENA Middle East and North Africa EMP 2012 5160.590 24.78083 2348.434
# PU CON WRT TRA FIRE GOV OTH SUM AGR_gmed
# 5025 307.2712 2732.953 2977.063 1992.274 801.2984 5538.946 NA 22019.88 TRUE
# 5026 317.9979 2795.264 3020.236 2048.335 814.7403 5635.522 NA 22219.39 TRUE
# 5027 324.9332 2931.196 3109.522 2065.004 832.4770 5735.623 NA 22532.56 TRUE
# Dividing (scaling) the sectoral data (columns 6 through 16) by their grouped standard deviation
settransformv(GGDC10S, 6:16, fsd, list(Variable, Country), TRA = "/", apply = FALSE)
tail(GGDC10S, 3)
# Country Regioncode Region Variable Year AGR MIN MAN
# 5025 EGY MENA Middle East and North Africa EMP 2010 8.410070 2.279497 4.320400
# 5026 EGY MENA Middle East and North Africa EMP 2011 8.378388 2.166885 4.210888
# 5027 EGY MENA Middle East and North Africa EMP 2012 8.337467 1.948097 4.165868
# PU CON WRT TRA FIRE GOV OTH SUM AGR_gmed
# 5025 3.556820 3.621614 3.754698 3.750569 3.138243 3.795534 NA 4.441450 TRUE
# 5026 3.680987 3.704186 3.809148 3.856107 3.190888 3.861712 NA 4.481691 TRUE
# 5027 3.761267 3.884319 3.921756 3.887489 3.260353 3.930305 NA 4.544858 TRUE
Weights are easily added to any grouped transformation:
# This subtracts weighted group means from the data, using the SUM column as weights
GGDC10S %>%
fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% fmean(SUM, "-") %>% head
# # A tibble: 6 x 13
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 0.00180 -2.15 -2.12 -2.11 -1.97 -2.12 -2.03 -2.00 -2.07 -2.09 -2.02
# 6 VA BWA 0.00189 -2.16 -2.12 -2.11 -1.97 -2.12 -2.03 -2.00 -2.07 -2.09 -2.02
Sequential operations are also easily performed:
# This scales and then subtracts the median
GGDC10S %>%
fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% fsd(TRA = "/") %>% fmedian(TRA = "-")
# # A tibble: 5,027 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA -0.182 -0.235 -0.183 -0.245 -0.118 -0.0820 -0.0724 -0.0661 -0.108 -0.0848 -0.146
# 6 VA BWA -0.183 -0.235 -0.183 -0.245 -0.117 -0.0817 -0.0722 -0.0660 -0.108 -0.0846 -0.146
# 7 VA BWA -0.180 -0.235 -0.183 -0.245 -0.117 -0.0813 -0.0720 -0.0659 -0.107 -0.0843 -0.145
# 8 VA BWA -0.177 -0.235 -0.183 -0.245 -0.117 -0.0826 -0.0724 -0.0659 -0.107 -0.0841 -0.146
# 9 VA BWA -0.174 -0.235 -0.183 -0.245 -0.117 -0.0823 -0.0717 -0.0661 -0.108 -0.0848 -0.146
# 10 VA BWA -0.173 -0.234 -0.182 -0.243 -0.115 -0.0821 -0.0715 -0.0660 -0.108 -0.0846 -0.145
# # ... with 5,017 more rows
Of course it is also possible to combine multiple functions as in the aggregation section, or to add variables to existing data:
# This adds a groupwise observation count next to each column
add_vars(GGDC10S, seq(7,27,2)) <- GGDC10S %>%
fgroup_by(Variable, Country) %>% fselect(AGR:SUM) %>%
fNobs("replace_fill") %>% add_stub("N_")
head(GGDC10S)
# Country Regioncode Region Variable Year AGR N_AGR MIN N_MIN
# 1 BWA SSA Sub-saharan Africa VA 1960 NA 47 NA 47
# 2 BWA SSA Sub-saharan Africa VA 1961 NA 47 NA 47
# 3 BWA SSA Sub-saharan Africa VA 1962 NA 47 NA 47
# 4 BWA SSA Sub-saharan Africa VA 1963 NA 47 NA 47
# 5 BWA SSA Sub-saharan Africa VA 1964 0.02699257 47 0.0005557549 47
# 6 BWA SSA Sub-saharan Africa VA 1965 0.02604122 47 0.0003969678 47
# MAN N_MAN PU N_PU CON N_CON WRT N_WRT TRA N_TRA
# 1 NA 47 NA 47 NA 47 NA 47 NA 47
# 2 NA 47 NA 47 NA 47 NA 47 NA 47
# 3 NA 47 NA 47 NA 47 NA 47 NA 47
# 4 NA 47 NA 47 NA 47 NA 47 NA 47
# 5 0.0005231589 47 0.0003884954 47 0.0005107143 47 0.001939917 47 0.001541949 47
# 6 0.0007231902 47 0.0005027587 47 0.0010416550 47 0.002195030 47 0.001802278 47
# FIRE N_FIRE GOV N_GOV OTH N_OTH SUM N_SUM AGR_gmed
# 1 NA 47 NA 47 NA 47 NA 47 NA
# 2 NA 47 NA 47 NA 47 NA 47 NA
# 3 NA 47 NA 47 NA 47 NA 47 NA
# 4 NA 47 NA 47 NA 47 NA 47 NA
# 5 0.0005234982 47 0.001336269 47 0.001810803 47 0.001799950 47 FALSE
# 6 0.0005831804 47 0.001578270 47 0.002071448 47 0.001889501 47 FALSE
rm(GGDC10S)
There are lots of other examples one could construct using the 10 operations and 14 functions listed above; the examples provided just outline the suggested programming basics. Performance considerations make it very much worthwhile to spend some time thinking about how complex operations can be implemented in this programming framework, before defining some function in R and applying it to data using dplyr::mutate.
The TRA Function

Towards this end, calling TRA() directly also facilitates more complex and customized operations. Behind the scenes of the TRA = ... argument, the Fast Statistical Functions first compute the grouped statistics on all columns of the data, and these statistics are then directly fed into a C++ function that uses them to replace or sweep them out of the data points in one of the 10 ways described above. This function can also be called directly by the name TRA.
Fundamentally, TRA is a generalization of base::sweep for column-wise grouped operations. Direct calls to TRA enable more control over inputs and outputs.
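To see the analogy without grouping, the following sketch (my own, not from the original text) compares base::sweep with a direct TRA call on a data frame; both should center the columns:

all.equal(TRA(mtcars, fmean(mtcars, drop = FALSE), "-"),
          sweep(mtcars, 2, fmean(mtcars), "-"), check.attributes = FALSE)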
The two operations below are equivalent, although the first is slightly more efficient as it only requires one method dispatch and one check of the inputs:
# This divides by the product
GGDC10S %>%
fgroup_by(Variable, Country) %>%
get_vars(6:16) %>% fprod(TRA = "/") %>% head
# # A tibble: 6 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92 1.01e-97 2.51e-117
# 6 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92 1.13e-97 2.96e-117
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>
# Same thing
GGDC10S %>%
fgroup_by(Variable, Country) %>%
get_vars(6:16) %>%
TRA(fprod(., keep.group_vars = FALSE), "/") %>% head # [same as TRA(.,fprod(., keep.group_vars = FALSE),"/")]
# # A tibble: 6 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92 1.01e-97 2.51e-117
# 6 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92 1.13e-97 2.96e-117
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>
TRA.grouped_df was designed such that it matches the columns of the statistics (aggregated columns) to those of the original data, and only transforms matching columns while returning the whole data frame. Thus it is easily possible to apply a transformation to only the first two sectors:
# This only demeans Agriculture (AGR) and Mining (MIN)
GGDC10S %>%
fgroup_by(Variable, Country) %>%
TRA(fselect(., AGR, MIN) %>% fmean(keep.group_vars = FALSE), "-") %>% head
# # A tibble: 6 x 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s~ VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA SSA Sub-s~ VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA SSA Sub-s~ VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA SSA Sub-s~ VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA SSA Sub-s~ VA 1964 -446. -4505. 0.737 0.104 0.660 6.24 1.66 1.12 4.82
# 6 BWA SSA Sub-s~ VA 1965 -446. -4506. 1.02 0.135 1.35 7.06 1.94 1.25 5.70
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>
Since TRA is already built into all Fast Statistical Functions as an argument, it is best used in computations where grouped statistics are computed using some other function.
# Same as above, with one line of code using fmean.data.frame and ftransform...
GGDC10S %>% ftransform(fmean(list(AGR = AGR, MIN = MIN), list(Variable, Country), TRA = "-")) %>% head
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 1 BWA SSA Sub-saharan Africa VA 1960 NA NA NA NA
# 2 BWA SSA Sub-saharan Africa VA 1961 NA NA NA NA
# 3 BWA SSA Sub-saharan Africa VA 1962 NA NA NA NA
# 4 BWA SSA Sub-saharan Africa VA 1963 NA NA NA NA
# 5 BWA SSA Sub-saharan Africa VA 1964 -445.8739 -4505.178 0.7365696 0.1043936
# 6 BWA SSA Sub-saharan Africa VA 1965 -446.4485 -4506.176 1.0181992 0.1350976
# CON WRT TRA FIRE GOV OTH SUM
# 1 NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA
# 5 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229
# 6 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710
Another potential use of TRA is to do computations in two or more steps, for example if both aggregated and transformed data are needed, or if computations are more complex and involve other manipulations in-between the aggregating and sweeping parts:
# Get grouped tibble
gGGDC <- GGDC10S %>% fgroup_by(Variable, Country)
# Get aggregated data
gsumGGDC <- gGGDC %>% fselect(AGR:SUM) %>% fsum
head(gsumGGDC)
# # A tibble: 6 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 8.80e4 3230. 1.20e5 6307. 4.60e4 1.23e5 4.02e4 3.89e4 1.27e5 6.15e4 6.54e5
# 2 EMP BOL 5.88e4 3418. 1.43e4 326. 7.49e3 1.72e4 7.04e3 2.72e3 NA 2.41e4 1.35e5
# 3 EMP BRA 1.07e6 12773. 4.33e5 22604. 2.19e5 5.28e5 1.27e5 2.74e5 3.29e5 3.54e5 3.36e6
# 4 EMP BWA 8.84e3 493. 8.49e2 145. 1.19e3 1.71e3 3.93e2 7.21e2 2.87e3 1.30e3 1.85e4
# 5 EMP CHL 4.42e4 6389. 3.94e4 1850. 1.86e4 4.38e4 1.63e4 1.72e4 NA 6.32e4 2.51e5
# 6 EMP CHN 1.73e7 422972. 4.03e6 96364. 1.25e6 1.73e6 8.36e5 2.96e5 1.36e6 1.86e6 2.91e7
# Get transformed (scaled) data
head(TRA(gGGDC, gsumGGDC, "/"))
# # A tibble: 6 x 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s~ VA 1960 NA NA NA NA NA NA
# 2 BWA SSA Sub-s~ VA 1961 NA NA NA NA NA NA
# 3 BWA SSA Sub-s~ VA 1962 NA NA NA NA NA NA
# 4 BWA SSA Sub-s~ VA 1963 NA NA NA NA NA NA
# 5 BWA SSA Sub-s~ VA 1964 7.50e-4 1.65e-5 1.66e-5 1.03e-5 1.57e-5 6.82e-5
# 6 BWA SSA Sub-s~ VA 1965 7.24e-4 1.18e-5 2.30e-5 1.33e-5 3.20e-5 7.72e-5
# # ... with 5 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>, SUM <dbl>
As discussed, whether using the TRA argument to the Fast Statistical Functions or calling TRA directly, these data transformations are essentially a two-step process: statistics are first computed and then used to transform the original data.
Although both steps are efficiently done in C++, it would be even more efficient to do them in a single step, without materializing all the statistics before transforming the data. Such slightly more efficient functions are provided for the very commonly applied tasks of averaging and centering data by groups (widely known as ‘between’-group and ‘within’-group transformations), and scaling and centering data by groups (also known as ‘standardizing’ data).
The functions fbetween and fwithin are slightly more memory-efficient implementations of fmean invoked with different TRA options:
GGDC10S %>% # Same as ... %>% fmean(TRA = "replace")
fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fbetween %>% head(2)
# # A tibble: 2 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA NA
GGDC10S %>% # Same as ... %>% fmean(TRA = "replace_fill")
fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fbetween(fill = TRUE) %>% head(2)
# # A tibble: 2 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 462. 4509. 942. 216. 895. 1948. 635. 1359. 2373. 773. 14112.
# 2 462. 4509. 942. 216. 895. 1948. 635. 1359. 2373. 773. 14112.
GGDC10S %>% # Same as ... %>% fmean(TRA = "-")
fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fwithin %>% head(2)
# # A tibble: 2 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA NA
Apart from higher speed, fwithin has a mean argument to assign an arbitrary mean to centered data, the default being mean = 0. A very common choice for such an added mean is just the overall mean of the data, which can be added by invoking mean = "overall.mean":
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(Country, Variable, AGR:SUM) %>% fwithin(mean = "overall.mean") %>% head(3)
# # A tibble: 3 x 13
# Country Variable AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA NA NA NA NA NA NA NA NA NA NA NA
This can also be done using weights. The code below uses the SUM column as weights: for each variable and each group the weighted mean is subtracted out, and the overall weighted column mean is then added back to the centered columns. The SUM column itself is kept as is and added after the grouping columns.
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(Country, Variable, AGR:SUM) %>% fwithin(SUM, mean = "overall.mean") %>% head(3)
# # A tibble: 3 x 13
# Country Variable SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA NA NA NA NA NA NA NA NA NA NA NA
Another argument to fwithin is the theta parameter, allowing partial or quasi-demeaning operations, e.g. fwithin(gdata, theta = theta) is equal to gdata - theta * fbetween(gdata). This is particularly useful to prepare data for variance-components (also known as ‘random-effects’) estimation.
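A quick numeric check of this identity on a toy vector (my own sketch; theta = 0.5 is arbitrary):

x <- rnorm(10); g <- rep(1:2, each = 5)
all.equal(fwithin(x, g, theta = 0.5), x - 0.5 * fbetween(x, g))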
Apart from fbetween and fwithin, the function fscale exists to efficiently scale and center data, avoiding sequential calls such as ... %>% fsd(TRA = "/") %>% fmean(TRA = "-").
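The equivalence to the sequential calls can be sketched on a toy vector (my own check):

x <- rnorm(10); g <- rep(1:2, each = 5)
all.equal(fscale(x, g), fmean(fsd(x, g, TRA = "/"), g, TRA = "-"))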
# This efficiently scales and centers (i.e. standardizes) the data
GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(Country, Variable, AGR:SUM) %>% fscale
# # A tibble: 5,027 x 13
# Country Variable AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 5 BWA VA -0.738 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 6 BWA VA -0.739 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 7 BWA VA -0.736 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.595 -0.676
# 8 BWA VA -0.734 -0.717 -0.668 -0.805 -0.692 -0.604 -0.589 -0.635 -0.655 -0.595 -0.676
# 9 BWA VA -0.730 -0.717 -0.668 -0.805 -0.692 -0.604 -0.588 -0.635 -0.656 -0.596 -0.676
# 10 BWA VA -0.729 -0.716 -0.667 -0.803 -0.690 -0.603 -0.588 -0.635 -0.656 -0.596 -0.675
# # ... with 5,017 more rows
fscale also has additional mean and sd arguments allowing the user to (group-)scale data to an arbitrary mean and standard deviation. Setting mean = FALSE just scales the data while preserving the means, and is thus different from fsd(..., TRA = "/"), which simply divides all values by the standard deviation:
# Saving grouped tibble
gGGDC <- GGDC10S %>%
fgroup_by(Variable, Country) %>%
fselect(Country, Variable, AGR:SUM)
# Original means
head(fmean(gGGDC))
# # A tibble: 6 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35 123. 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09 25.3 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 29.4 296. 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1606. 20852. 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
# Mean Preserving Scaling
head(fmean(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35 123. 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09 25.3 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 29.4 296. 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1606. 20852. 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
head(fsd(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1. 1. 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.
# 2 EMP BOL 1. 1.00 1. 1.00 1.00 1. 1. 1. NA 1. 1.
# 3 EMP BRA 1. 1. 1. 1.00 1. 1.00 1.00 1.00 1. 1.00 1.00
# 4 EMP BWA 1.00 1.00 1. 1. 1. 1.00 1. 1.00 1. 1.00 1.00
# 5 EMP CHL 1. 1. 1.00 1. 1. 1. 1.00 1. NA 1. 1.00
# 6 EMP CHN 1. 1. 1. 1.00 1.00 1. 1. 1. 1.00 1.00 1.
One can also set mean = "overall.mean", which group-centers columns on the overall mean, as illustrated with fwithin. Another interesting option is setting sd = "within.sd". This group-scales data such that every group has a standard deviation equal to the within- standard deviation of the data:
# Just using VA data for this example
gGGDC <- GGDC10S %>%
fsubset(Variable == "VA", Country, AGR:SUM) %>%
fgroup_by(Country)
# This calculates the within- standard deviation for all columns
fsd(num_vars(ungroup(fwithin(gGGDC))))
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# 45046972 40122220 75608708 3062688 30811572 44125207 20676901 16030868 20358973 18780869
# SUM
# 306429102
# This scales all groups to take on the within- standard deviation while preserving group means
fsd(fscale(gGGDC, mean = FALSE, sd = "within.sd"))
# # A tibble: 43 x 12
# Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 ARG 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 2 BOL 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8
# 3 BRA 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 4 BWA 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 5 CHL 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8
# 6 CHN 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 7 COL 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8
# 8 CRI 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 9 DEW 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 10 DNK 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# # ... with 33 more rows
A grouped scaling operation with both mean = "overall.mean" and sd = "within.sd" thus efficiently achieves a harmonization of all groups in the first two moments, without changing the fundamental properties (in terms of level and scale) of the data.
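Concretely, such a harmonization is a single call (a sketch reusing the grouped tibble gGGDC from above):

# Every group is centered on the overall mean and scaled to the within- standard deviation
harmonized <- fscale(gGGDC, mean = "overall.mean", sd = "within.sd")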
This section introduces 3 further powerful collapse functions: flag, fdiff and fgrowth. The first function, flag, efficiently computes sequences of fully-identified lags and leads on time series and panel data. The following code computes 1 fully-identified panel-lag and 1 fully-identified panel-lead of each variable in the data:
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% flag(-1:1, Year)
# # A tibble: 5,027 x 36
# Country Variable Year F1.AGR AGR L1.AGR F1.MIN MIN L1.MIN F1.MAN MAN L1.MAN F1.PU PU
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 16.3 NA NA 3.49 NA NA 0.737 NA NA 0.104 NA
# 5 BWA VA 1964 15.7 16.3 NA 2.50 3.49 NA 1.02 0.737 NA 0.135 0.104
# 6 BWA VA 1965 17.7 15.7 16.3 1.97 2.50 3.49 0.804 1.02 0.737 0.203 0.135
# 7 BWA VA 1966 19.1 17.7 15.7 2.30 1.97 2.50 0.938 0.804 1.02 0.203 0.203
# 8 BWA VA 1967 21.1 19.1 17.7 1.84 2.30 1.97 0.750 0.938 0.804 0.203 0.203
# 9 BWA VA 1968 21.9 21.1 19.1 5.24 1.84 2.30 2.14 0.750 0.938 0.578 0.203
# 10 BWA VA 1969 23.1 21.9 21.1 10.2 5.24 1.84 4.15 2.14 0.750 1.12 0.578
# # ... with 5,017 more rows, and 22 more variables: L1.PU <dbl>, F1.CON <dbl>, CON <dbl>,
# # L1.CON <dbl>, F1.WRT <dbl>, WRT <dbl>, L1.WRT <dbl>, F1.TRA <dbl>, TRA <dbl>, L1.TRA <dbl>,
# # F1.FIRE <dbl>, FIRE <dbl>, L1.FIRE <dbl>, F1.GOV <dbl>, GOV <dbl>, L1.GOV <dbl>, F1.OTH <dbl>,
# # OTH <dbl>, L1.OTH <dbl>, F1.SUM <dbl>, SUM <dbl>, L1.SUM <dbl>
If the time variable passed does not exactly identify the data (i.e. because of gaps or repeated values within groups), all three functions will issue appropriate error messages. flag, fdiff and fgrowth support unbalanced panels, with different start and end periods and durations of coverage for each individual, but not irregular panels. A workaround for such panels exists with the function seqid, which generates a new panel-id identifying consecutive time sequences at the sub-individual level; see ?seqid.
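For example, a minimal sketch, assuming a hypothetical irregular panel irrdata with an individual identifier id and an (integer) Year variable:
# seqid() starts a new id whenever the sorted Year sequence is interrupted,
# so each consecutive run of years becomes its own group
irrdata %>%
  roworder(id, Year) %>%
  ftransform(tid = seqid(Year)) %>%
  fgroup_by(tid) %>%
  flag(1, Year)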
It is also possible to omit the time variable if one is certain that the data is sorted:
GGDC10S %>%
fselect(Variable, Country, AGR:SUM) %>%
fgroup_by(Variable, Country) %>% flag
# # A tibble: 5,027 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 6 VA BWA 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82 2.34 37.5
# 7 VA BWA 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70 2.68 39.3
# 8 VA BWA 17.7 1.97 0.804 0.203 1.35 8.27 2.15 1.36 6.37 2.99 43.1
# 9 VA BWA 19.1 2.30 0.938 0.203 0.897 4.31 1.72 1.54 7.04 3.31 41.4
# 10 VA BWA 21.1 1.84 0.750 0.203 1.22 5.17 2.44 1.03 5.03 2.36 41.1
# # ... with 5,017 more rows
fdiff computes sequences of lagged / leaded and iterated differences, as well as quasi-differences and log-differences, on time series and panel data. The code below computes the 1- and 10-year first and second differences of each variable in the data:
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fdiff(c(1, 10), 1:2, Year)
# # A tibble: 5,027 x 47
# Country Variable Year D1.AGR D2.AGR L10D1.AGR L10D2.AGR D1.MIN D2.MIN L10D1.MIN L10D2.MIN D1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA
# 6 BWA VA 1965 -0.575 NA NA NA -0.998 NA NA NA 0.282
# 7 BWA VA 1966 1.95 2.53 NA NA -0.525 0.473 NA NA -0.214
# 8 BWA VA 1967 1.47 -0.488 NA NA 0.328 0.854 NA NA 0.134
# 9 BWA VA 1968 1.95 0.488 NA NA -0.460 -0.788 NA NA -0.188
# 10 BWA VA 1969 0.763 -1.19 NA NA 3.41 3.87 NA NA 1.39
# # ... with 5,017 more rows, and 35 more variables: D2.MAN <dbl>, L10D1.MAN <dbl>, L10D2.MAN <dbl>,
# # D1.PU <dbl>, D2.PU <dbl>, L10D1.PU <dbl>, L10D2.PU <dbl>, D1.CON <dbl>, D2.CON <dbl>,
# # L10D1.CON <dbl>, L10D2.CON <dbl>, D1.WRT <dbl>, D2.WRT <dbl>, L10D1.WRT <dbl>, L10D2.WRT <dbl>,
# # D1.TRA <dbl>, D2.TRA <dbl>, L10D1.TRA <dbl>, L10D2.TRA <dbl>, D1.FIRE <dbl>, D2.FIRE <dbl>,
# # L10D1.FIRE <dbl>, L10D2.FIRE <dbl>, D1.GOV <dbl>, D2.GOV <dbl>, L10D1.GOV <dbl>,
# # L10D2.GOV <dbl>, D1.OTH <dbl>, D2.OTH <dbl>, L10D1.OTH <dbl>, L10D2.OTH <dbl>, D1.SUM <dbl>,
# # D2.SUM <dbl>, L10D1.SUM <dbl>, L10D2.SUM <dbl>
Log-differences of the form \(\log(x_t) - \log(x_{t-s})\) are also easily computed:
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fdiff(c(1, 10), 1, Year, log = TRUE)
# # A tibble: 5,027 x 25
# Country Variable Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA
# 6 BWA VA 1965 -0.0359 NA -0.336 NA 0.324 NA
# 7 BWA VA 1966 0.117 NA -0.236 NA -0.236 NA
# 8 BWA VA 1967 0.0796 NA 0.154 NA 0.154 NA
# 9 BWA VA 1968 0.0972 NA -0.223 NA -0.223 NA
# 10 BWA VA 1969 0.0355 NA 1.05 NA 1.05 NA
# # ... with 5,017 more rows, and 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>,
# # Dlog1.CON <dbl>, L10Dlog1.CON <dbl>, Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>,
# # L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>, L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>,
# # Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>, Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>
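On a plain vector the equivalence with base R is easy to see (a sketch using the default vector method):
x <- c(2, 4, 8, 16)
fdiff(x, log = TRUE) # NA 0.693 0.693 0.693, i.e. the same as c(NA, diff(log(x)))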
Finally, it is also possible to compute quasi-differences and quasi-log-differences of the form \(x_t - \rho x_{t-s}\) or \(\log(x_t) - \rho \log(x_{t-s})\):
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fdiff(t = Year, rho = 0.95)
# # A tibble: 5,027 x 14
# Country Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA NA
# 6 BWA VA 1965 0.241 -0.824 0.318 0.0359 0.719 1.13 0.363 0.184 1.11 0.454
# 7 BWA VA 1966 2.74 -0.401 -0.163 0.0743 0.0673 1.56 0.312 0.174 0.955 0.449
# 8 BWA VA 1967 2.35 0.427 0.174 0.0101 -0.381 -3.55 -0.323 0.246 0.988 0.465
# 9 BWA VA 1968 2.91 -0.345 -0.141 0.0101 0.365 1.08 0.804 -0.427 -1.66 -0.780
# 10 BWA VA 1969 1.82 3.50 1.43 0.385 2.32 0.841 0.397 0.252 0.818 0.385
# # ... with 5,017 more rows, and 1 more variable: SUM <dbl>
The quasi-differencing feature was added to fdiff to facilitate the preparation of time series and panel data for least-squares estimation in the presence of serial correlation, following Cochrane & Orcutt (1949).
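A stylized two-step workflow in this spirit might look as follows. This is only an illustrative sketch, assuming a hypothetical data frame df with a response y, a regressor x and a time variable t:
# Step 1: estimate the AR(1) coefficient rho from the residuals
r <- resid(lm(y ~ x, data = df))
rho <- cor(r[-1], r[-length(r)])
# Step 2: re-estimate the model on quasi-differenced data (the first row drops as NA)
lm(fdiff(y, t = t, rho = rho) ~ fdiff(x, t = t, rho = rho), data = df)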
Finally, fgrowth computes growth rates in the same way. By default, exact growth rates are computed in percentage terms using \((x_t - x_{t-s}) / x_{t-s} \times 100\) (the default argument is scale = 100). The user can also request growth rates obtained by log-differencing, using \(\log(x_t / x_{t-s}) \times 100\).
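On a simple vector the two variants compare as follows (a sketch using the default vector method):
x <- c(100, 110, 121)
fgrowth(x)                 # NA 10 10 : exact growth rates in percent
fgrowth(x, logdiff = TRUE) # NA 9.53 9.53 : log-difference approximation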
# Exact growth rates, computed as: (x/lag(x) - 1) * 100
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year)
# # A tibble: 5,027 x 25
# Country Variable Year G1.AGR L10G1.AGR G1.MIN L10G1.MIN G1.MAN L10G1.MAN G1.PU L10G1.PU G1.CON
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA
# 6 BWA VA 1965 -3.52 NA -28.6 NA 38.2 NA 29.4 NA 104.
# 7 BWA VA 1966 12.4 NA -21.1 NA -21.1 NA 50.0 NA 0
# 8 BWA VA 1967 8.29 NA 16.7 NA 16.7 NA 0 NA -33.3
# 9 BWA VA 1968 10.2 NA -20 NA -20 NA 0 NA 35.7
# 10 BWA VA 1969 3.61 NA 185. NA 185. NA 185. NA 185.
# # ... with 5,017 more rows, and 13 more variables: L10G1.CON <dbl>, G1.WRT <dbl>, L10G1.WRT <dbl>,
# # G1.TRA <dbl>, L10G1.TRA <dbl>, G1.FIRE <dbl>, L10G1.FIRE <dbl>, G1.GOV <dbl>, L10G1.GOV <dbl>,
# # G1.OTH <dbl>, L10G1.OTH <dbl>, G1.SUM <dbl>, L10G1.SUM <dbl>
# Log-difference growth rates, computed as: log(x / lag(x)) * 100
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year, logdiff = TRUE)
# # A tibble: 5,027 x 25
# Country Variable Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA
# 6 BWA VA 1965 -3.59 NA -33.6 NA 32.4 NA
# 7 BWA VA 1966 11.7 NA -23.6 NA -23.6 NA
# 8 BWA VA 1967 7.96 NA 15.4 NA 15.4 NA
# 9 BWA VA 1968 9.72 NA -22.3 NA -22.3 NA
# 10 BWA VA 1969 3.55 NA 105. NA 105. NA
# # ... with 5,017 more rows, and 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>,
# # Dlog1.CON <dbl>, L10Dlog1.CON <dbl>, Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>,
# # L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>, L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>,
# # Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>, Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>
fdiff and fgrowth can also perform leaded (forward) differences and growth rates (e.g. ... %>% fgrowth(-c(1, 10), 1:2, Year) would compute 1- and 10-year leaded growth rates, as well as growth rates of growth rates). Again it is possible to perform sequential operations:
# This computes the 1 and 10-year growth rates, for the current period and lagged by one period
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year) %>% flag(0:1, Year)
# # A tibble: 5,027 x 47
# Country Variable Year G1.AGR L1.G1.AGR L10G1.AGR L1.L10G1.AGR G1.MIN L1.G1.MIN L10G1.MIN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA
# 6 BWA VA 1965 -3.52 NA NA NA -28.6 NA NA
# 7 BWA VA 1966 12.4 -3.52 NA NA -21.1 -28.6 NA
# 8 BWA VA 1967 8.29 12.4 NA NA 16.7 -21.1 NA
# 9 BWA VA 1968 10.2 8.29 NA NA -20 16.7 NA
# 10 BWA VA 1969 3.61 10.2 NA NA 185. -20 NA
# # ... with 5,017 more rows, and 37 more variables: L1.L10G1.MIN <dbl>, G1.MAN <dbl>,
# # L1.G1.MAN <dbl>, L10G1.MAN <dbl>, L1.L10G1.MAN <dbl>, G1.PU <dbl>, L1.G1.PU <dbl>,
# # L10G1.PU <dbl>, L1.L10G1.PU <dbl>, G1.CON <dbl>, L1.G1.CON <dbl>, L10G1.CON <dbl>,
# # L1.L10G1.CON <dbl>, G1.WRT <dbl>, L1.G1.WRT <dbl>, L10G1.WRT <dbl>, L1.L10G1.WRT <dbl>,
# # G1.TRA <dbl>, L1.G1.TRA <dbl>, L10G1.TRA <dbl>, L1.L10G1.TRA <dbl>, G1.FIRE <dbl>,
# # L1.G1.FIRE <dbl>, L10G1.FIRE <dbl>, L1.L10G1.FIRE <dbl>, G1.GOV <dbl>, L1.G1.GOV <dbl>,
# # L10G1.GOV <dbl>, L1.L10G1.GOV <dbl>, G1.OTH <dbl>, L1.G1.OTH <dbl>, L10G1.OTH <dbl>,
# # L1.L10G1.OTH <dbl>, G1.SUM <dbl>, L1.G1.SUM <dbl>, L10G1.SUM <dbl>, L1.L10G1.SUM <dbl>
This section seeks to demonstrate that the functionality introduced in the preceding two sections indeed produces code that evaluates substantially faster than native dplyr.
To do this properly, the different components of a typical piped call (selecting / subsetting, ordering, grouping, and performing some computation) are benchmarked separately on two different data sizes.
All benchmarks are run on a Windows 8.1 laptop with a 2x 2.2 GHz Intel i5 processor, 8 GB DDR3 RAM and a Samsung 850 EVO SSD.
Benchmarks are run on the original GGDC10S data used throughout this vignette and on a larger dataset with approximately 1 million observations, obtained by replicating and row-binding GGDC10S 200 times while maintaining unique groups.
# This shows the groups in GGDC10S
GRP(GGDC10S, ~ Variable + Country)
# collapse grouping object of length 5027 with 85 ordered groups
#
# Call: GRP.default(X = GGDC10S, by = ~Variable + Country), X is unordered
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.00 53.00 62.00 59.14 63.00 65.00
#
# Groups with sizes:
# EMP.ARG EMP.BOL EMP.BRA EMP.BWA EMP.CHL EMP.CHN
# 62 61 62 52 63 62
# ---
# VA.TWN VA.TZA VA.USA VA.VEN VA.ZAF VA.ZMB
# 63 52 65 63 52 52
# This replicates the data 200 times
data <- replicate(200, GGDC10S, simplify = FALSE)
# This function appends a number i to the Country and Variable columns of a dataset
uniquify <- function(x, i) ftransform(x, lapply(unclass(x)[c(1, 4)], paste0, i))
# Making datasets unique and row-binding them
data <- unlist2d(Map(uniquify, data, as.list(1:200)), idcols = FALSE)
fdim(data)
# [1] 1005400 16
# This shows the groups in the replicated data
GRP(data, ~ Variable + Country)
# collapse grouping object of length 1005400 with 17000 ordered groups
#
# Call: GRP.default(X = data, by = ~Variable + Country), X is unordered
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.00 53.00 62.00 59.14 63.00 65.00
#
# Groups with sizes:
# EMP1.ARG1 EMP1.BOL1 EMP1.BRA1 EMP1.BWA1 EMP1.CHL1 EMP1.CHN1
# 62 61 62 52 63 62
# ---
# VA99.TWN99 VA99.TZA99 VA99.USA99 VA99.VEN99 VA99.ZAF99 VA99.ZMB99
# 63 52 65 63 52 52
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1925730 102.9 3670961 196.1 3670961 196.1
# Vcells 19873576 151.7 28319693 216.1 23047670 175.9
## Selecting columns
# Small
microbenchmark(dplyr = select(GGDC10S, Country, Variable, AGR:SUM),
collapse = fselect(GGDC10S, Country, Variable, AGR:SUM))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 2785.476 2990.972 3785.35528 3269.654 4031.399 19673.258 100 b
# collapse 11.156 16.512 31.27768 29.676 38.601 189.655 100 a
# Large
microbenchmark(dplyr = select(data, Country, Variable, AGR:SUM),
collapse = fselect(data, Country, Variable, AGR:SUM))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 2922.474 3169.471 3455.85976 3354.8875 3627.991 6057.808 100 b
# collapse 11.157 16.065 29.23833 33.2455 37.485 85.679 100 a
## Subsetting columns
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA"),
collapse = fsubset(GGDC10S, Variable == "VA"))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 1399.432 1577.485 1727.9992 1696.1860 1851.0340 2323.609 100 b
# collapse 153.063 181.400 257.4582 226.6935 331.1155 477.038 100 a
# Large
microbenchmark(dplyr = filter(data, Variable == "VA"),
collapse = fsubset(data, Variable == "VA"))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 17.453178 17.687011 23.874368 17.869526 21.784454 190.28073 100 b
# collapse 7.778537 8.092695 9.424132 8.168334 8.335453 29.07569 100 a
## Ordering rows
# Small
microbenchmark(dplyr = arrange(GGDC10S, desc(Country), Variable, Year),
collapse = roworder(GGDC10S, -Country, Variable, Year))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 8184.175 8458.6175 8971.686 8711.6395 9334.601 13803.768 100 b
# collapse 564.949 635.9025 762.329 693.9145 875.538 1238.783 100 a
# Large
microbenchmark(dplyr = arrange(data, desc(Country), Variable, Year),
collapse = roworder(data, -Country, Variable, Year), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 2599.315 2599.315 2603.1014 2603.1014 2606.8883 2606.8883 2 b
# collapse 352.841 352.841 369.0206 369.0206 385.2002 385.2002 2 a
## Grouping
# Small
microbenchmark(dplyr = group_by(GGDC10S, Country, Variable),
collapse = fgroup_by(GGDC10S, Country, Variable))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 3513.305 3729.735 4579.5328 4529.8565 5306.3265 7613.425 100 b
# collapse 394.929 422.819 530.7712 471.4605 638.5805 1044.665 100 a
# Large
microbenchmark(dplyr = group_by(data, Country, Variable),
collapse = fgroup_by(data, Country, Variable), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 76.81924 81.70832 103.52451 99.98751 118.78389 144.5373 10 a
# collapse 70.20764 72.23360 89.79366 89.60311 97.21185 127.2961 10 a
## Computing a new column
# Small
microbenchmark(dplyr = mutate(GGDC10S, NEW = AGR+1),
collapse = ftransform(GGDC10S, NEW = AGR+1))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 2587.788 2719.208 2903.6197 2796.4085 3011.7230 5338.456 100 b
# collapse 26.775 32.353 46.5795 39.9395 58.9045 107.993 100 a
# Large
microbenchmark(dplyr = mutate(data, NEW = AGR+1),
collapse = ftransform(data, NEW = AGR+1))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 6.727624 7.011661 9.542815 7.286995 8.217867 47.08444 100 b
# collapse 2.625273 3.824563 4.872467 3.944603 4.124218 32.16015 100 a
## All combined with pipes
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA") %>%
select(Country, Year, AGR:SUM) %>%
arrange(desc(Country), Year) %>%
mutate(NEW = AGR+1) %>%
group_by(Country),
collapse = fsubset(GGDC10S, Variable == "VA", Country, Year, AGR:SUM) %>%
roworder(-Country, Year) %>%
ftransform(NEW = AGR+1) %>%
fgroup_by(Country))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 15580.278 15944.1925 17116.7876 16124.03 16560.014 31040.515 100 b
# collapse 720.244 812.1705 893.6506 851.44 937.343 1629.695 100 a
# Large
microbenchmark(dplyr = filter(data, Variable == "VA") %>%
select(Country, Year, AGR:SUM) %>%
arrange(desc(Country), Year) %>%
mutate(NEW = AGR+1) %>%
group_by(Country),
collapse = fsubset(data, Variable == "VA", Country, Year, AGR:SUM) %>%
roworder(-Country, Year) %>%
ftransform(NEW = AGR+1) %>%
fgroup_by(Country), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 27.95115 30.444775 33.074153 30.625504 33.469439 51.665172 10 b
# collapse 7.30953 8.459064 8.501323 8.662999 8.839713 9.357359 10 a
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1931338 103.2 3670961 196.1 3670961 196.1
# Vcells 21389735 163.2 44616105 340.4 44616105 340.4
## Grouping the data
cgGGDC10S <- fgroup_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
gGGDC10S <- group_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
cgdata <- fgroup_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
gdata <- group_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
rm(data, GGDC10S)
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1948325 104.1 3670961 196.1 3670961 196.1
# Vcells 20490274 156.4 44616105 340.4 44616105 340.4
## Conversion of the grouping object: this extra time is incurred in all hybrid calls,
## i.e. when calling collapse functions on data grouped with dplyr::group_by
# Small
microbenchmark(GRP(gGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# GRP(gGGDC10S) 166.45 169.574 184.0549 171.806 183.4075 345.396 100
# Large
microbenchmark(GRP(gdata))
# Unit: milliseconds
# expr min lq mean median uq max neval
# GRP(gdata) 32.37256 33.89293 35.72334 35.43583 37.05905 47.21386 100
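In other words, collapse's fast functions can also be applied directly to data grouped with dplyr::group_by; the grouped_df method then converts the grouping object internally at the small extra cost timed above, as in this sketch:
# Hybrid call: fast function applied to dplyr-grouped data
fsum(gGGDC10S)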
## Sum
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sum, na.rm = TRUE),
collapse = fsum(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 8878.982 9221.254 9601.6576 9399.976 9712.572 17043.076 100 b
# collapse 236.065 253.469 353.7717 290.284 302.779 7129.693 100 a
# Large
microbenchmark(dplyr = summarise_all(gdata, sum, na.rm = TRUE),
collapse = fsum(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 596.3983 604.90515 626.68443 609.52493 613.27273 787.86207 10 b
# collapse 41.0311 41.24083 42.76254 41.91913 42.83572 48.89041 10 a
## Mean
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, mean.default, na.rm = TRUE),
collapse = fmean(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 11904.539 12605.3705 13905.4984 13037.784 13326.506 34580.594 100 b
# collapse 254.361 278.2355 314.8902 311.481 336.917 465.882 100 a
# Large
microbenchmark(dplyr = summarise_all(gdata, mean.default, na.rm = TRUE),
collapse = fmean(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 1443.86968 1589.43153 1735.38027 1705.82420 1913.18304 2079.88139 10 b
# collapse 44.94335 45.02055 53.25979 45.67229 48.87479 84.43222 10 a
## Median
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, median, na.rm = TRUE),
collapse = fmedian(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 54052.595 55337.5640 62975.4834 57875.1495 63560.7860 140828.949 100 b
# collapse 495.335 529.2495 577.9932 561.8255 599.5335 927.748 100 a
# Large
microbenchmark(dplyr = summarise_all(gdata, median, na.rm = TRUE),
collapse = fmedian(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 10023.36417 10023.36417 10179.66228 10179.66228 10335.96039 10335.96039 2 b
# collapse 84.26711 84.26711 86.39191 86.39191 88.51672 88.51672 2 a
## Standard Deviation
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sd, na.rm = TRUE),
collapse = fsd(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 24646.238 25476.704 26824.4733 25918.2655 26506.6425 34665.38 100 b
# collapse 431.521 464.097 505.1789 487.5255 516.7545 790.75 100 a
# Large
microbenchmark(dplyr = summarise_all(gdata, sd, na.rm = TRUE),
collapse = fsd(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 4841.99901 4841.99901 4995.99783 4995.99783 5149.99665 5149.99665 2 b
# collapse 85.60853 85.60853 86.28593 86.28593 86.96333 86.96333 2 a
## Maximum
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, max, na.rm = TRUE),
collapse = fmax(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 11642.145 12097.318 12557.020 12266.891 12554.94 19887.90 100 b
# collapse 179.392 190.994 229.117 236.957 241.42 529.25 100 a
# Large
microbenchmark(dplyr = summarise_all(gdata, max, na.rm = TRUE),
collapse = fmax(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 1094.55534 1108.52109 1151.61069 1127.70035 1190.50073 1250.25771 10 b
# collapse 25.58916 25.61192 27.19672 25.94169 27.21796 34.62477 10 a
## First Value
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, first),
collapse = ffirst(cgGGDC10S, na.rm = FALSE))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 27781.571 29151.1040 37713.0008 32005.525 39130.5325 175534.493 100 b
# collapse 58.459 68.2765 110.1163 116.917 124.7265 254.808 100 a
# Large
microbenchmark(dplyr = summarise_all(gdata, first),
collapse = ffirst(cgdata, na.rm = FALSE), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 5287.80853 5536.760512 6052.829610 5843.829293 6487.43323 7594.13408 10 b
# collapse 4.35939 4.782878 7.239961 5.775779 10.80186 11.97237 10 a
## Number of Distinct Values
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, n_distinct, na.rm = TRUE),
collapse = fNdistinct(cgGGDC10S))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 370.148289 389.433311 440.051493 404.357159 456.413466 911.286487 100 b
# collapse 1.295457 1.392739 1.714009 1.488458 1.670304 5.470991 100 a
# Large
microbenchmark(dplyr = summarise_all(gdata, n_distinct, na.rm = TRUE),
collapse = fNdistinct(cgdata), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 79475.8625 80011.9974 81132.3027 80408.21 81346.1701 84419.2694 5 b
# collapse 329.4514 333.4636 364.8668 361.65 390.5476 409.2213 5 a
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1951095 104.2 3670961 196.1 3670961 196.1
# Vcells 20496602 156.4 44616105 340.4 44616105 340.4
Below are some additional benchmarks for weighted aggregations and aggregations using the statistical mode, which cannot easily or efficiently be performed with dplyr.
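When a weight variable is passed, it is by default also aggregated and kept in the output; setting keep.w = FALSE omits the aggregated weights column (a minimal sketch):
# Weighted mean, dropping the aggregated weights column
fmean(cgGGDC10S, SUM, keep.w = FALSE)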
## Weighted Mean
# Small
microbenchmark(fmean(cgGGDC10S, SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# fmean(cgGGDC10S, SUM) 285.599 296.7545 310.1957 303.2255 307.9105 459.635 100
# Large
microbenchmark(fmean(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmean(cgdata, SUM) 49.36432 49.50623 53.60889 51.50207 55.776 63.3863 10
## Weighted Standard-Deviation
# Small
microbenchmark(fsd(cgGGDC10S, SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# fsd(cgGGDC10S, SUM) 432.86 452.9415 490.4216 455.618 467.8905 929.533 100
# Large
microbenchmark(fsd(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fsd(cgdata, SUM) 78.84118 79.66987 87.02884 81.03962 98.85895 105.6825 10
## Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgGGDC10S) 1.574807 1.608945 1.704473 1.657586 1.810425 1.997403 100
# Large
microbenchmark(fmode(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgdata) 363.2765 364.2016 382.6044 379.5054 397.086 420.8072 10
## Weighted Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S, SUM))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgGGDC10S, SUM) 1.762677 1.812434 1.912795 1.860182 1.937606 3.470911 100
# Large
microbenchmark(fmode(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgdata, SUM) 431.9522 462.3095 471.5851 466.6216 494.8882 499.4739 10
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1950542 104.2 3670961 196.1 3670961 196.1
# Vcells 20493197 156.4 44616105 340.4 44616105 340.4
## Replacing with group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, sum, na.rm = TRUE),
collapse = fsum(cgGGDC10S, TRA = "replace_fill"))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 10398.45 11077.196 13284.4561 12050.685 14680.867 28479.948 100 b
# collapse 337.81 387.789 436.1445 416.349 481.501 695.254 100 a
# Large
microbenchmark(dplyr = mutate_all(gdata, sum, na.rm = TRUE),
collapse = fsum(cgdata, TRA = "replace_fill"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 1123.4025 1196.13550 1244.5459 1227.63074 1268.49941 1411.6154 10 b
# collapse 63.9044 71.43794 111.7684 80.88968 91.88053 394.9043 10 a
## Dividing by group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x/sum(x, na.rm = TRUE)),
collapse = fsum(cgGGDC10S, TRA = "/"))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 9400.199 9757.197 10469.7067 9906.690 10207.683 21045.022 100 b
# collapse 547.545 566.065 609.8953 600.426 607.789 1363.732 100 a
# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x/sum(x, na.rm = TRUE)),
collapse = fsum(cgdata, TRA = "/"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 1169.3321 1582.4174 1700.6890 1648.3897 2024.6757 2089.681 10 b
# collapse 120.3382 123.1678 157.5447 138.8315 154.5461 340.741 10 a
## Centering
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x-mean.default(x, na.rm = TRUE)),
collapse = fwithin(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 12377.56 13086.425 17582.817 14728.392 20003.258 63137.297 100 b
# collapse 309.25 356.775 420.981 376.187 428.398 2113.874 100 a
# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x-mean.default(x, na.rm = TRUE)),
collapse = fwithin(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 2781.7480 3013.05630 3611.4475 3637.8264 4048.2085 4641.7812 10 b
# collapse 78.7497 88.16463 171.8093 143.1789 179.3352 474.1199 10 a
## Centering and Scaling (Standardizing)
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
collapse = fscale(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr 30876.743 31498.1420 34679.3403 31906.235 34905.463 63674.133 100 b
# collapse 501.136 516.3085 567.0824 549.331 584.138 960.771 100 a
# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
collapse = fscale(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 6815.0367 6815.0367 6898.6044 6898.6044 6982.1720 6982.1720 2 b
# collapse 104.2023 104.2023 124.3979 124.3979 144.5935 144.5935 2 a
## Lag
# Small
microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, dplyr::lag),
collapse_unordered = flag(cgGGDC10S),
dplyr_ordered = mutate_all(gGGDC10S, dplyr::lag, order_by = "Year"),
collapse_ordered = flag(cgGGDC10S, t = Year))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr_unordered 44826.879 46679.2510 50637.9331 47723.023 50856.1260 80238.831 100 b
# collapse_unordered 340.041 377.5255 449.8220 440.446 464.5440 874.198 100 a
# dplyr_ordered 124238.814 127791.8345 136933.8218 130585.120 136636.6790 191353.513 100 c
# collapse_ordered 310.588 348.9655 406.0808 375.071 415.2335 844.300 100 a
# Large
microbenchmark(dplyr_unordered = mutate_all(gdata, dplyr::lag),
collapse_unordered = flag(cgdata),
dplyr_ordered = mutate_all(gdata, dplyr::lag, order_by = "Year"),
collapse_ordered = flag(cgdata, t = Year), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr_unordered 8807.75622 8807.75622 8956.36750 8956.36750 9104.97878 9104.97878 2
# collapse_unordered 46.64087 46.64087 51.28162 51.28162 55.92237 55.92237 2
# dplyr_ordered 27296.92937 27296.92937 28132.59569 28132.59569 28968.26200 28968.26200 2
# collapse_ordered 83.39648 83.39648 94.62562 94.62562 105.85477 105.85477 2
# cld
# b
# a
# c
# a
## First-Difference (unordered)
# Small
microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, function(x) x - dplyr::lag(x)),
collapse_unordered = fdiff(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dplyr_unordered 59924.763 63774.092 71231.3617 66303.644 76502.85 109085.595 100 b
# collapse_unordered 374.402 431.744 524.6622 477.039 546.43 1134.361 100 a
# Large
microbenchmark(dplyr_unordered = mutate_all(gdata, function(x) x - dplyr::lag(x)),
collapse_unordered = fdiff(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr_unordered 13010.48101 13010.48101 13384.45341 13384.45341 13758.42582 13758.42582 2
# collapse_unordered 48.63739 48.63739 55.65105 55.65105 62.66472 62.66472 2
# cld
# b
# a
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1952921 104.3 4741504 253.3 4741504 253.3
# Vcells 21542121 164.4 62556832 477.3 76753397 585.6
Below are again some benchmarks for transformations not easily or efficiently performed with dplyr, such as centering on the overall mean, mean-preserving scaling, weighted scaling and centering, sequences of lags / leads, (iterated) panel-differences and growth rates.
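To clarify the first of these: centering on the overall mean group-demeans the data and then adds back the overall column mean, thus preserving the level of the data. A base R comparison on a small vector (a sketch, complete cases assumed):
x <- c(1, 2, 3, 10, 20, 30)
g <- rep(1:2, each = 3)
fwithin(x, g, mean = "overall.mean") # equals x - ave(x, g) + mean(x)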
# Centering on overall mean
microbenchmark(fwithin(cgdata, mean = "overall.mean"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fwithin(cgdata, mean = "overall.mean") 66.96164 89.12183 97.07003 98.7485 104.3518 124.258 10
# Weighted Centering
microbenchmark(fwithin(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fwithin(cgdata, SUM) 68.66005 89.6551 102.258 103.4307 114.6298 139.72 10
microbenchmark(fwithin(cgdata, SUM, mean = "overall.mean"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max
# fwithin(cgdata, SUM, mean = "overall.mean") 73.04979 94.59996 101.1462 99.11777 114.8663 139.778
# neval
# 10
# Weighted Scaling and Standardizing
microbenchmark(fsd(cgdata, SUM, TRA = "/"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fsd(cgdata, SUM, TRA = "/") 139.7937 171.8882 191.0733 178.0029 188.963 340.0654 10
microbenchmark(fscale(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fscale(cgdata, SUM) 95.23943 104.8632 132.5975 140.9041 146.9867 166.0495 10
# Sequence of lags and leads
microbenchmark(flag(cgdata, -1:1), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# flag(cgdata, -1:1) 79.7667 108.5273 221.3109 238.5866 295.766 352.4519 10
# Iterated difference
microbenchmark(fdiff(cgdata, 1, 2), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fdiff(cgdata, 1, 2) 68.04736 85.90751 95.14523 95.61562 111.1995 113.5704 10
# Growth Rate
microbenchmark(fgrowth(cgdata, 1), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fgrowth(cgdata, 1) 76.58898 96.65314 97.30958 98.39663 100.0393 112.0112 10
Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). “Patterns of Structural Change in Developing Countries.” In J. Weiss & M. Tribe (Eds.), Routledge Handbook of Industry and Development (pp. 65-83). Routledge.
Cochrane, D. & Orcutt, G. H. (1949). “Application of Least Squares Regression to Relationships Containing Auto-Correlated Error Terms”. Journal of the American Statistical Association. 44 (245): 32–61.
Prais, S. J. & Winsten, C. B. (1954). “Trend Estimators and Serial Correlation”. Cowles Commission Discussion Paper No. 383. Chicago.
Row-wise operations are not supported by TRA.