| Type: | Package |
| Title: | Obtain Unweighted Natural Strata that Balance Many Covariates |
| Version: | 2.0.1 |
| Description: | Natural strata can be used in observational studies to balance the distributions of many covariates across any number of treatment groups and any number of comparisons. These strata have proportional amounts of units within each stratum across the treatments, allowing for simple interpretation and aggregation across strata. Within each stratum, the units are chosen using randomized rounding of a linear program that balances many covariates. For more details, see Brumberg et al. (2022) <doi:10.1111/rssa.12848> and Brumberg et al.(2023) <doi:10.1093/jrsssc/qlad010>. To solve the linear program, the 'Gurobi' commercial optimization software is recommended, but not required. The 'gurobi' R package can be installed by following the instructions at https://docs.gurobi.com/projects/optimizer/en/current/reference/r/setup.html after claiming your free academic license at https://www.gurobi.com/academia/academic-program-and-licenses/. |
| URL: | https://github.com/kkbrum/natstrat, https://kkbrum.github.io/natstrat/, https://docs.gurobi.com/projects/optimizer/en/current/reference/r/setup.html |
| BugReports: | https://github.com/kkbrum/natstrat/issues |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.1 |
| Imports: | Rglpk, stats, plyr, pps, sampling, ggplot2, rlang, ramify, slam |
| Depends: | R (≥ 2.10), caret |
| Suggests: | knitr, rmarkdown, markdown, testthat (≥ 3.0.0), DT, stringr, covr, gurobi |
| VignetteBuilder: | knitr |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2026-02-03 21:54:23 UTC; katherine |
| Author: | Katherine Brumberg [aut, cre] |
| Maintainer: | Katherine Brumberg <kbrum@umich.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-03 23:00:22 UTC |
natstrat: Obtain Unweighted Natural Strata that Balance Many Covariates
Description
Natural strata can be used in observational studies to balance the distributions of many covariates across any number of treatment groups and any number of comparisons. These strata have proportional amounts of units within each stratum across the treatments, allowing for simple interpretation and aggregation across strata. Within each stratum, the units are chosen using randomized rounding of a linear program that balances many covariates. For more details, see Brumberg et al. (2022) <doi:10.1111/rssa.12848> and Brumberg et al.(2023) <doi:10.1093/jrsssc/qlad010>. To solve the linear program, the 'Gurobi' commercial optimization software is recommended, but not required. The 'gurobi' R package can be installed by following the instructions here after claiming your free academic license here.
Details
To achieve the desired ratio of control to treated units,
a subset of control units are
chosen using by optimizing the balance of many covariates using
either randomized rounding of a linear program or
an integer program. The main function in this package is optimize_controls().
To create the input constraints for this function, you should use
generate_constraints().
Author(s)
Maintainer: Katherine Brumberg kbrum@umich.edu
See Also
Useful links:
-
https://docs.gurobi.com/projects/optimizer/en/current/reference/r/setup.html
Report bugs at https://github.com/kkbrum/natstrat/issues
Linear program that selects which controls to use in order to optimize balance
Description
This linear program is used by optimize_controls() to choose which controls
to select.
Usage
balance_LP(
z,
X,
importances,
st,
st_vals,
S,
q_s,
N,
solver,
integer,
time_limit,
threads = 1,
weight_comp = 1
)
Arguments
z |
a factor with the |
X |
a matrix or data frame containing constraints in the columns. The number
of rows should equal the length of |
importances |
a vector with length equal to the number of constraints or columns
in |
st |
a stratum vector with the |
st_vals |
the unique stratum levels contained in |
S |
the number of unique stratum levels contained in |
q_s |
a named vector or matrix indicating how many control units are to be selected from each stratum.
If there is one control group and all treated units are desired, this can be a vector; otherwise,
this should have one row per treatment group, where the order of the rows matches the order of
the levels of |
N |
the total number of available controls in the data. |
solver |
a character stating which solver to use to run the linear program. Options are "Rglpk" (default) or "gurobi". You must have the 'gurobi' package installed to use the "gurobi" option. If available, this is the recommended solver. |
integer |
a logical stating whether to use a mixed integer programming solver
instead of randomized rounding. Default is |
time_limit |
numeric stating maximum amount of seconds for which the
program is allowed to run before aborting. Default is |
threads |
The maximum number of threads that should be used. This is only
applicable if |
Value
A list containing two elements:
lpdetailsThe output of either
gurobi()orRglpk_solve_LP(), except that ifgurobi()is used, the elementsobjvalandxare renamedoptimumandsolutionto be consistent with the output ofRglpk_solve_LP().oThe original output of either
gurobi()orRglpk_solve_LP().
Check covariate balance of the control and treated groups
Description
Reports standardized differences in means between the treated and control group before and after choosing a subset of controls. These differences are reported both across strata and within strata. This function can also generate love plots of the same quantities.
Usage
check_balance(
z,
X,
st,
selected,
treated = 1,
control = 0,
denom_variance = "treated",
plot = FALSE,
message = TRUE
)
Arguments
z |
a factor with the |
X |
a data frame containing the covariates in the columns over which balance is desired. The number
of rows should equal the length of |
st |
a stratum vector with the |
selected |
a boolean vector including whether each unit was selected as part of the treated and control
groups for analysis. Should be the same length as |
treated |
which treatment value should be considered the treated units. This
must be one of the values of |
control |
which treatment value should be considered the control units. This
must be one of the values of |
denom_variance |
character stating what variance to use in the standardization:
either the default "treated", meaning the standardization will use the
treated variance (across all strata), where the treated group is declared in
the |
plot |
a boolean denoting whether to generate love plots for the standardized differences. |
message |
a boolean denoting whether to print a message about the level of balance achieved |
Value
List containing:
- sd_across
matrix with one row per covariate and two columns: one for the standardized difference before a subset of controls were selected and one for after.
- sd_strata
matrix similar to
sd_across, but with separate standardized differences for each stratum for each covariate.- sd_strata_avg
matrix similar to
sd_across, but taking the average of the standardized differences within the strata, weighted by stratum size.- plot_across
ggplot object plotting
sd_across, only exists ifplot = TRUE.- plot_strata
a named list of ggplot objects plotting
sd_strata, one for each stratum, only exists ifplot = TRUE.- plot_strata_avg
ggplot object plotting
sd_strata_avg, only exists ifplot = TRUE.- plot_pair
ggplot object with two facets displaying
sd_acrossandsd_strata_avgwith one y-axis, only exists ifplot = TRUE.
Examples
data('nh0506')
# Create strata
age_cat <- cut(nh0506$age,
breaks = c(19, 39, 50, 85),
labels = c('< 40 years', '40 - 50 years', '> 50 years'))
strata <- age_cat : nh0506$sex
# Balance age, race, education, poverty ratio, and bmi both across and within the levels of strata
constraints <- generate_constraints(
balance_formulas = list(age + race + education + povertyr + bmi ~ 1 + strata),
z = nh0506$z,
data = nh0506)
# Choose one control for every treated unit in each stratum,
# balancing the covariates as described by the constraints
results <- optimize_controls(z = nh0506$z,
X = constraints$X,
st = strata,
importances = constraints$importances,
ratio = 1)
cov_data <- nh0506[, c('sex', 'age', 'race', 'education', 'povertyr', 'bmi')]
# Check balance
stand_diffs <- check_balance(z = nh0506$z,
X = cov_data,
st = strata,
selected = results$selected,
plot = TRUE)
Create matrix of balance constraints for linear program
Description
This creates the matrix of constraints that seek covariate balance for use in balance_LP()
which creates the linear program used by optimize_controls() to choose which controls
to select.
Usage
create_balance_matrices(X, z, N, nvars, kc2, q_s, return = "all")
Arguments
X |
a matrix or data frame containing constraints in the columns. The number
of rows should equal the length of |
z |
a factor with the |
N |
the total number of available controls in the data. |
q_s |
a named vector or matrix indicating how many units are to be selected from each stratum.
If there is one control group and all treated units are desired, this can be a vector; otherwise,
this should have one row per treatment group, where the order of the rows matches the order of
the levels of |
return |
one of "all", "A", or "X", denoting whether all matrices should be returned, or just A or just the X matrix blocks. |
Value
A list containing up to three elements:
AThe matrix of covariate values and +/- 1s that are used as coefficients for the unit indicators and the epsilons in order to set the epsilons equal to the covariate imbalances.
x_blkThe covariate values used as coefficients for the unit indicators in the objective function.
Create matrix of distances between strata
Description
Create a distance matrix between strata levels created from the interactions
of factors. Used as input to generate_qs().
Usage
create_dist_matrix(...)
Arguments
... |
any number of matrices that contain the distances between levels of a single stratifying factor. These should have both column and row names which correspond to the levels of the stratifying factor. |
Value
Matrix containing the distances between all levels of the factor of all interactions between the inputted factors. The row and column names correspond to the levels of the strata, formed by combining the level name of each stratifying factor separated with ':'.
Examples
data('nh0506')
age_cat <- cut(nh0506$age,
breaks = c(19, 39, 50, 85),
labels = c('< 40 years', '40 - 50 years', '> 50 years'))
age_dist <- matrix(data = c(0, 1, 2, 1, 0, 1, 2, 1, 0),
nrow = 3,
byrow = TRUE,
dimnames = list(levels(age_cat), levels(age_cat)))
sex_dist <- matrix(data = c(0, 1, 1, 0),
nrow = 2,
dimnames = list(levels(nh0506$sex), levels(nh0506$sex)))
strata_dist <- create_dist_matrix(age_dist, sex_dist)
Generate constraints to encourage covariate balance
Description
This function generates constraints that encourage covariate balance as specified.
The main inputs are formula like objects, where the left hand side indicates
the covariate to be balanced and the right hand side indicates the
groups within which to balance. The constraints are
weighted and standardized by stand() to be used in optimize_controls(). Missingness
indicators can also be added and weighted for any covariate that has NA values.
Usage
generate_constraints(
balance_formulas,
z,
data,
default_rhs = NULL,
weight_by_size = 0,
denom_variance = "treated",
treated = 1,
autogen_missing = 50
)
Arguments
balance_formulas |
a list of formulas where the left hand side represents the covariate to be balanced, and the terms on the right hand side represent the populations within which the covariate should be balanced. More information can be found in the details below. |
z |
a factor with the |
data |
a data frame containing the relevant covariates in the columns. The number
of rows should equal the length of |
default_rhs |
the list of |
weight_by_size |
numeric between 0 and 1 stating how to adjust constraints for the size of the population they represent. Default is 0, meaning imbalance within populations is viewed in absolute terms, not relative to the population size. The program may thus prioritize balancing the covariate in larger populations compared to smaller populations. A value of 1 means that imbalance will be measured relative to the population's size, not in absolute terms, implying that it is equally important to balance in every population. |
denom_variance |
character stating what variance to use in the standardization:
either the default "treated", meaning the standardization will use the
treated variance (across all strata), where the treated group is declared in
the |
treated |
which treatment value should be considered the treated group. This
must be one of the values of |
autogen_missing |
whether to automatically generate missingness constraints
and how heavily to prioritize them. Should be a numeric
or |
Value
A list with two named components:
Xa matrix with constraints as columns and the same number of rows as the inputs. The column names provide information about the constraints, including the covariate names and the factor and level to which it pertains.
importancesa named vector with names corresponding to the constraint names and values corresponding to how heavily that constraint should be prioritized, based on the information provided through
balance_formulas,weight_by_size, andautogen_missing.
Details
The balance_formulas argument can include formulas beyond those interpreted
by R to be formulas. Their interpretation is also different, as
explained below:
- Left hand side
The left hand side of the formula contains the covariate to be balanced. It can also be the sum of multiple covariates, in which case each term will be balanced individually according to the right hand side. Additionally, '.' on the left hand side will designate that all covariates in
datashould be balanced according to the designated or default right hand side (as usual, terms may be subtracted to remove them).- Right hand side
The right hand side should be the sum of factor, character, or boolean variables. The covariate of the left hand side will be balanced within each level of each term on the right hand side. The right hand side can also contain '.', meaning the covariate will be balanced across all levels of all categorical, character, or boolean variables found in
data(as usual, terms may be subtracted to remove them). In the most common case, the user will have one term on the right hand side consisting of the strata within which balance in desired.- Coefficients
The formulas can contain coefficients specifying how much to weight a certain set of constraints. Coefficients of the left hand side terms will weight all constraints generated for that covariate, and coefficients of the right hand side will weight the constraints generated for each level of that term.
- Intercept
The intercept term, 1, is automatically included on the right hand side of the formula, and designates that the covariate of the left hand side will be balanced across all control units. You may enter a different numeric > 0 that will signify how much to weight the constraint, or you may enter "- 1" or "+ 0" to remove the intercept and its associated constraint, as per usual.
Examples
data('nh0506')
# Create strata
age_cat <- cut(nh0506$age,
breaks = c(19, 39, 50, 85),
labels = c('< 40 years', '40 - 50 years', '> 50 years'))
strata <- age_cat : nh0506$sex
# Balance age, race, education, poverty ratio, and bmi both across and within the levels of strata
constraints <- generate_constraints(
balance_formulas = list(age + race + education + povertyr + bmi ~ 1 + strata),
z = nh0506$z,
data = nh0506)
# Balance age and race both across and within the levels of strata,
# with balance for race being prioritized twice as much as for age,
# and balance across strata prioritized twice as much as within.
# Balance education across and within strata,
# with balance within strata prioritized twice as much as across.
# Balance poverty ratio and bmi only within the levels of strata,
# as specified in the default_rhs argument
constraints <- generate_constraints(
balance_formulas = list(age + 2 * race ~ 2 + strata,
education ~ 1 + 2 * strata,
'povertyr',
'bmi'),
z = nh0506$z,
data = nh0506,
default_rhs = '0 + strata')
Calculate desired number of controls per stratum
Description
Figure out how many units to take from each stratum when some strata are deficient.
The result should be used as an input to optimize_controls().
Usage
generate_qs(
z,
st,
ratio,
treated = 1,
max_ratio = NULL,
max_extra_s = 5,
strata_dist = NULL
)
Arguments
z |
a factor with the |
st |
a stratum vector with the |
ratio |
a numeric or vector specifying the desired ratio of controls to 'treated' in
each stratum. If there is one control group and all treated units should be included,
this can be a numeric. Otherwise, this should be
a vector with one entry per treatment group, in the same order as the levels of
|
treated |
which treatment value should be considered the treated units. This
must be one of the values of |
max_ratio |
a numeric or vector specifying the maximum ratio to allow in a stratum to achieve
the overall |
max_extra_s |
single numeric or named vector or matrix with values corresponding to the maximum desired number
of extra controls to be chosen from each stratum to achieve the overall
|
strata_dist |
matrix with both row and column names with names corresponding
to the stratum values from |
Value
A named vector stating how many controls to take from each stratum.
Calculate standardized differences
Description
Calculate standardized differences in means between treated and control groups,
before and after refining the control group. Used within the check_balance function.
Usage
get_stand_diffs(
data,
z,
selected,
st = NULL,
ist = NULL,
treated = 1,
control = 0,
denom_variance = "treated"
)
Arguments
data |
A data frame with columns for which the standardized differences should be calculated. |
z |
a factor with the |
selected |
a boolean vector including whether each unit was selected as part of the treated and control
groups for analysis. Should be the same length as |
st |
a stratum vector with the |
ist |
The specific stratum for which the standardized differences should be calculated. |
treated |
which treatment value should be considered the treated units. This
must be one of the values of |
control |
which treatment value should be considered the control units. This
must be one of the values of |
denom_variance |
character stating what variance to use in the standardization:
either the default "treated", meaning the standardization will use the
treated variance (across all strata), where the treated group is declared in
the |
Value
data frame containing two columns, one for standardized differences before choosing a subset of controls, and one for after. The rows pertain to covariates.
Homocysteine and smoking example data
Description
NHANES 2005-2006 data on smoking and homocysteine levels in adults.
Usage
nh0506
Format
A data frame with 2928 rows and 11 variables:
- SEQN
NHANES identification number.
- z
smoking status treatment indicator: 1 = daily smoker, 0 = never smoker.
- sex
factor with levels "Male" and "Female".
- age
age in years, 20-85, with 85 recorded for everyone >= 85 years.
- race
factor with levels "Mexican American", "Other Hispanic", "Non-Hispanic White", "Non-Hispanic Black", and "Other Race - Including Multi-Racial".
- education
factor with levels "< Grade 9", "9-11th grade", "High school grad/GED", "Some college or AA degree", "College graduate or above".
- povertyr
ratio of family income to the poverty level, capped at 5 times poverty, has missing entries.
- bmi
BMI (body mass index), has missing entries.
- cigsperday30
cigarettes smoked per day, 0 for never smokers.
- cotinine
blood cotinine level, a biomarker of recent exposure to tobacco.
- homocysteine
homocysteine level.
Details
The code used to generate this data is documented in the source version of this package under 'data-raw/'. This data is composed of adults aged at least 20 years. Individuals who have smoked at least 100 cigarettes but do not now smoke at least 10 cigarettes daily are excluded. Individuals with missing homocysteine values, cotinine values, or smoking information are excluded. After filtering for all these criteria, one individual with unknown education remains and is also excluded. Missing values remain in the poverty ratio and bmi covariates.
Source
https://wwwn.cdc.gov/nchs/nhanes/ContinuousNhanes/Default.aspx?BeginYear=2005
Examples
data('nh0506')
Homocysteine and smoking example data with multiple control groups
Description
NHANES 2005-2006 data on smoking and homocysteine levels in adults, comparing daily smokers to never smokers and occasional smokers.
Usage
nh0506_3groups
Format
A data frame with 4457 rows and 11 variables:
- SEQN
NHANES identification number.
- z
smoking status treatment factor: 0 = never smoker, 1 = some smoking, 2 = daily smoker.
- sex
factor with levels "Male" and "Female".
- age
age in years, 20-85, with 85 recorded for everyone >= 85 years.
- race
factor with levels "Mexican American", "Other Hispanic", "Non-Hispanic White", "Non-Hispanic Black", and "Other Race - Including Multi-Racial".
- education
factor with levels "< Grade 9", "9-11th grade", "High school grad/GED", "Some college or AA degree", "College graduate or above".
- povertyr
ratio of family income to the poverty level, capped at 5 times poverty, has missing entries.
- bmi
BMI (body mass index), has missing entries.
- cigsperday30
cigarettes smoked per day, 0 for never smokers.
- cotinine
blood cotinine level, a biomarker of recent exposure to tobacco.
- homocysteine
homocysteine level.
Details
The code used to generate this data is documented in the source version of this package under 'data-raw/'. This data is composed of adults aged at least 20 years. Individuals who have smoked at least 100 cigarettes but do not now smoke at least 10 cigarettes daily are excluded. Individuals with missing homocysteine values, cotinine values, or smoking information are excluded. After filtering for all these criteria, five individuals with unknown education remain and are also excluded. Missing values remain in the poverty ratio and bmi covariates.
Source
https://wwwn.cdc.gov/nchs/nhanes/ContinuousNhanes/Default.aspx?BeginYear=2005
Examples
data('nh0506_3groups')
Select control units that optimize covariate balance
Description
Select control units within strata that optimize covariate balance. Uses randomized rounding of a linear program or a mixed integer linear program.
Usage
optimize_controls(
z,
X,
st,
importances = NULL,
treated = 1,
ratio = NULL,
q_s = NULL,
treated_star = NULL,
q_star_s = NULL,
weight_star = 1,
integer = FALSE,
solver = "Rglpk",
seed = NULL,
runs = 1,
time_limit = Inf,
threads = 1,
correct_sizes = TRUE,
low_memory = FALSE
)
Arguments
z |
a factor with the |
X |
a matrix or data frame containing constraints in the columns. The number
of rows should equal the length of |
st |
a stratum vector with the |
importances |
a vector with length equal to the number of constraints or columns
in |
treated |
which treatment value should be considered the treated units. This
must be one of the values of |
ratio |
a numeric or vector specifying the desired ratio of controls to 'treated' in
each stratum. If there is one control group and all treated units should be included,
this can be a numeric. Otherwise, this should be
a vector with one entry per treatment group, in the same order as the levels of
|
q_s |
a named vector or matrix indicating how many units are to be selected from each stratum.
If there is one control group and all treated units are desired, this can be a vector; otherwise,
this should have one row per treatment group, where the order of the rows matches the order of
the levels of |
treated_star |
which treatment value should be considered the treated units
for the supplemental comparison. This
must be one of the values of |
q_star_s |
a named vector or matrix,
indicating how many supplemental units are to be selected from each stratum.
The matrix should have one row per treatment group, where the order of the rows matches the order of
the levels of |
weight_star |
a numeric stating how much to prioritize balance between the supplemental units as compared to balance between the main units. If multiple supplemental comparisons are desired, this should be a vector with one entry per supplemental comparison. |
integer |
a logical stating whether to use a mixed integer programming solver
instead of randomized rounding. Default is |
solver |
a character stating which solver to use to run the linear program. Options are "Rglpk" (default) or "gurobi". You must have the 'gurobi' package installed to use the "gurobi" option. If available, this is the recommended solver. |
seed |
the seed to use when doing the randomized rounding of the linear program.
This will allow results to be reproduced if desired. The default is |
runs |
the number of times to run randomized rounding of the linear solution. The objective values of all runs will be reported, but the detailed results will only be reported for the run with the lowest objective value. The default is 1. |
time_limit |
numeric stating maximum amount of seconds for which the
program is allowed to run before aborting. Default is |
threads |
The maximum number of threads that should be used. This is only
applicable if |
correct_sizes |
boolean stating whether the desired sample sizes should
be exactly correct (if |
low_memory |
boolean stating whether some outputs should not be included
due to the scale of the problem being too large compared to memory space.
If |
Value
List containing:
objectiveobjective value of the randomized rounding or mixed integer linear program solution.
objective_wo_importancesobjective value of the randomized rounding or mixed integer linear program solution not weighted by the variable importances.
epsthe amount of imbalance obtained in each constraint from the linear program. The row names specify the covariate, the population of interest, and, if there are more than two comparison groups, which groups are being compared.
eps_starsame as
epsbut for the supplemental units instead of the units in the main comparison. If there are multiple supplemental comparisons, this is a list. If there are none, this isNULL.importancesthe importance of each on the balance constraints.
weight_starthe importance of balancing in the supplemental comparison relative to the main comparison. If there are multiple supplemental comparisons, this is a vector. If there are none, this is
NULL.selectedwhether each unit was selected for the main comparison.
selected_starwhether each unit was selected for the supplement. If there are multiple supplemental comparisons, this is a list. If there are none, this is
NULL.prthe linear program weight assigned to each unit for the main comparison.
pr_starthe linear program weight assigned to each unit for the supplement. If there are multiple supplemental comparisons, this is a list. If there are none, this is
NULL.rrdetailsA list containing:
seedthe seed used before commencing the random sampling.
run_objectivesthe objective values for each run of randomized rounding.
run_objectives_wo_importancesthe objective values for each run of randomized rounding, not scaled by constraint importances.
lpdetailsthe full return of the function
Rglpk_solve_LP()orgurobi()plus information about the epsilons and objective values for the linear program solution.
Examples
data('nh0506')
# Create strata
age_cat <- cut(nh0506$age,
breaks = c(19, 39, 50, 85),
labels = c('< 40 years', '40 - 50 years', '> 50 years'))
strata <- age_cat : nh0506$sex
# Balance age, race, education, poverty ratio, and bmi both across and within the levels of strata
constraints <- generate_constraints(
balance_formulas = list(age + race + education + povertyr + bmi ~ 1 + strata),
z = nh0506$z,
data = nh0506)
# Choose one control for every treated unit in each stratum,
# balancing the covariates as described by the constraints
results <- optimize_controls(z = nh0506$z,
X = constraints$X,
st = strata,
importances = constraints$importances,
ratio = 1)
# If you want to use a ratio that's not feasible,
# you can supply a vector of the desired number of controls per stratum, q_s,
# typically generated by creating a distance matrix between strata and using
# generate_qs():
## Not run:
age_dist <- matrix(data = c(0, 1, 2, 1, 0, 1, 2, 1, 0),
nrow = 3,
byrow = TRUE,
dimnames = list(levels(age_cat), levels(age_cat)))
sex_dist <- matrix(data = c(0, 1, 1, 0),
nrow = 2,
dimnames = list(levels(nh0506$sex), levels(nh0506$sex)))
strata_dist <- create_dist_matrix(age_dist, sex_dist)
qs <- generate_qs(z = nh0506$z,
st = strata,
ratio = 2.5,
max_ratio = 2.6,
max_extra_s = 0,
strata_dist = strata_dist)
results <- optimize_controls(z = nh0506$z,
X = constraints$X,
st = strata,
importances = constraints$importances,
q_s = qs)
## End(Not run)
# We can also have multiple treatment and control groups,
# as well as multiple simultaneous comparisons:
## Not run:
data('nh0506_3groups')
strata2 <- cut(nh0506_3groups$age, breaks = c(19, 39, 50, 85),
labels = c('< 40 years', '40 - 50 years', '> 50 years'))
constraints2 <- generate_constraints(
balance_formulas = list(age + race + education + povertyr + bmi + sex ~ 1 + strata2),
z = nh0506_3groups$z,
data = nh0506_3groups,
treated = 'daily smoker')
q_star_s <- matrix(c(rep(table(nh0506_3groups$z, strata2)['some smoking', ] -
table(nh0506_3groups$z, strata2)['daily smoker', ], 2),
rep(0, 3)), byrow = TRUE, nrow = 3,
dimnames = list(levels(nh0506_3groups$z), levels(strata2)))
results <- optimize_controls(z = nh0506_3groups$z,
X = constraints2$X,
importances = constraints2$importances,
st = strata2,
ratio = 1,
treated = 'daily smoker',
treated_star = 'some smoking',
q_star_s = q_star_s,
correct_sizes = FALSE)
## End(Not run)
Parse the nonstandard balance formulas
Description
This function takes nonstandard formulas as inputs and returns regular formulas
as well as lists of weights to be used in generate_constraints().
Usage
parse_formula(balance_formula, default_rhs = NULL, data = NULL)
Arguments
balance_formula |
a formula that may have multiple terms on the left hand side, "." on the left hand side, and coefficients, making it not a standard formula. |
default_rhs |
the list of |
data |
a data frame containing the relevant covariates in the columns. The number
of rows should equal the length of |
Value
A named list containing new_formula: the standard formula to be used
in generate_constraints, rhs_weights: a named list with the
coefficients of the terms on the right hand side, and lhs_weights:
a named list with coefficients of the terms on the left hand side.
Plot standardized differences in means
Description
Used within the check_balance function to plot the standardized differences calculated
in the format of Love (2002).
Usage
plot_stand_diffs(sds, type)
Arguments
sds |
Standardized differences generated with |
type |
One of the following, stating which standardized differences to plot:
|
Value
Either a ggplot object or a list of ggplot objects (if type is 'strata')
References
Love, T. E. (2002), "Displaying covariate balance after adjustment for selection bias", Joint Statistical Meetings, yumpu.com/en/document/read/41664623.
Solve the earthmover's distance problem
Description
Determine how many controls should be chosen from each stratum to minimize
the distance between the strata of the chosen controls and those that were desired.
Used within generate_qs().
Usage
presolve_EMD(S, desired_qs, max_s, strata_dist_flat)
Arguments
S |
the total number of strata. |
desired_qs |
a named vector containing the number of controls desired in each stratum with names matching the strata names. |
max_s |
a vector containing the maximum number of controls that should
be selected in each stratum. The order of the strata should match that of |
strata_dist_flat |
a flattened distance matrix between the strata. |
Value
A named vector with names identical to those of desired_qs and
elements containing the number of controls to select from the given stratum.
Process the desired sample sizes for optimize_controls()
Description
Processes the inputs to optimize_controls() to formulate sample size constraints.
Usage
process_qs(ratio, q_s, n_s, treated, k, group, st_vals, stratios)
Arguments
ratio |
a numeric or vector specifying the desired ratio of controls to 'treated' in
each stratum. If there is one control group and all treated units should be included,
this can be a numeric. Otherwise, this should be
a vector with one entry per treatment group, in the same order as the levels of
|
q_s |
a named vector or matrix indicating how many units are to be selected from each stratum.
If there is one control group and all treated units are desired, this can be a vector; otherwise,
this should have one row per treatment group, where the order of the rows matches the order of
the levels of |
treated |
which treatment value should be considered the treated units. This
must be one of the values of |
Value
A matrix of sample sizes for each treatment and stratum.
Sample integer solution from linear programming solution with correct sample sizes
Description
The linear programming solution of balance_LP() that is used
within optimize_controls() sometimes selects fractional units.
Here, we select any unit the linear programming solution chose with coefficient 1.
Then, we select the remaining required number of units from those that have
fractional solutions by sampling with probabilities equal to the linear
programming solution and fixed sample size. Used within optimize_controls()
if correct_sizes = TRUE.
Usage
randomized_rounding(o, N, st, st_vals, S, z)
Arguments
o |
linear programming results, as found in the 'o' element of the
returned list from |
N |
the total number of available controls in the data. |
st |
a stratum vector with the |
st_vals |
the unique stratum levels contained in |
S |
the number of unique stratum levels contained in |
z |
a factor with the |
Value
Dataframe with two columns: pr, which contains
the coefficient determined for that unit from the linear programming
solution, and select, a boolean vector stating whether that
unit was selected for inclusion by randomized rounding.
Sample integer solution from linear programming solution with sample sizes correct in expectation
Description
The linear programming solution of balance_LP() that is used
within optimize_controls() sometimes selects fractional control units.
Here, we select any unit the linear programming solution chose with coefficient 1.
Then, we select sample each unit with a fractional solution with
probability equal to the linear programming solution. The total sample
size is then correct in expectation. Used within optimize_controls()
if correct_sizes = FALSE.
Usage
randomized_rounding_expectation(o, N, n_comp)
Arguments
o |
linear programming results, as found in the 'o' element of the
returned list from |
N |
the total number of available controls in the data. |
Value
Dataframe with two columns: pr, which contains
the coefficient determined for that unit from the linear programming
solution, and select, a boolean vector stating whether that
unit was selected for inclusion by randomized rounding.
Standardize covariate vector for balance constraint
Description
This function is used by generate_constraints() to standardize
covariate vectors to become balance constraints.
The function divides the covariate values by the treated or average group
standard deviation (across strata).
Usage
stand(z, x, denom_variance = "treated", treated = 1, autogen_missing = 50)
Arguments
z |
a factor with the |
x |
a covariate vector with |
denom_variance |
character stating what variance to use in the standardization:
either the default "treated", meaning the standardization will use the
treated variance (across all strata), where the treated group is declared in
the |
treated |
which treatment value should be considered the treated units. This
must be one of the values of |
autogen_missing |
whether to automatically generate missingness constraint
and how heavily to prioritize it. Should be a numeric
or |
Value
A list with two components:
- covariate
a balance constraint for the standardized covariate values of all unites.
- missingness
a corresponding balance constraint for the rate of missingness if
autogen_missingnotNULL, otherwiseNULL.
Verify the inputs to optimize_controls()
Description
Makes sure that the inputs to optimize_controls() are in the correct
format and feasible.
Usage
verify_inputs(X, importances, ratio, q_s, st, z, treated, integer, solver)
Arguments
X |
a matrix or data frame containing constraints in the columns. The number
of rows should equal the length of |
importances |
a vector with length equal to the number of constraints or columns
in |
ratio |
a numeric or vector specifying the desired ratio of controls to 'treated' in
each stratum. If there is one control group and all treated units should be included,
this can be a numeric. Otherwise, this should be
a vector with one entry per treatment group, in the same order as the levels of
|
q_s |
a named vector or matrix indicating how many units are to be selected from each stratum.
If there is one control group and all treated units are desired, this can be a vector; otherwise,
this should have one row per treatment group, where the order of the rows matches the order of
the levels of |
st |
a stratum vector with the |
z |
a factor with the |
treated |
which treatment value should be considered the treated units. This
must be one of the values of |
integer |
a logical stating whether to use a mixed integer programming solver
instead of randomized rounding. Default is |
solver |
a character stating which solver to use to run the linear program. Options are "Rglpk" (default) or "gurobi". You must have the 'gurobi' package installed to use the "gurobi" option. If available, this is the recommended solver. |
Value
No return value. If there is a problem with the inputs to optimize_controls(),
an error is raised.
Verify the inputs to the earthmover's distance problem
Description
Check that the ratio, strata, and treated indicator provided to generate_qs()
are in the correct forms and that the desired ratio is feasible
across the population.
Usage
verify_inputs_EMD(ratio, st, z, treated = 1)
Arguments
ratio |
a numeric or vector specifying the desired ratio of controls to 'treated' in
each stratum. If there is one control group and all treated units should be included,
this can be a numeric. Otherwise, this should be
a vector with one entry per treatment group, in the same order as the levels of
|
st |
a stratum vector with the |
z |
a factor with the |
treated |
which treatment value should be considered the treated units. This
must be one of the values of |
Value
No return value. If there is a problem with the inputs to generate_qs(),
an error is raised.
Verify the inputs for supplemental comparisons to optimize_controls()
Description
Makes sure that the inputs for supplemental comparisons to optimize_controls() are in the correct
format and feasible.
Usage
verify_multi_comp_inputs(
q_s,
q_star_s,
n_s,
treated,
treated_star,
weight_star,
group,
correct_sizes
)
Arguments
q_s |
a named vector or matrix indicating how many units are to be selected from each stratum.
If there is one control group and all treated units are desired, this can be a vector; otherwise,
this should have one row per treatment group, where the order of the rows matches the order of
the levels of |
q_star_s |
a named vector or matrix,
indicating how many supplemental units are to be selected from each stratum.
The matrix should have one row per treatment group, where the order of the rows matches the order of
the levels of |
treated |
which treatment value should be considered the treated units. This
must be one of the values of |
treated_star |
which treatment value should be considered the treated units
for the supplemental comparison. This
must be one of the values of |
weight_star |
a numeric stating how much to prioritize balance between the supplemental units as compared to balance between the main units. If multiple supplemental comparisons are desired, this should be a vector with one entry per supplemental comparison. |
correct_sizes |
boolean stating whether the desired sample sizes should
be exactly correct (if |
Value
No return value. If there is a problem with the inputs to optimize_controls(),
an error is raised.