% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/df_stats.R
\name{df_stats}
\alias{df_stats}
\title{Calculate statistics on a variable}
\usage{
df_stats(formula, data, ..., drop = TRUE, fargs = list(), sep = "_",
  format = c("wide", "long"), groups = NULL, long_names = TRUE,
  nice_names = FALSE, na.action = "na.warn")
}
\arguments{
\item{formula}{A formula indicating which variables are to be used.
Semantics are approximately as in \code{\link[=lm]{lm()}} since \code{\link[stats:model.frame]{stats::model.frame()}}
is used to turn the formula into a data frame.  First, however, conditions and \code{groups}
are re-expressed in a form that \code{\link[stats:model.frame]{stats::model.frame()}} can interpret.
See details.}

\item{data}{A data frame or list containing the variables.}

\item{...}{Functions used to compute the statistics.  If this is empty,
a default set of summary statistics is used.  Functions used must accept
a vector of values and return either a (possibly named) single value,
a (possibly named) vector of values, or a data frame with one row.
Functions can be specified with character strings, names, or expressions
that look like function calls with the first argument missing.  The latter
option provides a convenient way to specify additional arguments.  See the
examples.
Note: If these arguments are named, those names will be used in the data
frame returned (see details).  Such names may not be among the names of the named
arguments of \code{df_stats()}.}

\item{drop}{A logical indicating whether combinations of the grouping
variables that do not occur in \code{data} should be dropped from the
result.}

\item{fargs}{Arguments passed to the functions in \code{...}.}

\item{sep}{A character string to separate components of names.  Set to \code{""} if
you don't want separation.}

\item{format}{One of \code{"long"} or \code{"wide"} indicating the desired shape of the
returned data frame.}

\item{groups}{An expression to be evaluated in \code{data} and defining (additional) groups.
This isn't necessary, since these can be placed into the formula, but it is provided
for similarity to other functions from the \pkg{mosaic} package.}

\item{long_names}{A logical indicating whether the default names should include the name
of the variable being summarized as well as the summarizing function name in the default
case when names are not derived from the names of the returned object or
an argument name.}

\item{nice_names}{A logical indicating whether \code{\link[=make.names]{make.names()}} should be
used to force names of the returned data frame to be syntactically valid.}

\item{na.action}{A function (or character string naming a function) that determines how NAs are treated.
Options include \code{"na.warn"} which removes missing data and emits a warning,
\code{"na.pass"} which includes all of the data,
\code{"na.omit"} or \code{"na.exclude"} which silently discard missing data,
and \code{"na.fail"} which fails if there is missing data.
See \code{\link[stats:na.fail]{stats::na.pass()}} and \code{\link[=na.warn]{na.warn()}} for details.
The default is \code{"na.warn"} unless no functions are specified in \code{...}, in which case
\code{"na.pass"} is used since the default function reports the number of missing values.}
}
\value{
A data frame.
}
\description{
Creates a data frame of statistics calculated on one variable, possibly for each
group formed by combinations of additional variables.
The resulting data frame has one column
for each of the statistics requested as well as columns for any grouping variables.
}
\details{
Use a one-sided formula to compute summary statistics for the right-hand-side
expression over the entire data.
Use a two-sided formula to compute summary statistics for the left-hand-side expression
for each combination of levels of the expressions occurring on the right hand side.
This is most useful when the left hand side is quantitative and each expression
on the right hand side has relatively few unique values.  A function like
\code{\link[mosaic:ntiles]{mosaic::ntiles()}} is often useful to create a few groups of roughly equal size
determined by ranges of a quantitative variable.  See the examples.

Note that unlike \code{dplyr::\link[dplyr]{summarise}()}, \code{df_stats()} ignores
any grouping defined in \code{data} if \code{data} is a grouped \code{tibble}.

Names of columns in the resulting data frame are determined as follows.  For named
arguments in \code{...}, the argument name is used.  For unnamed arguments, if the
statistic function returns a result with names, those names are used.  Otherwise, a name is
computed from the expression in \code{...} and the name of the variable being summarized.
For functions that produce multiple
outputs without names, consecutive integers are appended to the names.
See the examples.
}
\section{Cautions Regarding Formulas}{


The use of \code{|} to define groups is tricky because (a) \code{\link[stats:model.frame]{stats::model.frame()}}
doesn't handle this sort of thing and (b) \code{|} is also used for logical or.  The
current algorithm for handling this will turn the first occurrence of \code{|} into an attempt
to condition, so logical or cannot be used before conditioning in the formula.
If you need logical or, we suggest creating a new variable that contains the
results of evaluating the expression.

Similarly, addition (\code{+}) is used to separate grouping variables, not for
arithmetic.
}

\examples{
df_stats( ~ hp, data = mtcars)
# There are several ways to specify functions
df_stats( ~ hp, data = mtcars, mean, trimmed_mean = mean(trim = 0.1), "median",
  range, Q = quantile(c(0.25, 0.75)))
# force names to be syntactically valid
df_stats( ~ hp, data = mtcars, Q = quantile(c(0.25, 0.75)), nice_names = TRUE)
# shorter names
df_stats( ~ hp, data = mtcars, mean, trimmed_mean = mean(trim = 0.1), "median", range,
  long_names = FALSE)
# wide vs long format
df_stats( hp ~ cyl, data = mtcars, mean, median, range)
df_stats( hp ~ cyl, data = mtcars, mean, median, range, format = "long")
# More than one grouping variable -- 3 ways.
df_stats( hp ~ cyl + gear, data = mtcars, mean, median, range)
df_stats( hp ~ cyl | gear, data = mtcars, mean, median, range)
df_stats( hp ~ cyl, groups = gear, data = mtcars, mean, median, range)
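# A quantitative grouping variable can be binned first.  This is a sketch that
# assumes the mosaic package is installed and that ntiles(x, n) can be called
# inside the formula; the choice of wt and 3 groups is only for illustration.
if (require(mosaic)) {
  df_stats(hp ~ ntiles(wt, 3), data = mtcars, mean, median)
}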
# magrittr style piping is also supported
if(require(ggformula)) {
  mtcars \%>\%
  df_stats(hp ~ cyl)
  gf_violin(hp ~ cyl, data = mtcars, group = ~ cyl) \%>\%
  gf_point(mean_hp ~ cyl, data = df_stats(hp ~ cyl, data = mtcars, mean))
}

# can be used on a categorical response, too
if (require(mosaic)) {
  df_stats(sex ~ substance, data = HELPrct, table, prop_female = prop)
}
if (require(mosaic)) {
  df_stats(sex ~ substance, data = HELPrct, table, props)
}
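
# Logical or cannot be used before conditioning, so compute it as a new variable
# first.  The data frame mtcars2, the variable heavy_or_strong, and the cutoffs
# are hypothetical, chosen only for illustration.
mtcars2 <- transform(mtcars, heavy_or_strong = (wt > 3.5) | (hp > 150))
df_stats(mpg ~ heavy_or_strong | cyl, data = mtcars2, mean, median)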
}
