% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/pairwise-comparisons.R
\name{pairwise_comparison}
\alias{pairwise_comparison}
\title{Do Pairwise Comparisons of Scores}
\usage{
pairwise_comparison(
  scores,
  by = c("model"),
  metric = "auto",
  baseline = NULL,
  ...
)
}
\arguments{
\item{scores}{A data.table of scores as produced by \code{\link[=score]{score()}}.}

\item{by}{character vector with names of columns present in the input
data.frame. \code{by} determines how pairwise comparisons will be computed.
You will get a relative skill score for every grouping level determined in
\code{by}. If, for example, \code{by = c("model", "location")}, then you will get a
separate relative skill score for every model in every location. Internally,
the data.frame will be split according to \code{by} (but removing "model" before
splitting) and the pairwise comparisons will be computed separately for the
split data.frames.}

\item{metric}{A character vector of length one with the metric to do the
comparison on. The default is "auto", meaning that either "interval_score",
"crps", or "brier_score" will be selected where available.
See \code{\link[=available_metrics]{available_metrics()}} for available metrics.}

\item{baseline}{character vector of length one that denotes the baseline
model against which to compare other models. If \code{NULL} (the default),
no baseline is used.}

\item{...}{additional arguments for the comparison between two models. See
\code{\link[=compare_two_models]{compare_two_models()}} for more information.}
}
\value{
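A data.table with the results of the pairwise comparisons, including
the mean score ratios and the relative scores. The output can be visualised
with \code{plot_pairwise_comparison()}.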
}
\description{
Compute relative scores between different models using pairwise
comparisons. Pairwise comparisons are a kind of tournament in which all
combinations of two models are compared against each other based on the
overlapping set of available forecasts common to both models.
For each pair, the ratio of the mean scores of the two models is computed.
The relative score of a model is then the geometric mean of all mean score
ratios that involve that model. When a baseline is provided, that
baseline is excluded from the relative scores for individual models
(which therefore differ slightly from relative scores without a baseline)
and all relative scores are scaled by (i.e. divided by) the relative score of
the baseline model.
Usually, the function input should be unsummarised scores as
produced by \code{\link[=score]{score()}}.
Note that the function internally infers the \emph{unit of a single forecast} by
determining all columns in the input that do not correspond to metrics
computed by \code{\link[=score]{score()}}. Adding unrelated columns will change results in an
unpredictable way.

The code for the pairwise comparisons is inspired by an implementation by
Johannes Bracher.
The implementation of the permutation test follows the function
\code{permutationTest} from the \code{surveillance} package by Michael Höhle,
Andrea Riebler and Michaela Paul.
}
\examples{
data.table::setDTthreads(1) # only needed to avoid issues on CRAN

scores <- score(example_quantile)
pairwise <- pairwise_comparison(scores, by = "target_type")

library(ggplot2)
plot_pairwise_comparison(pairwise, type = "mean_scores_ratio") +
  facet_wrap(~target_type)
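
# A minimal, illustrative sketch (not the package internals) of how a relative
# score follows from mean score ratios. The mean scores in `toy_means` are
# made up, and all three hypothetical models are assumed to have forecast the
# same targets, so the overlapping set is the full set. Including each
# model's ratio with itself (equal to 1) in the geometric mean is an
# assumption of this sketch.
toy_means <- c(A = 1, B = 2, C = 4)
# ratio of mean scores for every ordered pair of models (rows / columns)
toy_ratios <- outer(toy_means, toy_means, "/")
# geometric mean of all ratios involving each model
toy_relative_score <- exp(rowMeans(log(toy_ratios)))
toy_relative_score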
}
\author{
Nikos Bosse \email{nikosbosse@gmail.com}

Johannes Bracher \email{johannes.bracher@kit.edu}
}
\keyword{scoring}
