% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/ctbi.R
\name{ctbi}
\alias{ctbi}
\title{ctbi}
\usage{
ctbi(
  data.input,
  bin.side,
  bin.period,
  bin.center = NULL,
  bin.FUN = "mean",
  bin.max.f.NA = 0.2,
  SCI.min = 0.6,
  coeff.outlier = "auto",
  ylim = c(-Inf, +Inf)
)
}
\arguments{
\item{data.input}{two columns data.frame (or data.table) with the first column being the time component (POSIXct, Date or numeric) and the second column the value (numeric)}

\item{bin.side}{one side of any bin (same class as the time component)}

\item{bin.period}{time interval between two sides of a bin. If the time component x.t of data.input is numeric, \code{bin.period} is numeric. If x.t is POSIXct or Date, \code{bin.period} = 'k units', with k an integer and units = (seconds, minutes, hours, days, weeks, half-months, months, years, decades, centuries, millenaries)}

\item{bin.center}{if \code{bin.side} is not specified, one center of a bin (same class as the time component)}

\item{bin.FUN}{character ('mean', 'median' or 'sum') that defines the aggregating operator}

\item{bin.max.f.NA}{numeric between 0 and 1 that specifies the maximum fraction of missing values for a bin to be accepted. The minimum number of non-NA points for a bin to be accepted is bin.size.min.accepted = bin.size*(1-\code{bin.max.f.NA}) with bin.size the number of points per bin}

\item{SCI.min}{numeric between 0 and 1 that is compared to the Stacked Cycles Index (SCI). If SCI > \code{SCI.min}, missing values are imputed in accepted bins with the sum of the long-term and cyclic components. If \code{SCI.min} = \code{NA}, no values are imputed}

\item{coeff.outlier}{one of \code{coeff.outlier} = 'auto' (default value), \code{coeff.outlier} = 'gaussian', \code{coeff.outlier} = c(A,B,C) or \code{coeff.outlier} = \code{NA}. If \code{coeff.outlier} = 'auto', C = 36 and the coefficients A and B are calculated on \eqn{m_{*}}. If \code{coeff.outlier} = 'gaussian', \code{coeff.outlier} = c(0.08,2,36), adapted to the Gaussian distribution. If \code{coeff.outlier} = \code{NA}, no outliers are flagged}

\item{ylim}{numeric vector of length 2 that defines the range of possible values. Values strictly below ylim[1] or strictly above ylim[2] are set to NA. Values equal to ylim[1] or ylim[2] are discarded from the residuals}
}
\value{
A list that contains:

\bold{data0}, the raw dataset (same class as \code{data.input}), with 9 columns: (i) time; (ii) outlier-free and imputed data; (iii) index.bin: index of the bins associated with each data points (the index is negative if the bin is rejected); (iv) long.term: long-term trend; (v) cycle: cyclic component; (vi) residuals: residuals including the outliers; (vii) outliers: quarantined outliers; (viii) imputed: value of the imputed data points; (ix) time.bin: relative position of the data points in their bins, between 0 and 1

\bold{data1}, the aggregated dataset (same class as \code{data.input}), with 10 columns: (i) aggregated time (center of the bins); (ii) aggregated data; (iii) index.bin: index of the bin (negative value if the bin is rejected); (iv) bin.start: start of the bin; (v) bin.end: end of the bin; (vi) n.points: number of points per bin (including \code{NA} values); (vii) n.NA: number of NA values per bin, originally; (viii) n.outliers: number of outliers per bin; (ix) n.imputed: number of imputed points per bin; (x) variability associated with the aggregation (standard deviation for the mean, MAD for the median and nothing for the sum)

\bold{mean.cycle}, a dataset (same class as \code{data.input}) with bin.size rows and 4 columns: (i) generic.time.bin1: time of the first bin; (ii) mean: the mean stack of detrended data; (iii) sd: the standard deviation on the mean; (iv) time.bin: relative position of the data points in the bin, between 0 and 1

\bold{summary.bin}, a vector that contains bin.size (median number of points in non-empty bins), bin.size.min.accepted (minimum number of points for a bin to be accepted) and SCI

\bold{summary.outlier}, a vector that contains A, B, C, \eqn{m_{*}}, the size of the residuals (n), and the lower and upper outlier threshold
}
\description{
\bold{Please cite} the following companion paper if you're using the \code{ctbi} package: Ritter, F.: Technical note: A procedure to clean, decompose, and aggregate time series, Hydrol. Earth Syst. Sci., 27, 349–361, https://doi.org/10.5194/hess-27-349-2023, 2023.

The goal of \code{ctbi} is to \bold{clean}, \bold{decompose}, \bold{impute} and \bold{aggregate} univariate time series. \code{ctbi} stands for \emph{Cyclic/Trend decomposition using Bin Interpolation}: the time series is divided into a sequence of non-overlapping bins (inputs: \code{bin.side} or \code{bin.center}, and \code{bin.period}). Bins with enough data (input: \code{bin.max.f.NA}) are \emph{accepted}, and otherwise \emph{rejected} (their values are set to \code{NA}). The \bold{long-term trend} is a linear interpolation of the mean values between successive bins. The \bold{cyclic component} is the mean stack of detrended data within all accepted bins.

Outliers present in the residuals are flagged using an enhanced box plot rule (called \bold{Logbox}, input: \code{coeff.outlier}) that is adapted to non-Gaussian data and keeps the type I error at \eqn{\frac{0.1}{\sqrt{n}}} \% (percentage of erroneously flagged outliers). \bold{Logbox} replaces the original \eqn{\alpha = 1.5} constant of the box plot rule with \eqn{\alpha = A \times \log(n)+B+\frac{C}{n}}. The variable \eqn{n \geq 9} is the sample size, \eqn{C = 36} corrects biases emerging in small samples, and \eqn{A} and \eqn{B} are automatically calculated on a predictor of the maximum tail weight (\eqn{m_{*}}).

The strength of the \bold{cyclic pattern} within each bin is quantified by a new metric, the \bold{Stacked Cycles Index} defined as \eqn{SCI = 1 - \frac{SS_{res}}{SS_{tot}} - \frac{1}{N_{bin}}}. The variable \eqn{SS_{tot}} is the sum of the squared detrended data, \eqn{SS_{res}} is the sum of the squared detrended & deseasonalized data, and \eqn{N_{bin}} is the number of accepted bins. A value of \eqn{SCI \leq 0} is associated  with no cyclicity, while \eqn{SCI = 1} is associated with a perfectly cyclic signal. Data can be imputed if \eqn{SCI_{min} \leq SCI} (input: \code{SCI.min}). Finally, data are aggregated in each bin (input: \code{bin.FUN}).

Important functions of the package: \code{ctbi}, \code{ctbi.outlier} (flag outliers in univariate datasets with the \bold{Logbox} method) and \code{ctbi.plot} (plot the time series).
}
\examples{
# example of the contaminated sunspot data
example1 <- data.frame(year = 1700:1988,sunspot = as.numeric(sunspot.year))
example1[sample(1:289,30),'sunspot'] <- NA # contaminate data with missing values
example1[c(5,30,50),'sunspot'] <- c(-50,300,400) # contaminate data with outliers
example1 <- example1[-(70:100),] # create gap in the data
bin.period <- 11 # aggregation performed every 11 years (the year is numeric here)
bin.side <- 1989 # give one side of a bin
bin.FUN <- 'mean'
bin.max.f.NA <- 0.2 # maximum of 20\% of missing data per bin
ylim <- c(0,Inf) # negative values are impossible

list.main <- ctbi(example1,bin.period=bin.period,
                       bin.side=bin.side,bin.FUN=bin.FUN,
                       ylim=ylim,bin.max.f.NA=bin.max.f.NA)
data0.example1 <- list.main$data0 # cleaned raw dataset
data1.example1 <- list.main$data1 # aggregated dataset.
mean.cycle.example1 <- list.main$mean.cycle # this data set shows a moderate seasonality
summary.bin.example1 <- list.main$summary.bin # confirmed with SCI = 0.50
summary.outlier.example1 <- list.main$summary.outlier

plot(mean.cycle.example1[,'generic.time.bin1'],
     mean.cycle.example1[,'mean'],type='l',ylim=c(-80,80),
     ylab='sunspot cycle',
     xlab='11 years window')
lines(mean.cycle.example1[,'generic.time.bin1'],
      mean.cycle.example1[,'mean']+mean.cycle.example1[,'sd'],type='l',lty=2)
lines(mean.cycle.example1[,'generic.time.bin1'],
      mean.cycle.example1[,'mean']-mean.cycle.example1[,'sd'],type='l',lty=2)
title(paste0('mean cycle (moderate cyclicity: SCI = ',summary.bin.example1['SCI'],')'))
# plot tool:
ctbi.plot(list.main,show.n.bin=10)

# example of the beaver data
temp.beaver <- beaver1[,'temp']
t.char <- as.character(beaver1[,'time'])
minutes <- substr(t.char,nchar(t.char)-1,nchar(t.char))
hours <- substr(t.char,nchar(t.char)-3,nchar(t.char)-2)
hours[hours==""] <- '0'
days <- c(rep(12,91),rep(13,23))
time.beaver <- as.POSIXct(paste0('2000-12-',days,' ',hours,':',minutes,':00'),tz='UTC')
example2 <- data.frame(time=time.beaver,temp=temp.beaver)

bin.period <- '1 hour' # aggregation performed every hour
bin.side <- as.POSIXct('2000-12-12 00:00:00',tz='UTC') # give one side of a bin
bin.FUN <- 'mean' # aggregation operator
bin.max.f.NA <- 0.2 # maximum of 20\% of missing data per bin
ylim <- c(-Inf,Inf)
list.main <- ctbi(example2,bin.period=bin.period,
                 bin.side=bin.side,bin.FUN=bin.FUN,
                 ylim=ylim,bin.max.f.NA=bin.max.f.NA)
data0.example2 <- list.main$data0 # cleaned raw dataset
data1.example2 <- list.main$data1 # aggregated dataset.
lower.threshold <- list.main$summary.outlier['lower.outlier.threshold']
upper.threshold <- list.main$summary.outlier['upper.outlier.threshold']
hist(data0.example2[,'residuals'],xlim=c(-0.5,0.5),30,main='beaver residuals')
abline(v=c(lower.threshold,upper.threshold),col='red',lwd=2) # show the histogram of the residuals
}
