% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/detect_outliers.R
\name{detect_outliers}
\alias{detect_outliers}
\title{Detects unreliable outliers in univariate time series data using
model-based clustering}
\usage{
detect_outliers(
  data,
  S,
  proba = 0.5,
  share = NULL,
  repetitions = 10,
  decomp = T,
  PComp = F,
  detection.parameter = 1,
  out.par = 2,
  max.cluster = 9,
  G = NULL,
  modelName = "VVV",
  feat.inf = F,
  ...
)
}
\arguments{
\item{data}{a one-dimensional matrix or data frame without missing data;
each row is an observation.}

\item{S}{a numeric vector with one value for each seasonality present in the data.}

\item{proba}{denotes the probability threshold, between 0 and 1, from which an
observation is considered outlying data. By default set to 0.5. Lowering
proba increases the number of detected outliers.}

\item{share}{controls the size of the subsample used for estimation.
By default set to pmin(2*round(length(data)^(sqrt(2)/2)),
length(data))/length(data) (ranging from 0 to 1).
Together with the repetitions parameter, it controls the
robustness and computational time of the method.}

\item{repetitions}{denotes the number of times the clustering is repeated.
By default set to 10. Controls the robustness and computational time
of the method.}

\item{decomp}{logical value indicating whether seasonal decomposition is performed on the
original time series as a preprocessing step before feature modelling. By default set to TRUE.}

\item{PComp}{logical value indicating whether the principal components of the modelled
feature matrix are used as features. By default set to FALSE.}

\item{detection.parameter}{denotes a parameter that regulates the
detection sensitivity. By default set to 1. It is assumed that the outlier cluster
follows a (multivariate) Gaussian distribution parameterized by the sample mean and an
inflated sample covariance matrix of the feature space. The covariance matrix is inflated
by the factor detection.parameter * (2 * log(length(data)))^2.
As the parameter increases, only more extreme outliers are detected.}

\item{out.par}{controls the number of artificially produced outliers that allow an
outlier cluster to form. By default out.par is set to 2. Increasing it corresponds to
assuming a larger share of outliers in the data. A priori it is assumed that
out.par * ceiling(sqrt(nrow(data.original))) observations are outlying observations.}

\item{max.cluster}{a single numeric value controlling the maximum
number of allowed clusters. By default set to 9.}

\item{G}{denotes the optimal number of clusters, limited by the
max.cluster parameter. By default G is set to NULL and is automatically
calculated based on the BIC.}

\item{modelName}{denotes the geometric features of the covariance matrix,
e.g. "EII", "VII", "EEI", "EVI", "VEI", "VVI", etc. By default modelName
is set to "VVV". The help file for \link[mclust]{mclustModelNames} describes
the available models. Choice of modelName influences the fit to the data as well as
the computational time.}

\item{feat.inf}{logical value indicating whether influential features or feature combinations
should be computed. By default set to FALSE.}

\item{...}{additional arguments for the \link[mclust]{Mclust} function.}
}
\value{
a list containing the following elements:
\item{data}{numeric vector containing the original data.}
\item{outlier.pos}{a logical vector indicating the position of each outlier.}
\item{outlier.probs}{a vector containing the probability of each observation
being outlying data.}
\item{Repetitions}{provides a list for each repetition containing the estimated model,
the outlier cluster, the probabilities of each observation belonging to the estimated
clusters, the outlier position, the influence of each feature or feature combination
on the identified outlying data, and the corresponding probabilities
after a shift to the feature mean of each considered outlier, as well as the
subset of the extended feature matrix used for estimation (including artificially introduced
outliers).
}
\item{features}{the feature matrix. Each column is a feature.}
\item{inf.feature.combinations}{a list containing the features or feature combinations
that caused the assignment to the outlier cluster.}
\item{feature.inf.tab}{a matrix containing all possible feature combinations.}
\item{PC}{an object of class "princomp" containing the principal component analysis
of the feature matrix.}
}
\description{
This function applies finite mixture modelling to compute
the probability of each observation being outlying data
in a univariate time series.
Using the \link[mclust]{Mclust} function, the data is
assigned to G clusters, one of which is modelled as an outlier cluster.
The clustering process is based on features that are modelled to
differentiate normal from outlying observations. Besides computing
the probability of each observation being outlying data,
the specific cause in terms of the responsible feature or feature combination
can also be provided.
}
\details{
The detection of outliers is addressed by
model-based clustering based on parameterized finite Gaussian mixture models.
For cluster estimation the \link[mclust]{Mclust} function is applied.
Models are estimated by the EM algorithm initialized by hierarchical
model-based agglomerative clustering. The optimal model is selected
according to BIC.
The following features, derived from the input data, are used in the clustering process:
\describe{
\item{org.series}{denotes the scaled and potentially decomposed original time series.}
\item{seasonality}{denotes deterministic seasonalities based on S.}
\item{gradient}{denotes the summation of the two-sided gradient of the org.series.}
\item{abs.gradient}{denotes the summation of the absolute two-sided gradient of
org.series.}
\item{rel.gradient}{denotes the summation of the two-sided absolute gradient of the
org.series, with sign based on the left-sided gradient, in relation to the
rolling mean absolute deviation based on the most relevant seasonality in S.}
\item{abs.seas.grad}{denotes the summation of the absolute two-sided seasonal gradient of
org.series based on the seasonalities S.}
}
In case PComp = TRUE, the features correspond to the principal components of the
introduced feature space.
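
As a sketch of the covariance inflation described for \code{detection.parameter},
with symbols introduced here only for illustration: let \eqn{n} be
\code{length(data)}, \eqn{c} the value of \code{detection.parameter}, and
\eqn{\hat{\Sigma}}{Sigma_hat} the sample covariance matrix of the feature space.
The covariance matrix assumed for the outlier cluster is then
\deqn{\Sigma_{out} = c \, (2 \log n)^2 \, \hat{\Sigma}}{Sigma_out = c * (2 * log(n))^2 * Sigma_hat}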
}
\examples{
\dontrun{
set.seed(1)
id <- 14000:17000
# Model missing values
modelmd <- model_missing_data(data = GBload[id, -1], tau = 0.5,
                             S = c(48, 336), indices.to.fix = seq_len(nrow(GBload[id, ])),
                             consider.as.missing = 0, min.val = 0)
# Impute missing values
data.imputed <- impute_modelled_data(modelmd)

# Detect outliers
system.time(
 o.ident <- detect_outliers(data = data.imputed, S = c(48, 336))
)

# Plot of identified outliers in time series
outlier.vector <- rep(FALSE, length(data.imputed))
outlier.vector[o.ident$outlier.pos] <- TRUE
plot(data.imputed, type = "o", col = 1 + outlier.vector,
    pch = 1 + 18 * outlier.vector)

# Table of identified outliers and their probabilities of being outlying data
df <- data.frame(o.ident$outlier.pos,
                 unlist(o.ident$outlier.probs)[o.ident$outlier.pos])
colnames(df) <- c("Outlier position", "Probability of being outlying data")
df

# Plot of feature matrix
plot.ts(o.ident$features, type = "o",
       col = 1 + outlier.vector,
       pch = 1 + 1 * outlier.vector)
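
# Sketch (illustrative addition, not part of the package example): a stricter
# set of outliers can be read off the returned probabilities without refitting.
# Assumes outlier.probs holds one probability per observation; the cutoff 0.9
# is an arbitrary illustrative value.
probs <- unlist(o.ident$outlier.probs)
strict.outliers <- which(probs > 0.9)
length(strict.outliers)

# Sketch: default subsample share, following the formula documented for the
# share argument, evaluated for the length of the imputed series.
n <- length(data.imputed)
pmin(2 * round(n^(sqrt(2) / 2)), n) / n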

}
}
\seealso{
\code{\link[tsrobprep]{model_missing_data}},
\code{\link[tsrobprep]{impute_modelled_data}},
\code{\link[tsrobprep]{auto_data_cleaning}}
}
