% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dtwclust.R
\name{dtwclust}
\alias{dtwclust}
\title{Time series clustering}
\usage{
dtwclust(data = NULL, type = "partitional", k = 2L, method = "average",
  distance = "dtw_basic", centroid = "pam", preproc = NULL, dc = NULL,
  control = NULL, seed = NULL, distmat = NULL, ...)
}
\arguments{
\item{data}{A list of series, a numeric matrix or a data frame. Matrices and data frames are
coerced row-wise.}

\item{type}{What type of clustering method to use: \code{"partitional"}, \code{"hierarchical"},
\code{"tadpole"} or \code{"fuzzy"}.}

\item{k}{Number of desired clusters. It may be a numeric vector with different values.}

\item{method}{Character vector with one or more linkage methods to use in hierarchical procedures
(see \code{\link[stats]{hclust}}) or a function that performs hierarchical clustering based on
distance matrices (e.g. \code{\link[cluster]{diana}}). See Hierarchical section for more
details.}

\item{distance}{A supported distance from \code{proxy}'s \code{\link[proxy]{dist}} (see Distance
section). Ignored for \code{type} = \code{"tadpole"}.}

\item{centroid}{Either a supported string or an appropriate function to calculate centroids when
using partitional/fuzzy clustering, or prototypes for hierarchical/tadpole methods. See Centroid
section.}

\item{preproc}{Function to preprocess data. Defaults to \code{\link{zscore}} \emph{only} if
\code{centroid} \code{=} \code{"shape"}, but will be replaced by a custom function if provided.
See Preprocessing section.}

\item{dc}{Cutoff distance for the \code{\link{TADPole}} algorithm.}

\item{control}{Named list of parameters or \code{dtwclustControl} object for clustering
algorithms. See \code{\link{dtwclustControl}}. \code{NULL} means defaults.}

\item{seed}{Random seed for reproducibility.}

\item{distmat}{If a cross-distance matrix is already available, it can be provided here so it's
re-used. Only relevant if \code{centroid} = "pam" or \code{type} = "hierarchical". See
examples.}

\item{...}{Additional arguments to pass to \code{\link[proxy]{dist}} or a custom function
(preprocessing, centroid, etc.)}
}
\value{
An object with formal class \code{\link{dtwclust-class}}.

A list of objects is returned if \code{control@nrep > 1} and a partitional procedure is used, if
\code{length(method)} \code{> 1} and hierarchical procedures are used, or if \code{length(k)}
\code{>} \code{1}.
}
\description{
This is the original main function to perform time series clustering. It supports partitional,
hierarchical, fuzzy, k-Shape and TADPole clustering. See \code{\link{tsclust}} for the new
interface. Please note that possible updates will only be implemented in the new function.
}
\details{
Partitional and fuzzy clustering procedures use a custom implementation. Hierarchical clustering
is done with \code{\link[stats]{hclust}}. TADPole clustering uses the \code{\link{TADPole}}
function. Specifying \code{type} = \code{"partitional"}, \code{distance} = \code{"sbd"} and
\code{centroid} = \code{"shape"} is equivalent to the k-Shape algorithm (Paparrizos and Gravano
2015).

The \code{data} may be a matrix, a data frame or a list. Matrices and data frames are coerced
row-wise to a list. Only lists can have series with different lengths or multiple dimensions.
Most of the optimizations require series to have the same length, so consider reinterpolating
them to save some time (see Ratanamahatana and Keogh 2004; \code{\link{reinterpolate}}). No
missing values are allowed.

In the case of multivariate time series, they should be provided as a list of matrices, where
time spans the rows of each matrix and the variables span the columns. At the moment, only
\code{DTW}, \code{DTW2} and \code{GAK} support such series, which means only partitional and
hierarchical procedures using those distances will work. You can of course create your own custom
distances. All included centroid functions should work with the aforementioned format, although
\code{shape} is \strong{not} recommended. Note that the \code{plot} method will simply append all
dimensions (columns) one after the other.
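
As a minimal sketch of this layout, assuming made-up random data (the names \code{n_obs} and
\code{mv_series} are illustrative only), three bivariate series of different lengths could be
built like this:

\preformatted{
set.seed(1)                      # only to make the toy data reproducible
n_obs <- c(50L, 60L, 55L)        # a list allows series of different lengths

# One matrix per series: time along the rows, variables along the columns
mv_series <- lapply(n_obs, function(n) {
  cbind(x = cumsum(rnorm(n)), y = cumsum(rnorm(n)))
})

sapply(mv_series, dim)           # first row: time points, second row: variables
}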

Several parameters can be adjusted with the \code{control} argument. See
\code{\link{dtwclustControl}}. In the following sections, elements marked with an asterisk (*)
are those that can be adjusted with this argument.
}
\section{Partitional Clustering}{


  Stochastic algorithm that creates a hard partition of the data into \code{k} clusters, where
  each cluster has a centroid. In case of time series clustering, the centroids are also time
  series.

  The cluster centroids are first randomly initialized by selecting some of the series in the
  data. The distance between each series and each centroid is calculated, and the series are
  assigned to the cluster whose centroid is closest. The centroids are then updated by using a
  given rule, and the procedure is repeated until no series changes from one cluster to another,
  or until the maximum number of iterations* has been reached. The distance and centroid
  definitions can be specified through the corresponding parameters of this function. See their
  respective sections below.

  Note that it is possible for a cluster to become empty, in which case a new cluster is
  reinitialized randomly. However, if you see that the algorithm doesn't converge or the overall
  distance sum increases, it could mean that the chosen value of \code{k} is too large, or the
  chosen distance measure is not able to assess similarity effectively. The random
  reinitialization attempts to enforce a certain number of clusters, but can result in
  instability in the aforementioned cases.
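
  The assignment/update loop described above can be sketched in base R. This is only an
  illustration under simplifying assumptions (Euclidean distance, pointwise mean centroids, toy
  data, and no reinitialization of empty clusters); it is not the implementation used by this
  function.

\preformatted{
# Toy data: six equal-length series forming two obvious groups (made-up values)
series <- list(c(0, 0, 1), c(0, 1, 1), c(0, 0, 0),
               c(9, 9, 8), c(9, 8, 8), c(9, 9, 9))
l2 <- function(x, y) sqrt(sum((x - y)^2))

set.seed(42)                                     # the initialization is random
k <- 2L
centroids <- series[sample(length(series), k)]   # pick k series as initial centroids
cl_id <- integer(length(series))

for (iter in seq_len(100L)) {                    # maximum number of iterations
  # Distance between each series (rows) and each centroid (columns)
  d <- sapply(centroids, function(cent) sapply(series, l2, y = cent))
  new_id <- max.col(-d, ties.method = "first")   # index of the closest centroid
  if (all(new_id == cl_id)) break                # no series changed cluster
  cl_id <- new_id
  # Update rule: pointwise mean of the members of each cluster
  for (j in seq_len(k)) {
    members <- series[cl_id == j]
    if (length(members) > 0L)
      centroids[[j]] <- Reduce("+", members) / length(members)
  }
}

cl_id                                            # final hard partition
}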
}

\section{Fuzzy Clustering}{


  This procedure is very similar to partitional clustering, except that each series no longer
  belongs exclusively to one cluster, but belongs to each cluster to a certain degree. For each
  series, the total degree of membership across clusters must sum to 1.

  The default implementation uses the fuzzy c-means algorithm. In its definition, an objective
  function is to be minimized. The objective is defined in terms of a squared distance, which is
  usually the Euclidean (L2) distance, although the definition could be modified. The
  \code{distance} parameter of this function controls the one that is utilized. The fuzziness of
  the clustering can be controlled by means of the fuzziness exponent*. Bear in mind that the
  centroid definition of fuzzy c-means requires equal dimensions, which means that all series
  must have the same length. This problem can be circumvented by applying transformations to the
  series (see for example D'Urso and Maharaj (2009)).

  Note that the fuzzy clustering could be transformed to a crisp one by finding the highest
  membership coefficient. Some of the slots of the object returned by this function assume this,
  so be careful with interpretation (see \code{\link{dtwclust-class}}).
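
  As a small sketch of that transformation, assuming a made-up membership matrix \code{u} with
  one row per series and one column per cluster (the name is illustrative only):

\preformatted{
# Hypothetical fuzzy memberships of 3 series in 3 clusters
u <- rbind(c(0.70, 0.20, 0.10),
           c(0.10, 0.30, 0.60),
           c(0.25, 0.50, 0.25))

# Each row must sum to 1
stopifnot(all(abs(rowSums(u) - 1) < 1e-8))

# Crisp partition: the cluster with the highest membership for each series
crisp <- max.col(u, ties.method = "first")
}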
}

\section{Hierarchical Clustering}{


  This is (by default) a deterministic algorithm that creates a hierarchy of groups by using
  different linkage methods (see \code{\link[stats]{hclust}}). The linkage method is controlled
  through the \code{method} parameter of this function, which can be a character vector with
  several methods, with the additional option "all" that uses all of the available methods in
  \code{\link[stats]{hclust}}. The distance to be used can be controlled with the \code{distance}
  parameter.

  Optionally, \code{method} may be a \strong{function} that performs the hierarchical clustering
  based on a distance matrix, such as the functions included in package \pkg{cluster}. The
  function will receive the \code{dist} object as first argument (see
  \code{\link[stats]{as.dist}}), followed by the elements in \code{...} that match its formal
  arguments. The object it returns must support the \code{\link[stats]{as.hclust}} generic so
  that \code{\link[stats]{cutree}} can be used. See the examples.

  The hierarchy does not imply a specific number of clusters, but one can be induced by cutting
  the resulting dendrogram (see \code{\link[stats]{cutree}}). This results in a crisp partition,
  and some of the slots of the returned object are calculated by cutting the dendrogram so that
  \code{k} clusters are created.
}

\section{TADPole Clustering}{


  TADPole clustering adopts a relatively new clustering framework and adapts it to time series
  clustering with DTW. Because of the way it works, it can be considered a kind of Partitioning
  Around Medoids (PAM). This means that the cluster centroids are always elements of the data.
  However, this algorithm is deterministic, depending on the value of the cutoff distance
  \code{dc}, which can be controlled with the corresponding parameter of this function.

  The algorithm relies on the DTW lower bounds, which are only defined for time series of equal
  length. Additionally, it requires a window constraint* for DTW. See the Sakoe-Chiba constraint
  section below. Unlike the other algorithms, TADPole always uses DTW2 as distance (with a
  symmetric1 step pattern).
}

\section{Centroid Calculation}{


  In the case of partitional/fuzzy algorithms, a suitable function should calculate the cluster
  centroids at every iteration. In this case, the centroids are themselves time series. Fuzzy
  clustering uses the standard fuzzy c-means centroid by default.

  In either case, a custom function can be provided. If one is provided, it will receive the
  following parameters with the shown names (examples for partitional clustering are shown in
  parentheses):

  \itemize{
    \item \code{"x"}: The \emph{whole} data list (\code{list(ts1, ts2, ts3)})
    \item \code{"cl_id"}: A numeric vector with length equal to the number of series in
      \code{data}, indicating which cluster a series belongs to (\code{c(1L, 2L, 2L)})
    \item \code{"k"}: The desired number of total clusters (\code{2L})
    \item \code{"cent"}: The current centroids in order, in a list (\code{list(centroid1,
      centroid2)})
    \item \code{"cl_old"}: The membership vector of the \emph{previous} iteration (\code{c(1L,
      1L, 2L)})
    \item The elements of \code{...} that match its formal arguments
  }

  In case of fuzzy clustering, the membership vectors (2nd and 5th elements above) are matrices
  with number of rows equal to the number of elements in the data, and number of columns equal to
  the number of desired clusters. Each row must sum to 1.
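
  For the partitional case, a custom centroid function could look like the following sketch. It
  assumes that the function is expected to return a list with one centroid per cluster, that all
  series have the same length, and that no cluster is empty; \code{mean_centroid} is a
  hypothetical name.

\preformatted{
mean_centroid <- function(x, cl_id, k, cent, cl_old, ...) {
  # Group the series by their current cluster id
  x_split <- split(x, factor(cl_id, levels = seq_len(k)))
  # Pointwise mean of the series in each cluster
  lapply(x_split, function(members) colMeans(do.call(rbind, members)))
}

# Stand-alone check with toy values that mirror the parameter list above
ts1 <- c(1, 2, 3)
ts2 <- c(2, 4, 6)
ts3 <- c(4, 6, 8)
new_cent <- mean_centroid(x = list(ts1, ts2, ts3), cl_id = c(1L, 2L, 2L), k = 2L,
                          cent = list(ts1, ts2), cl_old = c(1L, 1L, 2L))
}

  Such a function would then be passed through the \code{centroid} argument, e.g.
  \code{centroid = mean_centroid}.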

  The other option is to provide a character string for the custom implementations. The following
  options are available:

  \itemize{
    \item "mean": The average along each dimension. In other words, the average of all
      \eqn{x^j_i} among the \eqn{j} series that belong to the same cluster for all time points
      \eqn{t_i}.
    \item "median": The median along each dimension. Similar to mean.
    \item "shape": Shape averaging. By default, all series are z-normalized in this case, since
      the resulting centroids will also have this normalization. See
      \code{\link{shape_extraction}} for more details.
    \item "dba": DTW Barycenter Averaging. See \code{\link{DBA}} for more details.
    \item "pam": Partition around medoids (PAM). This basically means that the cluster centroids
      are always one of the time series in the data. In this case, the distance matrix can be
      pre-computed once using all time series in the data and then re-used at each iteration. It
      usually saves overhead overall.
    \item "fcm": Fuzzy c-means. Only supported for fuzzy clustering and used by default in that
      case.
    \item "fcmdd": Fuzzy c-medoids. Only supported for fuzzy clustering. It \strong{always}
      precomputes the whole cross-distance matrix.
  }

  These check for the special cases where parallelization might be desired. Note that only
  \code{shape}, \code{dba} and \code{pam} support series of different length. Also note that, for
  \code{shape} and \code{dba}, this support has a caveat: the final centroids' length will depend
  on the length of those series that were randomly chosen at the beginning of the clustering
  algorithm. For example, if the series in the dataset have a length of either 10 or 15, 2
  clusters are desired, and the initial choice selects two series with length of 10, the final
  centroids will have this same length.

  As special cases, if hierarchical or tadpole clustering is used, you can provide a centroid
  function that takes a list of series as only input and returns a single centroid series. These
  centroids are returned in the \code{centroids} slot. By default, a type of PAM centroid
  function is used.
}

\section{Distance Measures}{


  The distance measure to be used with partitional, hierarchical and fuzzy clustering can be
  modified with the \code{distance} parameter. The supported option is to provide a string, which
  must represent a compatible distance registered with \code{proxy}'s \code{\link[proxy]{dist}}.
  Registration is done via \code{\link[proxy]{pr_DB}}, and extra parameters can be provided in
  \code{...}.

  Note that you are free to create your own distance functions and register them. Optionally, you
  can use one of the following custom implementations (all registered with \code{proxy}):

  \itemize{
    \item \code{"dtw"}: DTW, optionally with a Sakoe-Chiba/Slanted-band constraint*.
    \item \code{"dtw2"}: DTW with L2 norm and optionally a Sakoe-Chiba/Slanted-band constraint*.
      Read details below.
    \item \code{"dtw_basic"}: A custom version of DTW with less functionality, but slightly
      faster. See \code{\link{dtw_basic}}.
    \item \code{"dtw_lb"}: DTW with L1 or L2 norm* and optionally a Sakoe-Chiba constraint*. Some
      computations are avoided by first estimating the distance matrix with Lemire's lower bound
      and then iteratively refining with DTW. See \code{\link{dtw_lb}}. Not suitable for
      \code{pam.precompute}* = \code{TRUE}.
    \item \code{"lbk"}: Keogh's lower bound with either L1 or L2 norm* for the Sakoe-Chiba
      constraint*.
    \item \code{"lbi"}: Lemire's lower bound with either L1 or L2 norm* for the Sakoe-Chiba
      constraint*.
    \item \code{"sbd"}: Shape-based distance. See \code{\link{SBD}} for more details.
    \item \code{"gak"}: Global alignment kernels. See \code{\link{GAK}} for more details.
  }

  DTW2 is done with \code{\link[dtw]{dtw}}, but it differs from the result you would obtain if
  you specify \code{L2} as \code{dist.method}: with \code{DTW2}, pointwise distances (the local
  cost matrix) are calculated with \code{L1} norm, \emph{each} element of the matrix is squared
  and the result is fed into \code{\link[dtw]{dtw}}, which finds the optimum warping path. The
  square root of the resulting distance is \emph{then} computed. See \code{\link{dtw2}}.
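
  The steps just described can be sketched in base R for univariate series. The recursion below
  uses the simplest step pattern, which is an assumption made only for illustration, whereas the
  actual computation is delegated to \code{\link[dtw]{dtw}}; \code{dtw2_sketch} is a hypothetical
  name.

\preformatted{
dtw2_sketch <- function(x, y) {
  # Local cost matrix with L1 norm, then each element is squared
  lcm <- outer(x, y, function(a, b) abs(a - b))^2
  n <- length(x)
  m <- length(y)
  # Accumulated cost matrix with an extra border row and column
  acc <- matrix(Inf, n + 1L, m + 1L)
  acc[1L, 1L] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      acc[i + 1L, j + 1L] <- lcm[i, j] +
        min(acc[i, j], acc[i, j + 1L], acc[i + 1L, j])
    }
  }
  # The square root is taken only at the end
  sqrt(acc[n + 1L, m + 1L])
}

d <- dtw2_sketch(c(1, 2, 3), c(2, 3, 4))
}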

  Only \code{dtw}, \code{dtw2}, \code{sbd} and \code{gak} support series of different length. The
  lower bounds are probably unsuitable for direct clustering unless series are very easily
  distinguishable.

  If you create your own distance, register it with \code{proxy}, and it includes the ellipsis
  (\code{...}) in its definition, it will receive the following parameters*:

  \itemize{
    \item \code{window.type}: Either \code{"none"} for a \code{NULL} \code{window.size}, or
      \code{"slantedband"} otherwise
    \item \code{window.size}: The provided window size
    \item \code{norm}: The provided desired norm
    \item \code{...}: Any additional parameters provided in the original call's ellipsis
  }

  Whether your function makes use of them or not is up to you.

  If you know that the distance function is symmetric, and you use a hierarchical algorithm, or a
  partitional algorithm with PAM centroids and \code{pam.precompute}* = \code{TRUE}, some time
  can be saved by calculating only half the distance matrix. Therefore, consider setting the
  symmetric* control parameter to \code{TRUE} if this is the case.
}

\section{Sakoe-Chiba Constraint}{


  A global constraint to speed up the DTW calculations is the Sakoe-Chiba band (Sakoe and Chiba,
  1978). To use it, a window size* must be defined.

  The windowing constraint uses a centered window. The function expects a value in
  \code{window.size} that represents the distance between the point considered and one of the
  edges of the window. Therefore, if, for example, \code{window.size = 10}, the warping for an
  observation \eqn{x_i} considers the points between \eqn{x_{i-10}} and \eqn{x_{i+10}}, resulting
  in \code{10(2) + 1 = 21} observations falling within the window.
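
  The arithmetic above can be checked with a few lines of base R. The position \code{i} is an
  arbitrary value chosen for illustration, far enough from the ends of the series so that the
  window is not truncated.

\preformatted{
window.size <- 10L
i <- 50L                                     # position of the observation x_i

# Indices considered when warping x_i: from x_{i-10} to x_{i+10}
idx <- (i - window.size):(i + window.size)

n_in_window <- length(idx)                   # 2 * window.size + 1
}

  The window size itself is set through the \code{control} argument, e.g.
  \code{control = list(window.size = 10L)}.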

  The computations actually use a \code{slantedband} window, which is equivalent to the
  Sakoe-Chiba one if series have equal length, and stays along the diagonal of the local cost
  matrix if series have different length.
}

\section{Preprocessing}{


  It is strongly advised to use z-normalization in case of \code{centroid = "shape"}, because the
  resulting series have this normalization (see \code{\link{shape_extraction}}). Therefore,
  \code{\link{zscore}} is the default in this case. The user can, however, specify a custom
  function that performs any transformation on the data, but the user must make sure that the
  format stays consistent, i.e. a list of time series.

  Setting \code{preproc} to \code{NULL} means no preprocessing (except for
  \code{centroid = "shape"}). A provided function will receive the data as first argument,
  followed by the contents of \code{...} that match its formal arguments.
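
  A custom preprocessing function could look like the following sketch, which takes a list of
  series and returns a list in the same format. The name \code{my_znorm} and the toy data are
  hypothetical, and the normalization shown is a plain base R version rather than
  \code{\link{zscore}}.

\preformatted{
my_znorm <- function(x, ...) {
  # x is the list of series; each one is centered and scaled separately
  lapply(x, function(s) (s - mean(s)) / sd(s))
}

toy <- list(c(1, 2, 3, 4), c(10, 20, 40))
normalized <- my_znorm(toy)
}

  It would then be passed through the \code{preproc} argument, e.g. \code{preproc = my_znorm}.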

  It is convenient to provide this function if you're planning on using the
  \code{\link[stats]{predict}} generic.
}

\section{Repetitions}{


  Due to its stochastic nature, partitional clustering is usually repeated* several times with
  different random seeds to allow for different starting points. This function uses
  \code{\link[rngtools]{RNGseq}} to obtain different seed streams for each repetition, utilizing
  the \code{seed} parameter (if provided) to initialize it. If more than one repetition is made,
  the streams are returned in an attribute called \code{rng}.

  Technically, you can also perform random repetitions for fuzzy clustering, although it might be
  difficult to evaluate the results, since they are usually evaluated relative to each other and
  not in an absolute way. Ideally, the groups wouldn't change too much once the algorithm
  converges.

  Multiple values of \code{k} can also be provided to get different partitions using any
  \code{type} of clustering.

  Repetitions are greatly optimized when PAM centroids are used and the whole distance matrix is
  precomputed*, since said matrix is reused for every repetition, and can be computed in parallel
  (see Parallel section).
}

\section{Parallel Computing}{


  Please note that running tasks in parallel does \strong{not} guarantee faster computations. The
  overhead introduced is sometimes too large, and it's better to run tasks sequentially.

  The user can register a parallel backend, for example with the \code{doParallel} package, in
  order to do the repetitions in parallel, as well as distance and some centroid calculations.

  Unless each repetition requires a few seconds, parallel computing probably isn't worth it. As
  such, I would only use this feature with \code{shape} and \code{DBA} centroids, or an expensive
  distance function like \code{DTW}.

  If you register a parallel backend, the function will also try to do the calculation of the
  distance matrices in parallel. This should work with any function registered with
  \code{\link[proxy]{dist}} via \code{\link[proxy]{pr_DB}} whose \code{loop} flag is set to
  \code{TRUE}. If the function requires special packages to be loaded, provide their names in the
  \code{packages}* slot of \code{control}. Note that "dtwclust" is always loaded in each parallel
  worker, so that doesn't need to be included. Alternatively, you may want to pre-load
  \code{dtwclust} in each worker with \code{\link[parallel]{clusterEvalQ}}.

  In case of multiple repetitions, each worker gets a repetition task. Otherwise, the tasks
  (which can be a distance matrix or a centroid calculation) are usually divided into chunks
  according to the number of workers available.
}

\section{Notes}{


  The lower bounds are defined only for time series of equal length. \code{DTW} and \code{DTW2}
  don't require this, but they are much slower to compute.

  The lower bounds are \strong{not} symmetric, and \code{DTW} is not symmetric in general.
}

\examples{
# NOTE: More examples are available in the vignette. Here are just some miscellaneous
# examples that might come in handy. They should all work, but some don't run
# automatically.

# Load data
data(uciCT)

# ====================================================================================
# Simple partitional clustering with Euclidean distance and PAM centroids
# ====================================================================================

# Reinterpolate to same length
series <- reinterpolate(CharTraj, new.length = max(lengths(CharTraj)))

# Subset for speed
series <- series[1:20]
labels <- CharTrajLabels[1:20]

# Making many repetitions
pc.l2 <- dtwclust(series, k = 4L,
                  distance = "L2", centroid = "pam",
                  seed = 3247, control = list(trace = TRUE,
                                              nrep = 10L))

# Cluster validity indices
sapply(pc.l2, cvi, b = labels)

# ====================================================================================
# Hierarchical clustering with Euclidean distance
# ====================================================================================

# Re-use the distance matrix from the previous example (all matrices are the same)
# Use all available linkage methods for function hclust
hc.l2 <- dtwclust(series, type = "hierarchical",
                  k = 4L, method = "all",
                  control = list(trace = TRUE),
                  distmat = pc.l2[[1L]]@distmat)

# Plot the best dendrogram according to variation of information
plot(hc.l2[[which.min(sapply(hc.l2, cvi, b = labels, type = "VI"))]])

# ====================================================================================
# Multivariate time series
# ====================================================================================

# Multivariate series, provided as a list of matrices
mv <- CharTrajMV[1L:20L]

# Using GAK distance
mvc <- dtwclust(mv, k = 4L, distance = "gak", seed = 390)

# Note how the variables of each series are appended one after the other in the plot
plot(mvc)

\dontrun{
# Common controls
ctrl <- new("dtwclustControl", trace = TRUE, window.size = 18L)

# ====================================================================================
# Registering a custom distance with the 'proxy' package and using it
# ====================================================================================

# Normalized asymmetric DTW distance
ndtw <- function(x, y, ...) {
    dtw::dtw(x, y, step.pattern = dtw::asymmetric,
             distance.only = TRUE, ...)$normalizedDistance
}

# Registering the function with 'proxy'
if (!proxy::pr_DB$entry_exists("nDTW"))
    proxy::pr_DB$set_entry(FUN = ndtw, names = c("nDTW"),
                           loop = TRUE, type = "metric", distance = TRUE,
                           description = "Normalized asymmetric DTW")

# Subset of (original) data for speed
pc.ndtw <- dtwclust(series[-1L], k = 4L,
                    distance = "nDTW",
                    seed = 8319,
                    control = ctrl)

# Which cluster would the first series belong to?
# Notice that newdata is provided as a list
predict(pc.ndtw, newdata = series[1L])

# ====================================================================================
# Custom hierarchical clustering
# ====================================================================================

require(cluster)

hc.diana <- dtwclust(series, type = "h", k = 4L,
                     distance = "L2", method = diana,
                     control = ctrl)

plot(hc.diana, type = "sc")

# ====================================================================================
# TADPole clustering
# ====================================================================================

pc.tadp <- dtwclust(series, type = "tadpole", k = 4L,
                    dc = 1.5, control = ctrl)

# Modify plot, show only clusters 3 and 4
plot(pc.tadp, clus = 3:4,
     labs.arg = list(title = "TADPole, clusters 3 and 4",
                     x = "time", y = "series"))

# Saving and modifying the ggplot object with custom time labels
require(scales)
t <- seq(Sys.Date(), len = length(series[[1L]]), by = "day")
gpc <- plot(pc.tadp, time = t, plot = FALSE)

gpc + scale_x_date(labels = date_format("\%b-\%Y"),
                   breaks = date_breaks("2 months"))

# ====================================================================================
# Specifying a centroid function for prototype extraction in hierarchical clustering
# ====================================================================================

# Seed is due to possible randomness in shape_extraction when selecting a basis series
hc.sbd <- dtwclust(CharTraj, type = "hierarchical",
                   k = 20L, distance = "sbd",
                   preproc = zscore, centroid = shape_extraction,
                   seed = 320L)

plot(hc.sbd, type = "sc")

# ====================================================================================
# Using parallel computation to optimize several random repetitions
# and distance matrix calculation
# ====================================================================================
require(doParallel)

# Create parallel workers
cl <- makeCluster(detectCores())
invisible(clusterEvalQ(cl, library(dtwclust)))
registerDoParallel(cl)

## Use constrained DTW and PAM (less than a second with 8 cores)
pc.dtw <- dtwclust(CharTraj, k = 20L, seed = 3251, control = ctrl)

## Use constrained DTW with DBA centroids (~3 seconds with 8 cores)
pc.dba <- dtwclust(CharTraj, k = 20L, centroid = "dba", seed = 3251, control = ctrl)

## Using distance based on global alignment kernels
## (~35 seconds with 8 cores for all repetitions)
pc.gak <- dtwclust(CharTraj, k = 20L,
                   distance = "gak",
                   centroid = "dba",
                   seed = 8319,
                   control = list(trace = TRUE,
                                  window.size = 18L,
                                  nrep = 8L))

# Stop parallel workers
stopCluster(cl)

# Return to sequential computations. This MUST be done after stopCluster()
registerDoSEQ()
}
}
\references{
Please refer to the package vignette references.
}
\seealso{
\code{\link{dtwclust-methods}}, \code{\link{dtwclust-class}}, \code{\link{dtwclustControl}},
\code{\link{dtwclustFamily}}.
}
\author{
Alexis Sarda-Espinosa
}
