% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/kccaExtendedFamily.R
\name{kccaExtendedFamily}
\alias{kccaExtendedFamily}
\title{Extending K-Centroids Clustering to (Mixed-with-)Ordinal Data}
\usage{
kccaExtendedFamily(which = c('kModes', 'kGDM2', 'kGower'),
                   cent = NULL,
                   preproc = NULL,
                   xrange = NULL,
                   xmethods = NULL,
                   trim = 0, groupFun = 'minSumClusters')
}
\arguments{
\item{which}{One of \code{'kModes'}, \code{'kGDM2'} or \code{'kGower'}, the three
predefined methods for K-centroids clustering.  For more
information on each of them, see the Details section.}

\item{cent}{Function for determining cluster centroids.

This argument is ignored for \code{which='kModes'}, and \code{centMode}
is used.  For \code{'kGDM2'} and \code{'kGower'}, \code{cent=NULL} defaults to
a general purpose optimizer.}

\item{preproc}{Preprocessing function applied to the data before
clustering.

This argument is ignored for \code{which='kGower'}. In this case,
the default preprocessing proposed by Gower (1971) and Kaufman
& Rousseeuw (1990) is conducted. For \code{'kGDM2'} and \code{'kModes'},
users can specify preprocessing steps here, though this is not
recommended.}

\item{xrange}{The range of the data in \code{x}. Options are:
\itemize{
\item \code{"all"}: uses the same minimum and maximum value for each column
of \code{x} by determining the whole range of values in the data
object \code{x}.
\item \code{"columnwise"}: uses different minimum and maximum values for
each column of \code{x} by determining the columnwise ranges of
values in the data object \code{x}.
\item A vector of \code{c(min, max)}: specifies the same minimum and maximum
value for each column of \code{x}.
\item A list of vectors \code{list(c(min1, max1), c(min2, max2),...)} with
length \code{ncol(x)}: specifies different minimum and maximum
values for each column of \code{x}.
}

This argument is ignored for \code{which='kModes'}. \code{xrange=NULL}
defaults to \code{"all"} for \code{'kGDM2'}, and to \code{"columnwise"} for
\code{'kGower'}.}

\item{xmethods}{An optional character vector of length \code{ncol(x)}
that specifies the distance measure for each column of
\code{x}. Currently only used for \code{'kGower'}. For \code{'kGower'},
\code{xmethods=NULL} results in the use of default methods for each
column of \code{x}. For more information on allowed input values,
and default measures, see the Details section.}

\item{trim}{Proportion of points trimmed in robust clustering, see
\code{\link[flexclust:kcca]{flexclust::kccaFamily()}}.}

\item{groupFun}{A character string specifying the function for
clustering.

Default is \code{'minSumClusters'}, see \code{\link[flexclust:kcca]{flexclust::kccaFamily()}}.}
}
\value{
An object of class \code{"kccaFamily"}.
}
\description{
This wrapper creates objects of class \code{"kccaFamily"},
which can be used with \code{\link[flexclust:kcca]{flexclust::kcca()}} to conduct K-centroids
clustering using the following methods:
\itemize{
\item \strong{kModes} (after Weihs et al., 2005)
\item \strong{kGower} (Gower's distance after Kaufman & Rousseeuw, 1990,
and a user-specified centroid)
\item \strong{kGDM2} (GDM2 distance after Walesiak, 1993, and a
user-specified centroid)
}
}
\details{
\strong{Wrappers} for defining families are obtained by specifying
\code{which} using:
\itemize{
\item \code{which='kModes'} creates an object for \strong{kModes} clustering,
i.e., K-centroids clustering using Simple Matching Distance
(counts of disagreements) and modes as centroids.  Argument
\code{cent} is ignored for this method.
\item \code{which='kGower'} creates an object for performing clustering
using Gower's method as described in Kaufman & Rousseeuw (1990):
\itemize{
\item Numeric and/or ordinal variables are scaled by
\eqn{\frac{\mathbf{x}-\min{\mathbf{x}}}{\max{\mathbf{x}}-\min{\mathbf{x}}}}.
Note that for ordinal variables the internal coding with values
from 1 up to their maximum level is used.
\item Distances are calculated for each column (Euclidean distance,
\code{distEuclidean}, is recommended for numeric, Manhattan
distance, \code{distManhattan} for ordinal, Simple Matching
Distance, \code{distSimMatch} for categorical, and Jaccard distance,
\code{distJaccard} for asymmetric binary variables), and they are
summed up as:

\deqn{d(x_i, x_k) = \frac{\sum_{j=1}^p \delta_{ikj} d(x_{ij},
    x_{kj})}{\sum_{j=1}^p \delta_{ikj}}}

where \eqn{p} is the number of variables and the weight
\eqn{\delta_{ikj}} is 1 if both values \eqn{x_{ij}} and
\eqn{x_{kj}} are not missing and, in the case of asymmetric
binary variables, at least one of them is not 0. Otherwise,
\eqn{\delta_{ikj}} is 0.

The columnwise distances used can be influenced in two ways. The
first is to pass a character vector of length \eqn{p} to \code{xmethods} that
specifies the distance for each column.  Options are:
\code{distEuclidean}, \code{distManhattan}, \code{distJaccard}, and
\code{distSimMatch}.  The second is to not specify any methods
within \code{kccaExtendedFamily}, but rather pass a \code{"data.frame"}
as argument \code{x} in \code{kcca}, where the class of each column is
used to infer the distance measure: \code{distEuclidean} is used on
numeric and integer columns, \code{distManhattan} on columns that
are coded as ordered factors, \code{distSimMatch} is the default for
categorically coded columns, and \code{distJaccard} is the default
for binary coded columns.

For this method, if \code{cent=NULL}, a general purpose optimizer
with \code{NA} omission is applied for centroid calculation.
}
\item \code{which='kGDM2'} creates an object for clustering using the GDM2
distance for ordinal variables. The GDM2 distance was first
introduced by Walesiak (1993), and adapted in Ernst et
al. (2025) as the distance measure within \code{\link[flexclust:kcca]{flexclust::kcca()}}.

This distance respects the ordinal nature of a variable by
conducting only relational operations to compare values, such as
\eqn{\leq}, \eqn{\geq} and \eqn{=}. By obtaining the relative
frequencies and empirical cumulative distributions of \eqn{x}, we
allow for comparison of two arbitrary values, and thus are able
to conduct K-centroids clustering. For more details, see Ernst et
al. (2025).
}

For \code{'kGDM2'} as well, if \code{cent=NULL}, a general purpose optimizer
with \code{NA} omission is applied for centroid calculation.
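
\strong{Worked example for Gower's distance}.
As an illustration of the formula given under \code{which='kGower'}, consider
a sketch with made-up values. It assumes that each columnwise distance on a
single scaled column reduces to the absolute difference for numeric and
ordinal variables, and to 0 or 1 for categorical variables. Let \eqn{p = 3},
with a numeric variable scaled to 0.2 and 0.7, an ordinal variable with five
levels coded 2 and 4 (scaled to 0.25 and 0.75 for a range of 1 to 5), and a
categorical variable on which the two observations differ. Then

\deqn{d(x_i, x_k) = \frac{1 \cdot 0.5 + 1 \cdot 0.5 + 1 \cdot 1}{1 + 1 + 1}
    = \frac{2}{3} \approx 0.67.}

If the categorical value of one observation were missing, its weight
\eqn{\delta_{ikj}} would be 0 and the distance would become
\eqn{(0.5 + 0.5)/2 = 0.5}, i.e., the remaining variables are upweighted.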

\strong{Scale handling}.
In \code{'kModes'}, all variables are treated as unordered factors.
In \code{'kGDM2'}, all variables are treated as ordered factors, with strict assumptions
regarding their ordinality.
\code{'kGower'} is currently the only method designed to handle mixed-type data. For ordinal
variables, the assumptions are more lax than with GDM2 distance.

\strong{NA handling}.
NA handling via omission and upweighting of non-missing variables is currently
only implemented for \code{'kGower'}. Within \code{'kModes'}, the omission of NA responses
can be avoided by coding missing values as valid factor levels. For \code{'kGDM2'}, currently
the only option is to omit missing values completely.
}
\examples{
# Example 1: kModes
set.seed(123)
dat <- data.frame(cont = sample(1:100, 10, replace=TRUE)/10,
                  bin_sym = as.logical(sample(0:1, 10, replace=TRUE)),
                  bin_asym = as.logical(sample(0:1, 10, replace=TRUE)),
                  ord_levmis = factor(sample(1:5, 10, replace=TRUE),
                                      levels=1:6, ordered=TRUE),
                  ord_levfull = factor(sample(1:4, 10, replace=TRUE),
                                       levels=1:4, ordered=TRUE),
                  nom = factor(sample(letters[1:4], 10, replace=TRUE),
                               levels=letters[1:4]))
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kModes'))

# Example 2: kGDM2
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kGDM2',
                                                    xrange='columnwise'))
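# Sketch: alternative ways of passing 'xrange' for kGDM2, following the
# options listed under Arguments. Only the family objects are created
# here; the object names and the chosen ranges are illustrative
# assumptions, matching the factor levels of the two ordinal columns
# 'ord_levmis' (levels 1 to 6) and 'ord_levfull' (levels 1 to 4).
fam_all <- kccaExtendedFamily('kGDM2', xrange='all')
fam_vec <- kccaExtendedFamily('kGDM2', xrange=c(1, 6))
fam_list <- kccaExtendedFamily('kGDM2', xrange=list(c(1, 6), c(1, 4)))
# The list variant expects one c(min, max) pair per column, so it would
# be used with data containing exactly these two columns, e.g.,
# dat[, c('ord_levmis', 'ord_levfull')].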
# Example 3: kGower
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower'))
nas <- sample(c(TRUE,FALSE), prod(dim(dat)), replace=TRUE, prob=c(0.1,0.9)) |> 
   matrix(nrow=nrow(dat))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower',
                                           xrange='all'))
# The case where column 2 is a binary variable, but is symmetric:
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower',
                                           xmethods=c('distEuclidean',
                                                      'distEuclidean',
                                                      'distJaccard',
                                                      'distManhattan',
                                                      'distManhattan',
                                                      'distSimMatch')))
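
# Sketch: recoding NAs as a valid factor level for kModes, as described
# under 'NA handling' in the Details section. The helper name
# 'na_as_level' and the level label 'missing' are illustrative
# assumptions, not part of the package.
na_as_level <- function(x) {
    x <- as.character(x)       # drop the original class, keep the values
    x[is.na(x)] <- 'missing'   # turn NA into an explicit category
    factor(x)
}
dat_nomiss <- as.data.frame(lapply(dat, na_as_level))
any(is.na(dat_nomiss))
# 'dat_nomiss' contains only unordered factors without NAs and could then
# be passed to flexclust::kcca() with the 'kModes' family as in Example 1.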

}
\references{
\itemize{
\item Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025).
\emph{Ordinal Clustering with the flex-Scheme.}
Austrian Journal of Statistics. \emph{Submitted manuscript}.
\item Gower, JC (1971).
\emph{A General Coefficient for Similarity and Some of Its Properties.}
Biometrics, 27(4), 857-871.
\doi{10.2307/2528823}
\item Kaufman, L, Rousseeuw, P (1990).
\emph{Finding Groups in Data: An Introduction to Cluster Analysis.}
Wiley Series in Probability and Statistics.
\doi{10.1002/9780470316801}
\item Leisch, F (2006). \emph{A Toolbox for K-Centroids Cluster Analysis.}
Computational Statistics and Data Analysis, 51(2), 526-544.
\doi{10.1016/j.csda.2005.10.006}
\item Walesiak, M (1993). \emph{Statystyczna Analiza Wielowymiarowa w Badaniach Marketingowych.}
Wydawnictwo Akademii Ekonomicznej, 44-46.
\item Weihs, C, Ligges, U, Luebke, K, Raabe, N (2005).
\emph{klaR Analyzing German Business Cycles.} In: Data Analysis and
Decision Support, Springer: Berlin. 335-343.
\doi{10.1007/3-540-28397-8_36}
}
}
\seealso{
\code{\link[flexclust:kcca]{flexclust::kcca()}},
\code{\link[flexclust:stepFlexclust]{flexclust::stepFlexclust()}},
\code{\link[flexclust:bootFlexclust]{flexclust::bootFlexclust()}}
}
