---
title: "fuzzy-clustering"
output:
  rmarkdown::html_vignette:
    number_sections: true
vignette: >
  %\VignetteIndexEntry{fuzzy-clustering}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(fussclust)
```

# Introduction

The `fussclust` package provides methods for distance-based fuzzy clustering,
including both unsupervised and semi-supervised approaches.

The original motivation for developing the package was the lack of implementation of semi-supervised fuzzy clustering methods in `R`. However,
both unsupervised and semi-supervised fuzzy clustering share a common
optimization framework and estimation procedure. Consequently, `fussclust`
implements a unified estimation framework for fuzzy clustering models: both unsupervised and semi-supervised.

The current release of `fussclust` includes four classical models:

* Fuzzy c-Means (FCM),
* Semi-Supervised Fuzzy c-Means (SSFCM),
* Possibilistic c-Means (PCM),
* Semi-Supervised Possibilistic c-Means (SSPCM).

A typical workflow consists of the following steps:

1. Prepare a numerical feature matrix `X`.
2. For semi-supervised models, prepare a binary supervision matrix `superF`
   (see Section 4.1).
3. Fit the selected clustering model.
4. Inspect the estimated memberships `U` and prototypes `V`.

# Methodology

The fuzzy clustering methods implemented in `fussclust` are distance-based.
Membership values for $N$ observations and $C$ clusters are estimated under the
assumption that similarity can be represented by distances in a metric space:
objects located closer to each other are considered more similar.

The optimization problem is formulated as

$$
\min Q(U, V; X)
$$

where:

* $Q$ denotes the objective function,
* $U$ is the membership matrix,
* $V$ contains the cluster prototypes,
* $X$ is the feature matrix.

The optimization problem does not admit a closed-form solution. Therefore,
an iterative approximation procedure known as the Alternating Optimization (AO)
algorithm is used.

The AO procedure minimizes the corresponding partial objective functions,
yielding the matrix-valued update functions $\hat{U}$ for estimating
memberships and $\hat{V}$ for estimating cluster prototypes.

At each iteration, the algorithm alternates between computing the estimated
membership matrix $\tilde{U}$ using the update function $\hat{U}$ and
computing the estimated prototype matrix $\tilde{V}$ using the update
function $\hat{V}$.

For additional methodological details, see
[Kmita et al. (2024)](https://doi.org/10.1109/TFUZZ.2024.3370768).

The AO algorithm can be summarized as follows:

1. Initialize the algorithm with an initial membership matrix
   $U^{(0)}$.
2. Set the iteration counter $l = 1$.
3. Estimate prototypes $\tilde{V}^{(l)}$ using the formula for optimal
   prototypes $\hat{V}$ and current values of memberships.
4. Estimate memberships $\tilde{U}^{(l)}$ using the formula for optimal
   memberships $\hat{U}$ and current values of prototypes.
5. Check the convergence criterion. If the difference between
   $\tilde{U}^{(l)}$ and $\tilde{U}^{(l-1)}$ is smaller than a predefined
   threshold, stop the algorithm. Otherwise, increase $l$ by one and repeat
   Steps 3--5.

# Unsupervised models

The following examples use the well-known `iris` 
dataset [(R.A. Fisher, 1936)](https://doi.org/10.24432/C56C76). 
We consider two features: sepal length and sepal width.

## Preparing the data

`fussclust` requires the input to be of class `matrix`.

```{r}
X <- iris[, c("Sepal.Length", "Sepal.Width")] |> as.matrix()

cat(
  paste0(
    "Class of `iris`: ", class(iris),
    "; class of `X`: ",
    paste(class(X), collapse = " & ")
  )
)
```

The `iris` dataset contains three species of iris flowers. Therefore,
we set the number of clusters to `C = 3`.
In practice, however, the user may specify any number of clusters such that
$C > 1$.

```{r}
model_fcm <- fussclust::FCM(X = X, C = 3)
model_pcm <- fussclust::PCM(X = X, C = 3)
```

The estimated cluster prototypes obtained from the FCM model are:

```{r}
model_fcm$V
```

The estimated memberships for the first ten observations obtained from the PCM
model are:

```{r}
model_pcm$U[1:10, ]
```

# Semi-supervised models

## Encoding partial supervision

As described in Section 1, semi-supervised models in `fussclust` require
partial supervision encoded as a binary matrix `superF`.

The package assumes that the number of clusters $C$ equals the number of
distinct classes represented in the supervision information.

In many applications, class labels are stored in a response variable.
This is also the case for the `iris` dataset:

```{r}
iris[1:3, ]
```

The class labels are stored in the `Species` column:

```{r}
unique(iris$Species)
```

To construct the supervision matrix, the user must first define the arbitrary ordering
of the classes. Let

$$
\mathcal{Y} =
(y^1 = \texttt{setosa},
y^2 = \texttt{versicolor},
y^3 = \texttt{virginica})
$$

denote the ordered set of class labels.

The supervision matrix `superF` is defined as follows.
Let $F = [f_{jk}]$ denote the binary supervision matrix, where:

* $j = 1, \ldots, N$ indexes observations,
* $k = 1, \ldots, C$ indexes clusters.

The matrix satisfies the following properties:

* it has `nrow(X)` rows,
* it has $C$ columns,
* the $k$-th column corresponds to the $k$-th class in $\mathcal{Y}$,
* $f_{jk} = 1$ if the $j$-th observation is assumed to belong to the
  $k$-th class, and $0$ otherwise.

The matrix can be constructed as follows:

```{r}
superF <- matrix(0, nrow = nrow(iris), ncol = 3)

superF[iris$Species == "setosa", 1] <- 1
superF[iris$Species == "versicolor", 2] <- 1
superF[iris$Species == "virginica", 3] <- 1
```

## Fitting the models

When fitting a semi-supervised model, the user must provide the scaling
parameter $\alpha$ (`alpha`), which regulates the impact of partial supervision.

There is no universally optimal choice of $\alpha$, as the appropriate value
depends on the dataset and application. For further discussion, see
[Kmita et al. (2024)](https://doi.org/10.1109/TFUZZ.2024.3370768).

For this example, we set $\alpha = 1$.

Semi-Supervised Fuzzy c-Means:

```{r}
model_ssfcm <- fussclust::SSFCM(
  X = X,
  C = 3,
  alpha = 1,
  superF = superF
)
```

Semi-Supervised Possibilistic c-Means:

```{r}
model_sspcm <- fussclust::SSPCM(
  X = X,
  C = 3,
  alpha = 1,
  superF = superF
)
```

The estimated prototypes are:

```{r}
model_ssfcm$V
```

```{r}
model_sspcm$V
```

# Additional details

## The $\Gamma$ parameter in possibilistic models

The vector of cluster-specific hyperparameters

$$
\Gamma = (\gamma_1, \ldots, \gamma_C)
$$

is specific to possibilistic clustering methods.

This is an optional argument in both `fussclust::PCM()` and
`fussclust::SSPCM()`. If not provided, the default value is a unit vector:

$$
\Gamma = (1, \ldots, 1)
$$

Alternatively, the user may apply the initialization strategy proposed by
[Krishnapuram and Keller (1993)](https://doi.org/10.1109/91.227387),
which computes

$$
\gamma_k =
\frac{
\sum_{j=1}^N u_{jk}^2 d_{jk}^2
}{
\sum_{j=1}^N u_{jk}^2
}
$$

using values obtained from an initial FCM fit.

This strategy can be enabled by setting `initFCM = TRUE`.

```{r, eval = FALSE}
model_pcm_init <- fussclust::PCM(
  X = X,
  C = 3,
  initFCM = TRUE
)
```

The user may also choose to specify the values of $\Gamma$ manually, for example:

```{r, eval = FALSE}
model_pcm_custom_gammas <- fussclust::PCM(
  X = X,
  C = 3,
  gammas = c(1, 2, 3)
)
```

## Retrieving intermediate results

In addition to the final estimated memberships and prototypes
(accessible via `model$U` and `model$V`), users may inspect intermediate
results produced during the AO optimization procedure.

The following histories can be stored:

* membership matrices, accessible via `model$U_history`,
* prototype matrices, accessible via `model$V_history`,
* weight matrices, accessible via `model$Phi_history`.

By default, these objects are set to `NULL`.

```{r}
model_fcm$U_history
```

To store the optimization history, set `store_history = TRUE`.

```{r}
model_fcm_history <- fussclust::FCM(
  X = X,
  C = 3,
  store_history = TRUE
)

cat(
  paste0(
    "Class of `model_fcm_history$U_history`: ",
    class(model_fcm_history$U_history),
    "; length: ",
    length(model_fcm_history$U_history),
    "; number of AO iterations: ",
    model_fcm_history$counter
  )
)
```

To retrieve a specific intermediate result, for example the prototype matrix
$\hat{V}^{(3)}$ obtained at the third AO iteration, index the corresponding
history object directly:

```{r}
model_fcm_history$V_history[[3]]
```