% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/clusterMI.R
\name{clusterMI}
\alias{clusterMI}
\title{Cluster analysis and pooling after multiple imputation}
\usage{
clusterMI(
  output,
  method.clustering = "kmeans",
  method.consensus = "NMF",
  scaling = TRUE,
  nb.clust = NULL,
  Cboot = 50,
  method.hclust = "average",
  method.dist = "euclidean",
  modelNames = NULL,
  modelName.hc = "VVV",
  nstart.kmeans = 100,
  iter.max.kmeans = 10,
  m.cmeans = 2,
  samples.clara = 500,
  nnodes = 1,
  instability = TRUE,
  verbose = TRUE,
  nmf.threshold = 10^(-5),
  nmf.nstart = 100,
  nmf.early_stop_iter = 10,
  nmf.initializer = "random",
  nmf.batch_size = NULL,
  nmf.iter.max = 50
)
}
\arguments{
\item{output}{an output from the imputedata function}

\item{method.clustering}{a single string specifying the clustering algorithm used ("kmeans", "pam", "clara", "hclust", "mixture" or "cmeans")}

\item{method.consensus}{a single string specifying the consensus method used to pool the contributory partitions ("NMF" or "CSPA")}

\item{scaling}{boolean. If TRUE, variables are scaled. Default value is TRUE}

\item{nb.clust}{an integer specifying the number of clusters}

\item{Cboot}{an integer specifying the number of bootstrap replications. Default value is 50}

\item{method.hclust}{character string defining the clustering method for hierarchical clustering (required only if method.clustering = "hclust")}

\item{method.dist}{character string defining the method used for computing dissimilarity matrices in hierarchical clustering (required only if method.clustering = "hclust")}

\item{modelNames}{character string indicating the models to be fitted in the EM phase of clustering (required only if method.clustering = "mixture"). By default modelNames = NULL.}

\item{modelName.hc}{a character string indicating the model to be used in model-based agglomerative hierarchical clustering (required only if method.clustering = "mixture"). By default modelName.hc = "VVV".}

\item{nstart.kmeans}{how many random sets should be chosen for kmeans initialization. Default value is 100 (required only if method.clustering = "kmeans")}

\item{iter.max.kmeans}{how many iterations should be chosen for kmeans. Default value is 10 (required only if method.clustering = "kmeans")}

\item{m.cmeans}{degree of fuzzification in cmeans clustering. By default m.cmeans = 2}

\item{samples.clara}{number of samples to be drawn from the dataset when performing clustering using clara algorithm. Default value is 500.}

\item{nnodes}{number of CPU cores for parallel computing. By default, nnodes = 1}

\item{instability}{a boolean indicating if cluster instability must be computed. Default value is TRUE}

\item{verbose}{a boolean. If TRUE, a message is printed at each step. Default value is TRUE}

\item{nmf.threshold}{Default value is 10^(-5)}

\item{nmf.nstart}{Default value is 100}

\item{nmf.early_stop_iter}{Default value is 10}

\item{nmf.initializer}{Default value is "random"}

\item{nmf.batch_size}{Default value is NULL}

\item{nmf.iter.max}{Default value is 50}
}
\value{
A list with three objects
\item{part}{the consensus partition}
\item{instability}{a list of four objects: \code{U} the within instability measure for each imputed data set, \code{Ubar} the associated average, \code{B} the between instability measure, \code{Tot} the total instability measure}
\item{call}{the matching call}
}
\description{
From a list of imputed data sets, \code{clusterMI} performs cluster analysis on each imputed data set, estimates the instability of each partition using bootstrap (following Fang, Y. and Wang, J., 2012 <doi:10.1016/j.csda.2011.09.003>) and pools results as proposed in Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1>.
}
\details{
\code{clusterMI} performs cluster analysis (according to the \code{method.clustering} argument) and pooling after multiple imputation. To do so, the \code{clusterMI} function takes as input the output of the \code{imputedata} function and then
\enumerate{
 \item applies the cluster analysis method on each imputed data set
 \item pools contributory partitions using the chosen consensus method (non-negative matrix factorization by default)
 \item computes the instability of each partition by bootstrap
 \item computes the total instability
 }
 
Step 1 can be tuned by specifying the cluster analysis method used (\code{method.clustering} argument).
If \code{method.clustering = "kmeans"} or \code{"pam"}, then the number of clusters can be specified by tuning the \code{nb.clust} argument. By default, the same number as the one used for imputation is used.
The number of random initializations can also be tuned through the \code{nstart.kmeans} argument.
If \code{method.clustering = "hclust"} (hierarchical clustering), the agglomeration method can be specified through the \code{method.hclust} argument (see \code{\link[stats]{hclust}}). By default \code{"average"} is used. Furthermore, the number of clusters can be specified, but it can also be automatically chosen if \code{nb.clust} < 0.
If \code{method.clustering = "mixture"} (model-based clustering using Gaussian mixture models), the model to be fitted can be tuned by modifying the \code{modelNames} argument (see \code{\link[mclust]{Mclust}}).
If \code{method.clustering = "cmeans"} (clustering using the fuzzy c-means algorithm), then the fuzziness parameter can be modified by tuning the \code{m.cmeans} argument. By default, \code{m.cmeans = 2}.

Step 2 performs consensus clustering, by default by Non-Negative Matrix Factorization, following Li, Ding and Jordan (2007) <doi:10.1109/ICDM.2007.98>.

Step 3 applies the \code{\link[fpc]{nselectboot}} function on each imputed data set and returns the instability of each partition obtained at step 1. The method is based on bootstrap sampling, following Fang, Y. and Wang, J. (2012) <doi:10.1016/j.csda.2011.09.003>. The number of bootstrap replications can be tuned using the \code{Cboot} argument.

Step 4 averages the previous instability measures, giving a within instability (\code{Ubar}), and computes a between instability (\code{B}) and a total instability (\code{Tot} = \code{B} + \code{Ubar}). See Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1> for details.
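
As a sketch of step 4 in symbols (the notation \eqn{M} for the number of imputed data sets and \eqn{U_m} for the instability obtained on the \eqn{m}-th one is introduced here for illustration only), the within instability is the average of the \code{U} values, and the total instability (\code{Tot} in the returned list) is its sum with \code{B}:
\deqn{\bar{U} = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad Tot = B + \bar{U}}{Ubar = (U_1 + ... + U_M) / M,  Tot = B + Ubar}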

All steps can be performed in parallel by specifying the number of CPU cores (\code{nnodes} argument). Steps 3 and 4 are the most time-consuming. To compute only steps 1 and 2, use \code{instability = FALSE}.
}
\examples{
data(wine)

require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
m <- 5 # number of imputed data sets. Should be larger in practice
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)

#imputation
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)
\donttest{
#analysis by kmeans and pooling
nnodes <- 2 # parallel::detectCores()
res.pool <- clusterMI(res.imp, nnodes = nnodes)

res.pool$instability
table(ref, res.pool$part)
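
# The calls below are an illustrative sketch of other documented options.
# They reuse res.imp, ref and res.pool from above; the argument values are
# chosen for the example only and are not recommendations.

# components of the instability list described in the Value section
res.pool$instability$Ubar
res.pool$instability$B
res.pool$instability$Tot

# hierarchical clustering pooled with CSPA, steps 1 and 2 only
# (assumes method.hclust accepts the method names of stats::hclust)
res.hc <- clusterMI(res.imp,
                    method.clustering = "hclust",
                    method.hclust = "ward.D2",
                    method.consensus = "CSPA",
                    instability = FALSE,
                    verbose = FALSE)
table(ref, res.hc$part)

# fuzzy c-means with a fuzziness parameter below the default of 2
res.cm <- clusterMI(res.imp,
                    method.clustering = "cmeans",
                    m.cmeans = 1.5,
                    instability = FALSE,
                    verbose = FALSE)

# cross-tabulate the kmeans and cmeans consensus partitions
table(res.pool$part, res.cm$part)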
}
}
\references{
Audigier, V. and Niang, N. (2022) Clustering with missing data: which equivalent for Rubin's rules? Advances in Data Analysis and Classification <doi:10.1007/s11634-022-00519-1>
  
  Fang, Y. and Wang, J. (2012) Selection of the number of clusters via the bootstrap method. Computational Statistics and Data Analysis, 56, 468-477. <doi:10.1016/j.csda.2011.09.003>
  
  Li, T., Ding, C. and Jordan, M. I. (2007) Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM'07, pages 577-582, USA. IEEE Computer Society. <doi:10.1109/ICDM.2007.98>
}
\seealso{
\code{\link[stats]{hclust}}, \code{\link[fpc]{nselectboot}}, \code{\link[mclust]{Mclust}}, \code{\link{imputedata}}, \code{\link[e1071]{cmeans}}
}
