\name{mvr}

\alias{mvr}

\title{Function for Mean-Variance Regularization and Variance Stabilization}

\description{
    End-user function for Mean-Variance Regularization (MVR) and Variance Stabilization by similarity statistic
    under sample group homoscedasticity or heteroscedasticity assumptions.

    Returns an object of class "\code{mvr}". Offers the option of parallel computation for improved efficiency.
}

\usage{
    mvr(data,
        block = rep(1,nrow(data)),
        tolog = FALSE,
        nc.min = 1,
        nc.max = 30,
        probs = seq(0, 1, 0.01),
        B = 100,
        parallel = FALSE,
        conf = NULL,
        verbose = TRUE, 
        seed = NULL)
}

\arguments{
  \item{data}{\code{numeric} \code{matrix} of untransformed (raw) data,
              where samples are by rows and variables (to be clustered) are by columns,
              or an object that can be coerced to such a \code{matrix} (such as a \code{numeric} \code{vector}
              or a \code{data.frame} with all \code{numeric} columns).
              Missing values (\code{NA}), Not-a-Number values (\code{NaN}) and infinite values (\code{Inf}) are not allowed.}
  \item{block}{\code{character} or \code{numeric} \code{vector}, or \code{factor}, giving the group membership of each sample
              (grouping/blocking variable). Its length must equal the data sample size, and it must have as many distinct values
              or \code{levels} as there are sample groups. Defaults to the single-group situation. See details.}
  \item{tolog}{\code{logical} scalar. Is the data to be log2-transformed first? Optional, defaults to \code{FALSE}.
             Note that negative or zero values are set to 1 before the log2 transformation is applied.}
  \item{nc.min}{Positive \code{integer} scalar of the minimum number of clusters. Defaults to 1.}
  \item{nc.max}{Positive \code{integer} scalar of the maximum number of clusters. Defaults to 30.}
  \item{probs}{\code{numeric} \code{vector} of probabilities for quantile diagnostic plots. Defaults to \code{seq(0, 1, 0.01)}.}
  \item{B}{Positive \code{integer} scalar of the number of Monte Carlo replicates of the inner loop
           of the sim statistic function (see details).}
  \item{parallel}{\code{logical} scalar. Is parallel computing to be performed? Optional, defaults to \code{FALSE}.}
    \item{conf}{
        \code{list} of 5 fields containing the parameters values needed for creating the parallel backend (cluster configuration).
        See details below for usage. Optional, defaults to \code{NULL}, but all fields are required if used:
        \itemize{
            \item \code{type} : \code{character} \code{vector} specifying the cluster type ("SOCKET", "MPI").
            \item \code{spec} : A specification (\code{character} \code{vector} or \code{integer} scalar) appropriate to the type of cluster.
            \item \code{homogeneous} : \code{logical} scalar to be set to \code{FALSE} for inhomogeneous clusters.
            \item \code{verbose} : \code{logical} scalar to be set to \code{FALSE} for quiet mode.
            \item \code{outfile} : \code{character} \code{vector} of an output log file name to which the stdout and stderr
                  connection output from the worker nodes is directed. "" indicates no redirection.
        }
    }
  \item{verbose}{\code{logical} scalar. Is the output to be verbose? Optional, defaults to \code{TRUE}.}
  \item{seed}{Positive \code{integer} scalar of the user seed to reproduce the results.}
}

\details{
    Argument \code{block} will be converted to a factor, whose levels will match the data groups. It defaults to a single group situation, 
    that is, under the assumption of equal variance between sample groups. All group sample sizes must be greater than 1, 
    otherwise the program will stop.
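
    The conversion and the group-size requirement can be illustrated with base R alone. This is a minimal sketch, not the internal
    code of \code{mvr}; the grouping of 6 samples into groups "M" and "S" is a hypothetical example:
    \preformatted{
# Hypothetical grouping of 6 samples into two groups "M" and "S"
block <- c("M", "M", "M", "S", "S", "S")
block <- as.factor(block)        # levels match the sample groups
group.sizes <- table(block)      # number of samples per group

# Every group needs more than one sample
if (any(group.sizes <= 1)) {
    stop("All group sample sizes must be greater than 1")
}
    }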

    Argument \code{nc.max} currently defaults to 30. Empirically, we found that this is enough for most datasets tested.
    Whether it is enough depends on (i) the dimensionality/sample size ratio \eqn{\frac{p}{n}}, (ii) the signal/noise ratio, and
    (iii) whether a pre-transformation has been applied (see \cite{Dazard, J-E. and J. S. Rao (2012)} for more details).
    Use the cluster diagnostic function \code{\link[MVR]{cluster.diagnostic}} to assess whether a larger value of \code{nc.max} is required.

    The function \code{mvr} relies on the R package \pkg{parallel} to create a parallel backend within an R session. This gives access to a cluster
    of compute cores and/or nodes on local and/or remote machines, so that execution scales with the number of CPU cores available.
    To run a procedure in parallel (with parallel RNG), argument \code{parallel} must be set to \code{TRUE} and argument \code{conf}
    must be specified (i.e. non-\code{NULL}). Argument \code{conf} uses the options described in function \code{makeCluster} of the R packages
    \pkg{parallel} and \pkg{snow}. \pkg{MVR} supports two types of communication mechanisms between master and worker processes:
    'Socket' or 'Message-Passing Interface' ('MPI'). In \pkg{MVR}, parallel 'Socket' clusters use socket communication only
    (no forking) and are therefore available on all platforms, including Windows, while parallel 'MPI' clusters use high-speed interconnects
    in networks of computers (with distributed memory) and are therefore available only on such architectures. A parallel 'MPI'
    cluster also requires the R package \pkg{Rmpi} to be installed first. Field \code{type} is used to set up a cluster of type 'Socket' ("SOCKET")
    or 'MPI' ("MPI"). Depending on this type, \code{spec} takes one of the following forms:
    \itemize{
        \item For 'Socket' clusters (\code{conf$type="SOCKET"}), \code{spec} should be a \code{character} \code{vector} naming the hosts on which
              to run the job. For a single local machine, one may use the host name "localhost".
              Each host name can be repeated up to the number of CPU cores available on that machine.
              \code{spec} can also be an \code{integer} scalar specifying the number of processes to spawn on the local machine,
              or a list of machine specifications (each with a character value named host specifying the name or address of the host to use).
        \item For 'MPI' clusters (\code{conf$type="MPI"}), \code{spec} should be an \code{integer} scalar
              specifying the total number of processes to be spawned across the network of available nodes, counting the worker nodes and the master node.
    }
    The actual creation of the cluster, its initialization, and closing are all done internally. For more details, 
    see the reference manual of R package \pkg{snow} and examples below.

    When random number generation is needed, separate streams of parallel RNG are created internally, one per node, by
    distributing the stream states to the nodes. For more details, see the vignette of R package \pkg{parallel}.
    Using a seed makes the results reproducible within the same type of session: the same seed will reproduce the same results within
    a non-parallel session or within a parallel session, but it will not necessarily give exactly the same results (up to sampling variability)
    between a non-parallelized and a parallelized session, because the seed is managed differently in the two cases (see parallel RNG and
    the value of the returned seed below).
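
    The idea of one reproducible RNG stream per worker can be sketched with the \pkg{parallel} package alone. The following is a minimal
    illustration, not the internal code of \code{mvr}; the helper \code{draw.streams} is hypothetical and only shows that streams derived
    from the same seed are identical across calls, yet distinct from one another:
    \preformatted{
library(parallel)

# Hypothetical helper: draw n uniforms from each of 'nstreams' streams
# derived from a single user seed
draw.streams <- function(seed, nstreams = 2, n = 3) {
    old.kind <- RNGkind("L'Ecuyer-CMRG")   # generator with multiple-stream support
    on.exit(RNGkind(old.kind[1]))          # restore the previous generator on exit
    set.seed(seed)
    state <- .Random.seed                  # start of the first stream
    out <- vector("list", nstreams)
    for (i in seq_len(nstreams)) {
        assign(".Random.seed", state, envir = globalenv())
        out[[i]] <- runif(n)               # draws taken from stream i
        state <- nextRNGStream(state)      # start of the next stream
    }
    out
}

a <- draw.streams(seed = 1234)
b <- draw.streams(seed = 1234)
identical(a, b)             # same seed, same streams
identical(a[[1]], a[[2]])   # separate streams, separate draws
    }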
}

\value{
    \item{Xraw}{\code{numeric} \code{matrix} of original data.}
    \item{Xmvr}{\code{numeric} \code{matrix} of MVR-transformed data.}
    \item{centering}{\code{numeric} \code{vector} of centering values for standardization (cluster mean of pooled sample mean).}
    \item{scaling}{\code{numeric} \code{vector} of scaling values for standardization (cluster mean of pooled sample std dev).}
    \item{MVR}{\code{list} (of size the number of groups) containing for each group:
        \itemize{
            \item membership {\code{numeric} \code{vector} of cluster membership of each variable}
            \item nc {Positive \code{integer} scalar of number of clusters found in optimal cluster configuration}
            \item gap {\code{numeric} \code{vector} of the similarity statistic values}
            \item sde {\code{numeric} \code{vector} of the standard errors of the similarity statistic values}
            \item mu.std {\code{numeric} \code{matrix} (\code{K} x p) of the vector of standardized means by groups (rows), 
                          where \code{K} = \#groups and \code{p} = \#variables}
            \item sd.std {\code{numeric} \code{matrix} (\code{K} x p) of the vector of standardized standard deviations by groups (rows),
                          where \code{K} = \#groups and \code{p} = \#variables}
            \item mu.quant {\code{numeric} \code{matrix} (\code{nc.max} - \code{nc.min} + 1) x (length(\code{probs})) of quantiles of means}
            \item sd.quant {\code{numeric} \code{matrix} (\code{nc.max} - \code{nc.min} + 1) x (length(\code{probs})) of quantiles of standard deviations}
        }
    }
    \item{block}{Value of argument \code{block}.}
    \item{tolog}{Value of argument \code{tolog}.}
    \item{nc.min}{Value of argument \code{nc.min}.}
    \item{nc.max}{Value of argument \code{nc.max}.}
    \item{probs}{Value of argument \code{probs}.}
    \item{seed}{User seed(s) used: \code{integer} of a single value, if parallelization is used. 
                \code{integer} \code{vector} of values, one for each replication, if parallelization is not used.}
}

\author{
    \itemize{
        \item "Jean-Eudes Dazard, Ph.D." \email{jean-eudes.dazard@case.edu}
        \item "Hua Xu, Ph.D." \email{huaxu77@gmail.com}
        \item "Alberto Santana, MBA." \email{ahs4@case.edu}
    }
    Maintainer: "Jean-Eudes Dazard, Ph.D." \email{jean-eudes.dazard@case.edu}    
}

\references{
    \itemize{
        \item Dazard J-E. and J. S. Rao (2010). "\emph{Regularized Variance Estimation and Variance Stabilization of High-Dimensional Data.}"
              In JSM Proceedings, Section for High-Dimensional Data Analysis and Variable Selection. 
              Vancouver, BC, Canada: American Statistical Association IMS - JSM, 5295-5309.
        \item Dazard J-E., Hua Xu and J. S. Rao (2011). "\emph{R package MVR for Joint Adaptive Mean-Variance Regularization and Variance Stabilization.}"
              In JSM Proceedings, Section for Statistical Programmers and Analysts. 
              Miami Beach, FL, USA: American Statistical Association IMS - JSM, 3849-3863.
        \item Dazard J-E. and J. S. Rao (2012). "\emph{Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data.}"
              Comput. Statist. Data Anal. 56(7):2317-2333.
    }
}

\section{Acknowledgments}{
    This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. 
    This project was partially funded by the National Institutes of Health (P30-CA043703).
}

\note{
    End-user function.
}

\seealso{
    \itemize{
        \item \code{makeCluster} (R package \pkg{parallel}).
        \item \code{justvsn} (R package \pkg{vsn}) Variance stabilization and calibration for microarray data \cite{Huber, 2002}
    }
}

\examples{
#===================================================
# Loading the library and its dependencies
#===================================================
library("MVR")

\dontrun{
    #===================================================
    # MVR package news
    #===================================================
    MVR.news()

    #================================================
    # MVR package citation
    #================================================
    citation("MVR")

    #===================================================
    # Loading of the Synthetic and Real datasets
    # Use help for descriptions
    #===================================================
    data("Synthetic", "Real", package="MVR")
    ?Synthetic
    ?Real
}

#===================================================
# Mean-Variance Regularization (Synthetic dataset)
# Single-Group Assumption
# Assuming equal variance between groups
# Without cluster usage
#===================================================
nc.min <- 1
nc.max <- 10
probs <- seq(0, 1, 0.01)
n <- 10
mvr.obj <- mvr(data = Synthetic,
               block = rep(1,n),
               tolog = FALSE,
               nc.min = nc.min,
               nc.max = nc.max,
               probs = probs,
               B = 100,
               parallel = FALSE,
               conf = NULL,
               verbose = TRUE,
               seed = 1234)

\dontrun{
    #===================================================
    # Examples of parallel backend parametrization 
    #===================================================
    if (require("parallel")) {
       print("'parallel' is attached correctly \n")
    } else {
       stop("'parallel' must be attached first \n")
    }
    #===================================================
    # Ex. #1 - Multicore PC
    # Running WINDOWS
    # SOCKET communication cluster
    # Shared memory parallelization
    #===================================================
    cpus <- detectCores(logical = TRUE)
    conf <- list("spec" = rep("localhost", cpus),
                 "type" = "SOCKET",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Ex. #2 - Master node + 3 Worker nodes cluster
    # All nodes equipped with identical setups of multicores 
    # (8 core CPUs per machine for a total of 32)
    # SOCKET communication cluster
    # Distributed memory parallelization
    #===================================================
    masterhost <- Sys.getenv("HOSTNAME")
    slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
    nodes <- length(slavehosts) + 1
    cpus <- 8
    conf <- list("spec" = c(rep(masterhost, cpus),
                            rep(slavehosts, cpus)),
                 "type" = "SOCKET",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Ex. #3 - Enterprise Multinode Cluster w/ multicore/node  
    # Running LINUX with SLURM scheduler
    # MPI communication cluster
    # Distributed memory parallelization
    #==================================================
    if (require("Rmpi")) {
        print("'Rmpi' is attached correctly \n")
    } else {
        stop("'Rmpi' must be attached first \n")
    }
    # Below, variable 'cpus' is the total number of requested 
    # tasks (threads/CPUs), which is specified from within a 
    # SLURM script.
    cpus <- as.numeric(Sys.getenv("SLURM_NTASKS"))
    conf <- list("spec" = cpus,
                 "type" = "MPI",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Mean-Variance Regularization (Real dataset)
    # Multi-Group Assumption
    # Assuming unequal variance between groups
    #===================================================
    nc.min <- 1
    nc.max <- 30
    probs <- seq(0, 1, 0.01)
    n <- 6
    GF <- factor(gl(n = 2, k = n/2, length = n),
                 ordered = FALSE,
                 labels = c("M", "S"))
    mvr.obj <- mvr(data = Real,
                   block = GF,
                   tolog = FALSE,
                   nc.min = nc.min,
                   nc.max = nc.max,
                   probs = probs,
                   B = 100,
                   parallel = TRUE,
                   conf = conf,
                   verbose = TRUE,
                   seed = 1234)
    }
}

\keyword{Mean-Variance Estimators}
\keyword{Regularization}
\keyword{Variance Stabilization}
\keyword{Normalization}
\keyword{Parallel Programming}
\keyword{High Performance Computing}

