% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/delete_censoring.R
\name{delete_MAR_censoring}
\alias{delete_MAR_censoring}
\title{Create MAR values using a censoring mechanism}
\usage{
delete_MAR_censoring(
  ds,
  p,
  cols_mis,
  cols_ctrl,
  n_mis_stochastic = FALSE,
  where = "lower",
  sorting = TRUE,
  miss_cols,
  ctrl_cols
)
}
\arguments{
\item{ds}{A data frame or matrix in which missing values will be created.}

\item{p}{A numeric vector with length one or equal to length \code{cols_mis};
the probability that a value is missing.}

\item{cols_mis}{A vector of column names or indices of columns in which
missing values will be created.}

\item{cols_ctrl}{A vector of column names or indices of the columns that
control the creation of missing values in \code{cols_mis}. Must be of the
same length as \code{cols_mis}.}

\item{n_mis_stochastic}{Logical, should the number of missing values be
stochastic? If \code{n_mis_stochastic = TRUE}, the number of missing values
for a column with missing values \code{cols_mis[i]} is a random variable
with expected value \code{nrow(ds) * p[i]}. If \code{n_mis_stochastic =
FALSE}, the number of missing values will be deterministic. Normally, the
number of missing values for a column with missing values
\code{cols_mis[i]} is \code{round(nrow(ds) * p[i])}. Possible deviations
from this value, if any exist, are documented in Details.}

\item{where}{Controls where missing values are created; one of
\code{"lower"}, \code{"upper"} or \code{"both"} (see Details).}

\item{sorting}{Logical; should sorting be used (\code{TRUE}) or a quantile
as a threshold (\code{FALSE})? See Details.}

\item{miss_cols}{Deprecated, use \code{cols_mis} instead.}

\item{ctrl_cols}{Deprecated, use \code{cols_ctrl} instead.}
}
\value{
An object of the same class as \code{ds} with missing values.
}
\description{
Create missing at random (MAR) values using a censoring mechanism in a data
frame or a matrix.
}
\details{
This function creates missing at random (MAR) values in the columns
specified by the argument \code{cols_mis}.
The probability for missing values is controlled by \code{p}.
If \code{p} is a single number, then the overall probability for a value to
be missing will be \code{p} in all columns of \code{cols_mis}.
(Internally \code{p} will be replicated to a vector of the same length as
\code{cols_mis}.
So, all \code{p[i]} in the following sections will be equal to the given
single number \code{p}.)
Otherwise, \code{p} must be of the same length as \code{cols_mis}.
In this case, the overall probability for a value to be missing will be
\code{p[i]} in the column \code{cols_mis[i]}.
The position of the missing values in \code{cols_mis[i]} is controlled by
\code{cols_ctrl[i]}.
The following procedure is applied for each pair of \code{cols_ctrl[i]} and
\code{cols_mis[i]} to determine the positions of missing values:

The default behavior (\code{sorting = TRUE}) of this function is to first
sort the values of the column \code{cols_ctrl[i]}. Then missing values in
\code{cols_mis[i]} are created in the \code{round(nrow(ds) * p[i])} rows
with the smallest values of \code{cols_ctrl[i]}. This censors approximately
a proportion \code{p[i]} of the values in \code{cols_mis[i]}, namely those in
the rows with the smallest values of \code{cols_ctrl[i]}. Hence, the name of
the function.
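
As a rough illustration only (a base R sketch of the idea for a single pair
of columns, not the internal code of this function; ties and argument checks
are ignored, and the names \code{n_mis} and \code{idx} are made up for this
sketch), the default case behaves like:
\preformatted{
ds <- data.frame(X = 1:20, Y = 101:120)
n_mis <- round(nrow(ds) * 0.2)            # p = 0.2, so 4 rows
idx <- order(ds[, "Y"])[seq_len(n_mis)]   # rows with the smallest Y
ds[idx, "X"] <- NA                        # censor X in those rows
}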

If \code{where = "upper"}, the rows with the highest values are selected
instead of the rows with the smallest values. For \code{where = "both"}, one
half of the \code{round(nrow(ds) * p[i])} rows with missing values will be
the rows with the smallest values and the other half will be the rows with
the highest values. So the censored rows are divided between the highest and
smallest values of \code{cols_ctrl[i]}. For odd \code{round(nrow(ds) * p[i])},
one more value is set to \code{NA} among the smallest values.

If \code{n_mis_stochastic = TRUE} and \code{sorting = TRUE}, the procedure is
slightly altered. In this case, the \code{floor(nrow(ds) * p[i])} rows with
the smallest values (\code{where = "lower"}) are set to \code{NA} first. If
\code{nrow(ds) * p[i] > floor(nrow(ds) * p[i])}, the row with the next
greater value will be set to \code{NA} with the probability needed to obtain
an expected number of \code{nrow(ds) * p[i]} missing values. For
\code{where = "upper"}, this "random" missing value will be the next
smallest. For \code{where = "both"}, this "random" missing value will be the
next greatest of the smallest values.

If \code{sorting = FALSE}, the rows of \code{ds} will not be sorted. Instead,
a quantile will be calculated (using \code{\link[stats]{quantile}}). If
\code{where = "lower"}, \code{quantile(ds[, cols_ctrl[i]], p[i])} will be
calculated and all rows with values in \code{ds[, cols_ctrl[i]]} below this
quantile will have missing values in \code{cols_mis[i]}. For \code{where =
"upper"}, \code{quantile(ds[, cols_ctrl[i]], 1 - p[i])} will be
calculated and all rows with values above this quantile will have missing
values. For \code{where = "both"}, \code{quantile(ds[, cols_ctrl[i]],
p[i] / 2)} and \code{quantile(ds[, cols_ctrl[i]], 1 - p[i] / 2)} will be
calculated. All rows with values in \code{cols_ctrl[i]} below the first
quantile or above the second quantile will have missing values in
\code{cols_mis[i]}.

For \code{sorting = FALSE} only \code{n_mis_stochastic = FALSE} is
implemented at the moment.

The option \code{sorting = TRUE} with \code{n_mis_stochastic = FALSE} will
always create exactly \code{round(nrow(ds) * p[i])} missing values in
\code{cols_mis[i]}. With \code{n_mis_stochastic = TRUE}, sorting will result
in \code{floor(nrow(ds) * p[i])} or \code{ceiling(nrow(ds) * p[i])} missing
values in \code{cols_mis[i]}. For \code{sorting = FALSE}, the number of
missing values will normally be close to \code{nrow(ds) * p[i]}. However, if
\code{cols_ctrl[i]} contains many duplicates, \code{sorting = FALSE} can be
problematic, because only values strictly below the threshold
\code{quantile(ds[, cols_ctrl[i]], p[i])} are set to \code{NA} (see
examples). So, in most cases \code{sorting = TRUE} is recommended.
}
\examples{
ds <- data.frame(X = 1:20, Y = 101:120)
delete_MAR_censoring(ds, 0.2, "X", "Y")
# many duplicated values can be problematic for sorting = FALSE:
ds_many_dup <- data.frame(X = 1:20, Y = c(rep(0, 10), rep(1, 10)))
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y") # 4 NAs as expected
quantile(ds_many_dup$Y, 0.2) # 0
# No value is BELOW 0 in ds_many_dup$Y, so no missing values will be created:
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y", sorting = FALSE) # No NA!
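# The calls below are an additional, hedged sketch of the other options
# described in Details; their results are not shown here.
# where = "upper": censor X in the rows with the largest values of Y
delete_MAR_censoring(ds, 0.2, "X", "Y", where = "upper")
# where = "both": split the missing values between both tails of Y
delete_MAR_censoring(ds, 0.2, "X", "Y", where = "both")
# n_mis_stochastic = TRUE with nrow(ds) * p = 4.4: per Details, 4 or 5 NAs
delete_MAR_censoring(ds, 0.22, "X", "Y", n_mis_stochastic = TRUE)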
}
\references{
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P.,
  Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A
  Review by Missing Mechanism. \emph{IEEE Access}, 7, 11651-11667.
}
\seealso{
\code{\link{delete_MNAR_censoring}}

Other functions to create MAR: 
\code{\link{delete_MAR_1_to_x}()},
\code{\link{delete_MAR_one_group}()},
\code{\link{delete_MAR_rank}()}
}
\concept{functions to create MAR}
