% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/draft_validation.R
\name{draft_validation}
\alias{draft_validation}
\title{Draft a starter \strong{pointblank} validation .R/.Rmd file with a data table}
\usage{
draft_validation(
  tbl,
  tbl_name = NULL,
  file_name = tbl_name,
  path = NULL,
  lang = NULL,
  output_type = c("R", "Rmd"),
  add_comments = TRUE,
  overwrite = FALSE,
  quiet = FALSE
)
}
\arguments{
\item{tbl}{The input table. This can be a data frame, a tibble, a \code{tbl_dbi}
object, or a \code{tbl_spark} object.}

\item{tbl_name}{An optional name to assign to the input table object. If no
value is provided, a name will be generated based on whatever information
is available. This table name will be displayed in the header area of the
agent report generated by printing the \emph{agent} or calling
\code{\link[=get_agent_report]{get_agent_report()}}.}

\item{file_name}{An optional name for the .R or .Rmd file. This should be a
name without an extension. By default, this is taken from the \code{tbl_name}
but if nothing is supplied for that, the name will contain the text
\code{"draft_validation_"} followed by the current date and time.}

\item{path}{An optional path to the directory where the generated file should
be written. If no path is specified, the file is placed in the working
directory.}

\item{lang}{The language to use when creating comments for the automatically
generated validation steps. By default, \code{NULL} will create English (\code{"en"})
text. Other options include French (\code{"fr"}), German (\code{"de"}), Italian
(\code{"it"}), Spanish (\code{"es"}), Portuguese (\code{"pt"}), Turkish (\code{"tr"}), Chinese
(\code{"zh"}), Russian (\code{"ru"}), Polish (\code{"pl"}), Danish (\code{"da"}), Swedish
(\code{"sv"}), and Dutch (\code{"nl"}).}

\item{output_type}{An option for choosing what type of output should be
generated. By default, this is an .R script (\code{"R"}) but this could
alternatively be an R Markdown document (\code{"Rmd"}).}

\item{add_comments}{Should there be comments that explain the features of the
validation plan in the generated document? By default, this is \code{TRUE}.}

\item{overwrite}{Should a file of the same name be overwritten? By default,
this is \code{FALSE}.}

\item{quiet}{Should the function \emph{not} inform when the file is written? By
default, this is \code{FALSE}.}
}
\value{
Invisibly returns \code{TRUE} if the file has been written.
}
\description{
Generate a draft validation plan in a new .R or .Rmd file using an input data
table. Using this workflow, the data table will be scanned to learn about its
column data and a set of starter validation steps (constituting a validation
plan) will be written. It's best to use a data extract that contains at least
1000 rows and is relatively free of spurious data.

Once in the file, it's possible to tweak the validation steps to better fit
the expectations to the particular domain. While column inference is used to
generate reasonable validation plans, it is difficult to infer the acceptable
values without domain expertise. However, using \code{draft_validation()} gives
you a substantial head start on tackling data quality issues and is in any
case better than starting with an empty code editor view.
}
\section{Supported Input Tables}{

The types of data tables that are officially supported are:
\itemize{
\item data frames (\code{data.frame}) and tibbles (\code{tbl_df})
\item Spark DataFrames (\code{tbl_spark})
\item the following database tables (\code{tbl_dbi}):
\itemize{
\item \emph{PostgreSQL} tables (using \code{RPostgres::Postgres()} as the driver)
\item \emph{MySQL} tables (with \code{RMySQL::MySQL()})
\item \emph{Microsoft SQL Server} tables (via \strong{odbc})
\item \emph{BigQuery} tables (using \code{bigrquery::bigquery()})
\item \emph{DuckDB} tables (through \code{duckdb::duckdb()})
\item \emph{SQLite} tables (with \code{RSQLite::SQLite()})
}
}

Other database tables may work to varying degrees but they haven't been
formally tested (so be mindful of this when using unsupported backends with
\strong{pointblank}).
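
As a rough sketch of drafting a plan from one of the database table types
listed above, an in-memory \emph{SQLite} table could be used. This assumes that
the \strong{DBI}, \strong{RSQLite}, \strong{dplyr}, and \strong{dbplyr} packages are installed.
The table name, the \code{file_name} value, and the use of \code{tempdir()} are
illustrative choices.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(pointblank)

# Copy a small data frame into an in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

# Reference the database table; this gives a `tbl_dbi` object
mtcars_db <- dplyr::tbl(con, "mtcars")

# Draft the validation plan into a temporary directory
draft_validation(
  tbl = mtcars_db,
  file_name = "mtcars_db_validation",
  path = tempdir(),
  overwrite = TRUE
)

DBI::dbDisconnect(con)
}\if{html}{\out{</div>}}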
}

\section{Examples}{


Let's draft a validation plan for the \code{dplyr::storms} dataset.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{dplyr::storms
#> # A tibble: 19,066 x 13
#>    name   year month   day  hour   lat  long status      category  wind pressure
#>    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>          <dbl> <int>    <int>
#>  1 Amy    1975     6    27     0  27.5 -79   tropical d~       NA    25     1013
#>  2 Amy    1975     6    27     6  28.5 -79   tropical d~       NA    25     1013
#>  3 Amy    1975     6    27    12  29.5 -79   tropical d~       NA    25     1013
#>  4 Amy    1975     6    27    18  30.5 -79   tropical d~       NA    25     1013
#>  5 Amy    1975     6    28     0  31.5 -78.8 tropical d~       NA    25     1012
#>  6 Amy    1975     6    28     6  32.4 -78.7 tropical d~       NA    25     1012
#>  7 Amy    1975     6    28    12  33.3 -78   tropical d~       NA    25     1011
#>  8 Amy    1975     6    28    18  34   -77   tropical d~       NA    30     1006
#>  9 Amy    1975     6    29     0  34.4 -75.8 tropical s~       NA    35     1004
#> 10 Amy    1975     6    29     6  34   -74.8 tropical s~       NA    40     1002
#> # i 19,056 more rows
#> # i 2 more variables: tropicalstorm_force_diameter <int>,
#> #   hurricane_force_diameter <int>
}\if{html}{\out{</div>}}

The \code{draft_validation()} function creates an .R file by default. Using just
the defaults with \code{dplyr::storms} will yield the \code{"dplyr__storms.R"} file
in the working directory. Here are the contents of the file:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(pointblank)

agent <-
  create_agent(
    tbl = ~ dplyr::storms,
    actions = action_levels(
      warn_at = 0.05,
      stop_at = 0.10
    ),
    tbl_name = "dplyr::storms",
    label = "Validation plan generated by `draft_validation()`."
  ) \%>\%
  # Expect that column `name` is of type: character
  col_is_character(
    columns = vars(name)
  ) \%>\%
  # Expect that column `year` is of type: numeric
  col_is_numeric(
    columns = vars(year)
  ) \%>\%
  # Expect that values in `year` should be between `1975` and `2020`
  col_vals_between(
    columns = vars(year),
    left = 1975,
    right = 2020
  ) \%>\%
  # Expect that column `month` is of type: numeric
  col_is_numeric(
    columns = vars(month)
  ) \%>\%
  # Expect that values in `month` should be between `1` and `12`
  col_vals_between(
    columns = vars(month),
    left = 1,
    right = 12
  ) \%>\%
  # Expect that column `day` is of type: integer
  col_is_integer(
    columns = vars(day)
  ) \%>\%
  # Expect that values in `day` should be between `1` and `31`
  col_vals_between(
    columns = vars(day),
    left = 1,
    right = 31
  ) \%>\%
  # Expect that column `hour` is of type: numeric
  col_is_numeric(
    columns = vars(hour)
  ) \%>\%
  # Expect that values in `hour` should be between `0` and `23`
  col_vals_between(
    columns = vars(hour),
    left = 0,
    right = 23
  ) \%>\%
  # Expect that column `lat` is of type: numeric
  col_is_numeric(
    columns = vars(lat)
  ) \%>\%
  # Expect that values in `lat` should be between `-90` and `90`
  col_vals_between(
    columns = vars(lat),
    left = -90,
    right = 90
  ) \%>\%
  # Expect that column `long` is of type: numeric
  col_is_numeric(
    columns = vars(long)
  ) \%>\%
  # Expect that values in `long` should be between `-180` and `180`
  col_vals_between(
    columns = vars(long),
    left = -180,
    right = 180
  ) \%>\%
  # Expect that column `status` is of type: character
  col_is_character(
    columns = vars(status)
  ) \%>\%
  # Expect that column `category` is of type: factor
  col_is_factor(
    columns = vars(category)
  ) \%>\%
  # Expect that column `wind` is of type: integer
  col_is_integer(
    columns = vars(wind)
  ) \%>\%
  # Expect that values in `wind` should be between `10` and `160`
  col_vals_between(
    columns = vars(wind),
    left = 10,
    right = 160
  ) \%>\%
  # Expect that column `pressure` is of type: integer
  col_is_integer(
    columns = vars(pressure)
  ) \%>\%
  # Expect that values in `pressure` should be between `882` and `1022`
  col_vals_between(
    columns = vars(pressure),
    left = 882,
    right = 1022
  ) \%>\%
  # Expect that column `tropicalstorm_force_diameter` is of type: integer
  col_is_integer(
    columns = vars(tropicalstorm_force_diameter)
  ) \%>\%
  # Expect that values in `tropicalstorm_force_diameter` should be between
  # `0` and `870`
  col_vals_between(
    columns = vars(tropicalstorm_force_diameter),
    left = 0,
    right = 870,
    na_pass = TRUE
  ) \%>\%
  # Expect that column `hurricane_force_diameter` is of type: integer
  col_is_integer(
    columns = vars(hurricane_force_diameter)
  ) \%>\%
  # Expect that values in `hurricane_force_diameter` should be between
  # `0` and `300`
  col_vals_between(
    columns = vars(hurricane_force_diameter),
    left = 0,
    right = 300,
    na_pass = TRUE
  ) \%>\%
  # Expect entirely distinct rows across all columns
  rows_distinct() \%>\%
  # Expect that column schemas match
  col_schema_match(
    schema = col_schema(
      name = "character",
      year = "numeric",
      month = "numeric",
      day = "integer",
      hour = "numeric",
      lat = "numeric",
      long = "numeric",
      status = "character",
      category = c("ordered", "factor"),
      wind = "integer",
      pressure = "integer",
      tropicalstorm_force_diameter = "integer",
      hurricane_force_diameter = "integer"
    )
  ) \%>\%
  interrogate()

agent
}\if{html}{\out{</div>}}

This is runnable as is, and the promise is that the interrogation should
produce no failing test units. After execution, we get the following
validation report:

\if{html}{
\out{
<img src="https://raw.githubusercontent.com/rich-iannone/pointblank/main/images/man_draft_validation_1.png" alt="This image was generated from the first code example in the `draft_validation()` help file." style="width:100\%;">
}
}
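
A call along the following lines would write a draft like the one shown
above. This is only a sketch: it assumes that \strong{dplyr} is installed, and the
\code{file_name} value and the use of \code{tempdir()} are illustrative choices made
so that nothing is written to the working directory.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(pointblank)

# Write the draft to a temporary directory; `file_name` is given without
# an extension since that is determined by `output_type`
draft_validation(
  tbl = dplyr::storms,
  file_name = "storms_validation",
  path = tempdir(),
  overwrite = TRUE
)

# With the default `output_type`, the script is expected at this location
draft_file <- file.path(tempdir(), "storms_validation.R")
}\if{html}{\out{</div>}}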

All of the expressions in the resulting file constitute just a rough
approximation of what a validation plan should be for a dataset. Certainly,
the value ranges in the emitted \code{\link[=col_vals_between]{col_vals_between()}} may not be realistic for
the \code{wind} column and may require some modification (the provided \code{left} and
\code{right} values are just the limits of the provided data). However, note that
the \code{lat} and \code{long} (latitude and longitude) columns have acceptable ranges
(providing the limits of valid lat/lon values). This is thanks to
\strong{pointblank}'s column inference routines, which are able to understand what
certain columns contain.
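
As a sketch of such a modification, the emitted step for the \code{wind} column
could be replaced by one with limits chosen from domain knowledge. The \code{0}
and \code{200} limits used here are placeholder values and not a recommendation.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(pointblank)

agent <-
  create_agent(
    tbl = ~ dplyr::storms,
    tbl_name = "dplyr::storms"
  ) \%>\%
  # Hypothetical limits that leave headroom beyond the values seen so far
  col_vals_between(
    columns = vars(wind),
    left = 0,
    right = 200
  ) \%>\%
  interrogate()

agent
}\if{html}{\out{</div>}}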

For an evolving dataset that will experience changes (whether through revised
data or the addition/deletion of rows or columns), the emitted validation
will serve as a good first step, and changes can more easily be made since
there is a foundation to build from.
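
The remaining arguments can be combined to change the form of the draft. The
sketch below requests an R Markdown document with French comments. It assumes
that \strong{dplyr} is installed, and the \code{file_name} value and the use of
\code{tempdir()} are again illustrative choices.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(pointblank)

draft_validation(
  tbl = dplyr::storms,
  file_name = "storms_validation_fr",
  path = tempdir(),
  lang = "fr",          # comments for the validation steps in French
  output_type = "Rmd",  # an R Markdown document instead of an .R script
  overwrite = TRUE,     # replace a file of the same name if one exists
  quiet = TRUE          # don't report that the file was written
)
}\if{html}{\out{</div>}}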
}

\section{Function ID}{

1-11
}

\seealso{
Other Planning and Prep: 
\code{\link{action_levels}()},
\code{\link{create_agent}()},
\code{\link{create_informant}()},
\code{\link{db_tbl}()},
\code{\link{file_tbl}()},
\code{\link{scan_data}()},
\code{\link{tbl_get}()},
\code{\link{tbl_source}()},
\code{\link{tbl_store}()},
\code{\link{validate_rmd}()}
}
\concept{Planning and Prep}
