Provenance is the history of an item of data from its creation to its present state. It includes details about the steps that were executed and the intermediate values that were created in order to produce the data in its current form. For scientists, provenance can help to facilitate reproduction and validation of scientific results. But in most computer systems today, provenance is an afterthought, implemented as an auxiliary indexing structure that parallels the actual data. Our goal in this project is to design, build, and study an end-to-end system that extends all the way from original data analyses by real scientists to management and analysis of the resulting provenance in a common framework with common tools.
rdtLite is a light-weight provenance collection tool that collects provenance as an R script executes (or during a console session) and saves it in a file. The resulting provenance can be used for a wide variety of applications that include debugging scripts, cleaning code, and reproducing results.
rdtLite currently requires R version 3.6.0 (or later). rdtLite is easily installed from GitHub using devtools:
devtools::install_github("End-to-end-provenance/rdtLite")
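Before installing, it can help to confirm that the prerequisites mentioned above are in place. The following is a minimal sketch, not part of rdtLite: the helper name check_rdtlite_prereqs is hypothetical, and the install call is left commented out so that running the sketch never touches the network.

```r
# Hypothetical helper (not part of rdtLite): reports whether the
# prerequisites described above are satisfied, without installing anything.
check_rdtlite_prereqs <- function(min_version = "3.6.0") {
  list(
    r_version_ok       = getRversion() >= min_version,
    devtools_available = requireNamespace("devtools", quietly = TRUE),
    rdtlite_installed  = requireNamespace("rdtLite", quietly = TRUE)
  )
}

prereqs <- check_rdtlite_prereqs()
print(prereqs)

# Install only when R is new enough, devtools is present, and rdtLite is
# not installed yet. Uncomment to perform the installation.
# if (prereqs$r_version_ok && prereqs$devtools_available &&
#     !prereqs$rdtlite_installed) {
#   devtools::install_github("End-to-end-provenance/rdtLite")
# }
```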
Once installed, use the R library command to load rdtLite:
library(rdtLite)
To capture provenance for an R script named “my-script.R”, set the working directory to the directory containing my-script.R, load the rdtLite package (as above), and enter the following:
prov.run("my-script.R", prov.dir=".")
where “my-script.R” is an R script in the working directory. The prov.run command will execute the script and save the provenance in a subdirectory called “prov_my-script” in the current directory.
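The steps above can be tried end to end on a throwaway script. This is a sketch under stated assumptions: the two-line script and the temporary directory are made up for illustration, and the rdtLite calls run only if the package is installed. The sketch lists whatever files prov.run writes rather than assuming their names.

```r
# Work in a temporary directory so the sketch does not touch real projects.
tmp <- tempfile("prov-demo-")
dir.create(tmp)
script_path <- file.path(tmp, "my-script.R")
writeLines(c("x <- 1:10", "y <- mean(x)"), script_path)

# Directory that, according to the text above, prov.run should create.
expected_dir <- file.path(tmp, "prov_my-script")

if (requireNamespace("rdtLite", quietly = TRUE)) {
  library(rdtLite)
  old_wd <- setwd(tmp)  # the script must be in the working directory
  tryCatch({
    prov.run("my-script.R", prov.dir = ".")
    # Show what was written; the file names are not assumed here.
    print(list.files(expected_dir, recursive = TRUE))
  }, finally = setwd(old_wd))  # always restore the working directory
} else {
  message("rdtLite is not installed; skipping provenance collection.")
}
```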
rdtLite can also be used to capture provenance for commands entered in the console. To do so, enter the following:
prov.init(prov.dir=".")
and enter commands at the R console. To save the provenance, enter the following:
prov.save()
This will save the provenance collected so far to a subdirectory called “prov_console” in the current directory. To save the provenance and stop further provenance collection, enter the following:
prov.quit()
prov.save can be called many times before calling prov.quit.
In the examples above, we used the prov.dir parameter to specify where provenance should be stored. Another way to specify where to save the provenance is by specifying this in your .Rprofile file as follows:
options(prov.dir = "~/prov")
We recommend setting this option in your .Rprofile. Good choices are to either have one directory hold all the provenance, or to use “.” to store the provenance with the script it belongs to. In any case, the provenance will be saved in a subdirectory named prov_script (where script is replaced with the script’s name) or prov_console (if the provenance is for console commands).
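The naming rule described above can be written down as a small function. This is only an illustration: expected_prov_dir is a hypothetical helper, not an rdtLite function, and its fallback to “.” when the prov.dir option is unset is an assumption of this sketch rather than rdtLite's documented default.

```r
# Hypothetical helper (not part of rdtLite): predicts the provenance
# directory from the naming rule described in the text.
# Assumption: falls back to "." when the prov.dir option is not set.
expected_prov_dir <- function(script = NULL,
                              prov.dir = getOption("prov.dir", ".")) {
  name <- if (is.null(script)) {
    "console"                                     # console session
  } else {
    tools::file_path_sans_ext(basename(script))   # script name, no extension
  }
  file.path(prov.dir, paste0("prov_", name))
}

expected_prov_dir("analysis/my-script.R", prov.dir = "~/prov")
# "~/prov/prov_my-script"
expected_prov_dir(prov.dir = ".")
# "./prov_console"
```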
By default, when you collect provenance, the provenance is saved in a directory based on the name of the script (or console). This means that if you run the same script repeatedly, the provenance will be overwritten. To prevent this from happening, use the overwrite parameter in either prov.run or prov.init:
prov.run("my-script.R", overwrite = FALSE)
prov.init(overwrite = FALSE)
In this case, the provenance directory will include a timestamp, like:
prov_my-script_2019-08-21T14.06.02EDT/
prov_console_2019-08-21T14.06.02EDT/
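Once several timestamped directories accumulate, you may want to list the runs that belong to one script. The sketch below uses only base R. The helper list_prov_runs is hypothetical, the directories it scans are empty stand-ins created in a temporary location, and the second timestamp is made up for the demonstration.

```r
# Hypothetical helper (not part of rdtLite): lists provenance directories
# for one script, in both the plain and the timestamped form.
list_prov_runs <- function(script_name, prov.dir = ".") {
  dirs <- list.dirs(prov.dir, full.names = FALSE, recursive = FALSE)
  prefix <- paste0("prov_", script_name)
  sort(dirs[dirs == prefix | startsWith(dirs, paste0(prefix, "_"))])
}

# Demonstration with empty stand-in directories.
demo <- tempfile("prov-runs-")
dir.create(demo)
for (d in c("prov_my-script_2019-08-21T14.06.02EDT",
            "prov_my-script_2019-08-22T09.15.40EDT",   # made-up timestamp
            "prov_console_2019-08-21T14.06.02EDT")) {
  dir.create(file.path(demo, d))
}

runs <- list_prov_runs("my-script", prov.dir = demo)
print(runs)  # the two my-script directories; the console one is excluded
```

The prefix match is deliberately simple: a script whose name begins with “my-script_” would also be listed, so treat the result as a starting point rather than an exact filter.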
When a variable is assigned in your script or a console command, the value can be saved as part of the provenance. This can be very helpful if you use the provenance to debug your script. By default, only simple data values such as numerics, logicals, and strings are saved. To save larger values, such as data frames, tibbles, or matrices, you need to set the snapshot.size parameter:
prov.run("my-script.R", snapshot.size=1)
prov.init(snapshot.size=1)
If snapshot.size is set to something other than 0, larger data values will be saved in snapshot files. The size of each snapshot file is limited to the value specified in snapshot.size, in kilobytes. Thus, setting snapshot.size to 1 will save the head of the data value, truncating the value if it is more than 1 KB in size.
Increasing the snapshot size will allow for more thorough debugging. However, if your script makes many updates to large data structures, the slowdown can be unacceptable.
If you are only interested in collecting provenance about the computing environment, input and output files and plots, and the script source code, you can set the details parameter of prov.run to FALSE. (This parameter is not available in prov.init.)
prov.run("my-script.R", details = FALSE)
This type of provenance is useful for creating provenance summaries, but not for other purposes, such as debugging. The main advantage is that there should be minimal slowdown when executing your script.
For information on additional parameters to prov.run and prov.init, please refer to the Help page for these functions.
rdtLite can collect provenance on both R files and R Markdown files. Invoke prov.run in the same way:
prov.run("my-script.Rmd")
This will both run R Markdown to create the formatted document and collect provenance.
Alternatively, you can use prov.init, and run your R Markdown interactively, using Run Next Chunk in RStudio, for example.
There are two caveats to using prov.run with R Markdown files:
If you use random numbers, you should set the seed initially, using R’s set.seed function. If you do not do this, the provenance will not exactly match the document produced by R Markdown.
You should avoid using R Markdown’s caching feature.
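The first caveat relies on how seeding works in R: after set.seed, the same sequence of random numbers is produced each time the code is evaluated. A minimal base R illustration (the seed value 2019 is arbitrary):

```r
# Fixing the seed makes the random draws identical across evaluations.
set.seed(2019)
first_run <- runif(3)

set.seed(2019)
second_run <- runif(3)

identical(first_run, second_run)  # TRUE
```

In an R Markdown file, the set.seed call would typically go in the first chunk, before any code that draws random numbers.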
If you use the source function in your code, we recommend that you replace those calls with calls to prov.source. This allows provenance collection to occur within the sourced file. Otherwise, the source call will appear as a single statement within the provenance.
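To find the calls worth replacing, a rough text search over the script is often enough. The helper below is hypothetical and uses only base R. It matches on text rather than parsing the code, so it will also flag source calls that appear inside comments or strings.

```r
# Hypothetical helper (not part of rdtLite): returns the line numbers that
# contain a call to source(), ignoring longer names such as prov.source()
# or my_source().
find_source_calls <- function(lines) {
  pattern <- "(^|[^[:alnum:]._])source[[:space:]]*\\("
  which(grepl(pattern, lines))
}

example_lines <- c(
  'source("helpers.R")',        # flagged
  'prov.source("helpers.R")',   # already converted, not flagged
  'x <- 1',
  '  source( "plots.R")',       # flagged
  'my_source("other.R")'        # different function, not flagged
)

hits <- find_source_calls(example_lines)
print(hits)  # lines 1 and 4
```

For a real script, pass the file contents instead, for example find_source_calls(readLines("my-script.R")).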
Having collected provenance, you may wonder what you can do with it. We have some tools that use the provenance and are available at https://github.com/End-to-end-provenance:
provSummarizeR provides a textual summary of the provenance identifying input and output files, libraries used, the version of R, the computing platform, and other useful information.
provViz provides a visualization of the provenance that allows you to move through the history of your script to see how values were computed and what the intermediate values are. provViz requires Java to be installed.
provExplainR compares provenance from two executions of a script to see what has changed. This can be helpful if a script mysteriously stops working, or if you share a script with a colleague and it does not work for them. For example, it could indicate that you are using different library versions, the input data has changed, or a variety of other causes for changed behavior.
If you have any problems, questions, or suggestions, please let us know at https://github.com/End-to-end-provenance/rdtLite/issues.