--- title: "Introduction to orderly2" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to orderly2} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} source("common.R") ``` This vignette provides a how-to style introduction to `orderly2`, an overview of key ingredients to writing orderly reports, and a summary of key features and ideas. It may be useful to look at `vignette("orderly2")` for a more roundabout discussion of what `orderly2` is trying to achieve, or `vignette("migrating")` if you are familiar with version 1 of orderly as this explains concepts in terms of differences from the previous version. # Installation ```r install.packages( "orderly2", repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org")) ``` # Creating an empty orderly repository The first step is to initialise an empty `orderly2` repository. An `orderly2` repository is a directory with the file `orderly_config.yml` within it, and since version 2 also a directory `.outpack/`. Files within the `.outpack/` directory should never be directly modified by users and this directory should be excluded from version control (see `orderly2::orderly_gitignore_update`). Create an orderly2 repository by calling `orderly2::orderly_init()`: ```{r} path <- tempfile() # we'll use a temporary directory here - see note below orderly2::orderly_init(path) ``` which creates a few files: ```{r, echo = FALSE} dir_tree(path, all = TRUE) ``` This step should be performed on a completely empty directory, otherwise an error will be thrown. Later, you will re-initialise an `orderly2` repository when cloning to a new machine, such as when working with others; this is discussed in `vignette("collaboration")`. The `orderly_config.yml` file contains very little by default: ```{r, echo = FALSE, results = "asis"} yaml_output(readLines(file.path(path, "orderly_config.yml"))) ``` For this vignette, the created orderly root is in R's per-session temporary directory, which will be deleted once R exits. If you want to use a directory that will persist across restarting R (which you would certainly want when using `orderly2` on a real project!) you should replace this with a path within your home directory, or other location that you control. For the rest of the vignette we will evaluate commands from within this directory, by changing the directory to the path we've created: ```r setwd(path) ``` # Creating your first orderly report An orderly report is a directory `src/` containing an orderly file `.R`. That file may have special commands in it, but for now we'll create one that is as simple as possible; we'll create some random data and save it to disk. This seems silly, but imagine this standing in for something like: * downloading file from some external site or resource * running a simulation and saving output * fitting a model to data * merging some set of files together to create a final data set ```{r, include = FALSE} fs::dir_create(file.path(path, "src", "incoming_data")) write.csv(data.frame(x = 1:10, y = 1:10 + rnorm(10)), file.path(path, "src", "incoming_data", "data.csv"), row.names = FALSE) writeLines(c( 'd <- read.csv("data.csv")', "d$z <- resid(lm(y ~ x, d))", 'saveRDS(d, "data.rds")'), file.path(path, "src", "incoming_data", "incoming_data.R")) ``` Our directory structure (ignoring `.outpack`) looks like: ```{r, echo = FALSE} dir_tree(path, all = FALSE) ``` and `src/incoming_data/incoming_data.R` contains: ```{r, echo = FALSE, results = "asis"} r_output(readLines(file.path(path, "src/incoming_data/incoming_data.R"))) ``` To run the report and create a new **packet**, use `orderly2::orderly_run()`: ```{r, orderly_root = path} id <- orderly2::orderly_run("incoming_data") id ``` The `id` that is created is a new identifier for the packet that will be both unique among all packets (within reason) and chronologically sortable. A packet that has an id that sorts after another packet's id was started before that packet. Having run the report, our directory structure looks like: ```{r, echo = FALSE} dir_tree(path, all = FALSE) ``` A few things have changed here: * we have a directory `r paste0("archive/incoming_data/", id)`; this directory contains - the file that was created when we ran the report (`data.rds`; see the script above) - a log of what happened when the report was run and the packet was created - `incoming_data.R` and `data.csv`, the original input that have come from our source tree * there is an empty directory `draft/incoming_data` which was created when orderly ran the report in the first place In addition, quite a few files have changed within the `.outpack` directory, but these are not covered here. That's it! Notice that the initial script is just a plain R script, and you can develop it interactively from within the `src/incoming_data` directory. Note however, that any paths referred to within will be relative to `src/incoming_data` and **not** the orderly repository root. This is important as all reports only see the world relative to their `incoming_data.R` file. Once created, you can then refer to this report by id and pull its files wherever you need them, both in the context of another orderly report or just to copy to your desktop to email someone. For example, to copy the file `data.rds` that we created to some location outside of orderly's control you could do ```{r, orderly_root = path} dest <- tempfile() fs::dir_create(dest) orderly2::orderly_copy_files(id, files = c("final.rds" = "data.rds"), dest = dest) ``` which copies `data.rds` to some new temporary directory `dest` with name `final.rds`. This uses `orderly2`'s `outpack_` functions, which are designed to interact with outpack archives regardless of how they were created (`orderly2` is a program that creates `outpack` archives). Typically these are lower-level than `orderly_` functions. # Depending on packets from another report Creating a new dataset is mostly useful if someone else can use it. To do this we introduce the first of the special orderly commands that you can use from an orderly file ```{r, include = FALSE} fs::dir_create(file.path(path, "src", "analysis")) writeLines(c( 'orderly2::orderly_dependency("incoming_data", "latest()",', ' c("incoming.rds" = "data.rds"))', 'd <- readRDS("incoming.rds")', 'png("analysis.png")', "plot(y ~ x, d)", "dev.off()"), file.path(path, "src", "analysis", "analysis.R")) ``` The `src/` directory now looks like: ```{r, echo = FALSE} dir_tree(file.path(path), "src") ``` and `src/analysis/analysis.R` contains: ```{r, echo = FALSE, results = "asis"} r_output(readLines(file.path(path, "src/analysis/analysis.R"))) ``` Here, we've used `orderly2::orderly_dependency()` to pull in the file `data.rds` from the most recent version (`latest()`) of the `data` packet with the filename `incoming.rds`, then we've used that file as normal to make a plot, which we've saved as `analysis.png`. We can run this just as before, using `orderly2::orderly_run()`: ```{r, orderly_root = path} id <- orderly2::orderly_run("analysis") ``` For more information on dependencies, see `vignette("dependencies")`. # Available in-report orderly commands The function `orderly2::orderly_dependency()` is designed to operate while the packet runs. These functions all act by adding metadata to the final packet, and perhaps by copying files into the directory. * `orderly2::orderly_description()`: Provide a longer name and description for your report; this can be reflected in tooling that uses orderly metadata to be much more informative than your short name. * `orderly2::orderly_parameters()`: Declares parameters that can be passed in to control the behaviour of the report. Parameters are key-value pairs of simple data (booleans, numbers, strings) which your report can respond to. They can also be used in queries to `orderly2::orderly_dependency()` to find packets that satisfy some criteria. * `orderly2::orderly_resource()`: Declares that a file is a *resource*; a file that is an input to the the report, and which comes from this source directory. By default, orderly treats all files in the directory as a resource, but it can be useful to mark these explicitly, and necessary to do so in "strict mode" (see below). Files that have been marked as a resource are **immutable** and may not be deleted or modified. * `orderly2::orderly_shared_resource()`: Copies a file from the "shared resources" directory `shared/`, which can be data files or source code located at the root of the orderly repository. This can be a reasonable way of sharing data or commonly used code among several reports. * `orderly2::orderly_artefact()`: Declares that a file (or set of files) will be created by this report, before it is even run. Doing this makes it easier to check that the report behaves as expected and can allow reasoning about what a related set of reports will do without running them. By declaring something as an artefact (especially in conjunction with "strict mode") it is also easier to clean up `src` directories that have been used in interactive development (see below). * `orderly2::orderly_dependency()`: Copy files from one packet into this packet as it runs, as seen above. * `orderly2::orderly_strict_mode()`: Declares that this report will be run in "strict mode" (see below). In addition, there is also a function `orderly::orderly_run_info()` that can be used while running a report that returns information about the currently running report (its id, resolved dependencies etc). Let's add some additional annotations to the previous reports: ```{r, include = FALSE} code_data <- readLines( file.path(path, "src", "incoming_data", "incoming_data.R")) writeLines(c( "orderly2::orderly_strict_mode()", 'orderly2::orderly_resource("data.csv")', 'orderly2::orderly_artefact("Processed data", "data.rds")', "", code_data), file.path(path, "src", "incoming_data", "incoming_data.R")) ``` ```{r, echo = FALSE, results = "asis"} r_output(readLines(file.path(path, "src/incoming_data/incoming_data.R"))) ``` Here, we've added a block of special orderly commands; these could go anywhere, for example above the files that they refer to. If strict mode is enabled (see below) then `orderly2::orderly_resource` calls must go before the files are used as they will only be made available at that point (see below). ```{r, orderly_root = path} id <- orderly2::orderly_run("incoming_data") ``` # Parameterised reports Much of the flexibility that comes from the orderly graph comes from using parameterised reports; these are reports that take a set of parameters and then change behaviour based on these parameters. Downstream reports can depend on a parameterised report and filter based on suitable parameters. For example, consider a simple report where we generate samples based on some parameter: ```{r, include = FALSE} fs::dir_create(file.path(path, "src", "random")) writeLines(c( "orderly2::orderly_parameters(n_samples = 10)", "x <- seq_len(n_samples)", "d <- data.frame(x = x, y = x + rnorm(n_samples))", 'saveRDS(d, "data.rds")'), file.path(path, "src", "random", "random.R")) ``` ```{r, echo = FALSE, results = "asis"} r_output(readLines(file.path(path, "src/random/random.R"))) ``` This creates a report that has a single parameter `n_samples` with a default value of 10. We could have used ```r orderly2::orderly_parameters(n_samples = NULL) ``` to define a parameter with no default, or defined multiple parameters with ```r orderly2::orderly_parameters(n_samples = 10, distribution = "normal") ``` You can do anything in your report that switches on the value of a parameter: * You might read different URLs to fetch different underlying data * You might fit a different analysis * You might read different shared resources (see below) * You might depend on different dependencies * You might produce different artefacts However, you should see parameters as relatively heavyweight things and try to have a consistent set over all packets created from a report. In this report we use it to control the size of the generated data set. ```{r, orderly_root = path} id <- orderly2::orderly_run("random", list(n_samples = 15)) ``` Our resulting file has 15 rows, as the parameter we passed in affected the report: ```{r, orderly_root = path} orderly2::orderly_copy_files(id, files = c("random.rds" = "data.rds"), dest = dest) readRDS(file.path(dest, "random.rds")) ``` You can use these parameters in orderly's search functions. For example we can find the most recent version of a packet by running: ```{r, orderly_root = path} orderly2::orderly_search('latest(name == "random")') ``` But we can also pass in parameter queries here: ```{r, orderly_root = path} orderly2::orderly_search('latest(name == "random" && parameter:n_samples > 10)') ``` These can be used within `orderly2::orderly_dependency()` (the `name == "random"` part is implied by the first `name` argument), for example ```r orderly2::orderly_dependency("random", "latest(parameter:n_samples > 10)", c("randm.rds" = "data.rds")) ``` In this case if the report that you are querying *from* also has parameters you can use these within the query, using the `this` prefix. So suppose our downstream report simply uses `n` for the number of samples we might write: ```r orderly2::orderly_dependency("random", "latest(parameter:n_samples == this:n)", c("randm.rds" = "data.rds")) ``` to depend on the most recent packet called `random` where it has a parameter `n_samples` which has the same value as the current report's parameter `n`. See the outpack query documentation for much more detail on this. # Shared resources Sometimes it is useful to share data between different reports, for example some common source utilities that don't warrant their own package, or some common data. To do this, create a directory `shared` at the orderly root and put in it any files or directories you might want to share. Suppose our shared directory contains a file `data.csv`: ```{r, echo = FALSE} fs::dir_create(file.path(path, "shared")) write.csv(data.frame(x = 1:10, y = runif(10)), file.path(path, "shared/data.csv")) dir_tree(path) ``` We can then write an orderly report `use_shared` that uses this shared file, with its `use_shared.R` containing: ```{r, echo = FALSE, results = "asis"} fs::dir_create(file.path(path, "src", "use_shared")) writeLines(c( 'orderly2::orderly_shared_resource("data.csv")', 'orderly2::orderly_artefact("analysis", "analysis.png")', "", 'd <- read.csv("data.csv")', 'png("analysis.png")', "plot(y ~ x, d)", "dev.off()"), file.path(path, "src", "use_shared", "use_shared.R")) r_output(readLines(file.path(path, "src/use_shared/use_shared.R"))) ``` We can run this: ```{r, orderly_root = path} id <- orderly2::orderly_run("use_shared") ``` In the resulting archive, the file that was used from the shared directory is present: ```{r, echo = FALSE} dir_tree(path, "archive/use_shared") ``` This is a general property of orderly: it tries to save all the inputs alongside the final results of the analysis, so that later on you can check to see what went into an analysis and what might have changed between versions. # Strict mode The previous version of orderly (`orderly1`; see `vignette("migrating")`) was very fussy about all input being strictly declared before a report could be run, so that it was clear what was really required in order to run something. From version 2 this is relaxed by default, but you can opt into most of the old behaviours and checks by adding ```r orderly2::orderly_strict_mode() ``` anywhere within your orderly file (conventionally at the top). We may make this more granular in future, but by adding this we: * only copy files from the source directory (`src//`) to the draft directory where the report runs (`draft//`) that were declared with `orderly2::orderly_resource`; this leaves behind any extra files left over in development * warn at the end of running a packet if any files are found that are not part of an artefact Using strict mode also helps `orderly2` clean up the `src/` directory more effectively after interactive development (see next section). # Interactive development Set your working directory to `src/` and any orderly script should be fully executable (e.g., source with Rstudio's `Source` button, or R's `source()` function). Dependencies will be copied over as needed. After doing this, you will have a mix of files within your source directory. We recommend a per-source-directory `.gitignore` which will keep these files out of version control (see below). We will soon implement support for cleaning up generated files from this directory. For example, suppose that we have interactively run our `incoming_data/incoming_data.R` script, we would leave behind generated files. We can report on this with `orderly2::orderly_cleanup_status`: ```{r, include = FALSE} withr::with_dir(file.path(path, "src/incoming_data"), sys.source("incoming_data.R", new.env(parent = .GlobalEnv))) ``` ```{r, orderly_root = path} orderly2::orderly_cleanup_status("incoming_data") ``` If you have files here that are unknown to orderly it will tell you about them and prompt you to tell it about them explicitly. You can clean up generated files by running (as suggested in the message): ```{r, orderly_root = path} orderly2::orderly_cleanup("incoming_data") ``` There is a `dry_run = TRUE` argument you can pass if you want to see what would be deleted without using the status function. You can also keep these files out of git by using the `orderly2::orderly_gitignore_update` function: ```{r, orderly_root = path} orderly2::orderly_gitignore_update("incoming_data") ``` This creates (or updates) a `.gitignore` file within the report so that generated files will not be included by git. If you have already accidentally committed them then the gitignore has no real effect and you should do some git surgery, see the git manuals or this [handy, if profane, guide](https://ohshitgit.com/). # Deleting things from the archive If you delete packets from your `archive/` directory then this puts `orderly2` into an inconsistent state with its metadata store. Sometimes this does not matter (e.g., if you delete old copies that would never be candidates for inclusion with `orderly2::orderly_dependency` you will never notice). However, if you delete the most recent copy of a packet and then try and depend on it, you will get an error. At the moment, we have two copies of the `incoming_data` task: ```{r, orderly_root = path} orderly2::orderly_metadata_extract( name = "incoming_data", extract = c(time = "time.start")) ``` ```{r include = FALSE, orderly_root = path} id_latest <- orderly2::orderly_search("latest", name = "incoming_data") fs::dir_delete(file.path(path, "archive", "incoming_data", id_latest)) ``` When we run the `analysis` task, it will pull in the most recent version (`r inline(id_latest)`). However, if you had deleted this manually (e.g., to save space or accidentally) or corrupted it (e.g., by opening some output in Excel and letting it save changes) it will not be able to be included, and running `analysis` will fail: ```{r error = TRUE, orderly_root = path} orderly2::orderly_run("analysis") ``` The error here tries to be fairly informative, telling us that we failed because when copying files from `r inline(id_latest)` we found that the packet was corrupt, because the file `data.rds` was not found in the archive. It also suggests a fix; we can tell `orderly2` that `r inline(id_latest)` is "orphaned" and should not be considered for inclusion when we look for dependencies. We can carry out the suggestion and just validate this packet by running ```{r echo = FALSE, results = "asis"} r_output( sprintf('orderly2::orderly_validate_archive("%s", action = "orphan")', id_latest)) ``` or we can validate *all* the packets we have: ```{r, orderly_root = path} orderly2::orderly_validate_archive(action = "orphan") ``` If we had the option `core.require_complete_tree` enabled, then this process would also look for any packets that used our now-deleted packet and orphan those too, as we no longer have a complete tree that includes them. If you want to remove references to the orphaned packets, you can use `orderly2::orderly_prune_orphans()` to remove them entirely: ```{r, orderly_root = path} orderly2::orderly_prune_orphans() ``` # Debugging and coping with errors (To be written) # Interaction with version control Some guidelines: Make sure to exclude some files from `git` by listing them in `.gitignore`: - `.outpack/` - nothing in here is suitable for version control - `archive/` - if you have `core.archive_path` set to a non-null value, this should be excluded. The default is `archive` - `draft/` - the temporary draft directory - `orderly_envir.yml` - used for setting machine-specific configuration You absolutely should version control some files: - `src/` the main source of your analyses - `orderly_config.yml` - this high level configuration is suitable for sharing - Any shared resource directory (configured in `orderly_config.yml`) should probably be version controlled Your source repository will end up in multiple people's machines, each of which are configured differently. The configuration option set via `orderly2::orderly_config_set` are designed to be (potentially) different for different users, so this configuration needs to be not version controlled. It also means that reports/packets can't directly refer to values set here. This includes the directory used to save archive packets at (if enabled) and the names of locations (equivalent to git remotes). You may find it useful to include scripts that help users set up common locations, but like with git, different users may use different names for the same remote (e.g., one user may have a location called `data` while for another it is called `data-incoming`, depending on their perspective about the use of the location). `orderly2` will always try and save information about the current state of the git source repository alongside the packet metadata. This includes the current branch, commit (sha) and remote url. This is to try and create links between the final version of the packet and the upstream source repository. # Interaction with the outpack store As alluded to above, the `.outpack` directory contains lots of information about packets that have been run, but is typically "out of bounds" for normal use. This is effectively the "database" of information about packets that have been run. Understanding how this directory is structured is not required for using orderly, but is included here for the avoidance of mystery! See the outpack documentation (`vignette("outpack")`) for more details about the ideas here. After all the work above, our directory structure looks like: ```{r, echo = FALSE} dir_tree(path, ".outpack", all = TRUE) ``` As can be perhaps inferred from the filenames, the files `.outpack/metadata/` are the metadata for each packet as it has been run. The files `.outpack/location//` holds information about when the packet was first known about by a location (here the location is the special "local" location). The default orderly configuration is to store the final files in a directory called `archive/`, but alternatively (or additionally) you can use a [content- addressable](https://en.wikipedia.org/wiki/Content-addressable_storage) file store. With this enabled, the `.outpack` directory looks like: ```{r, echo = FALSE, orderly_root = path} orderly2::orderly_config_set(core.use_file_store = TRUE) dir_tree(path, ".outpack", all = TRUE) ``` The files under `.outpack/files/` should never be modified or deleted. This approach to storage naturally deduplicates the file archive, so that a large file used in many places is only ever stored once. # Relationship between `orderly` and `outpack` The `orderly2` package is built on a metadata and file storage system called `outpack`; we will be implementing support for working with these metadata archives in other languages (see [`outpack_server`](https://github.com/mrc-ide/outpack_server) for our server implementation in Rust and [`outpack-py`](https://github.com/mrc-ide/outpack-py) in Python). The metadata is discussed in more detail in `vignette("metadata")` and we will document the general ideas more fully at [`mrc-ide/outpack`](https://github.com/mrc-ide/outpack)