orderly2

orderly2 is a package designed with two complementary goals in mind:

  1. to make analyses reproducible without significant effort on the part of the analyst
  2. to make it easy to collaborate on analyses by allowing easy sharing of artefacts from an analysis amongst a group of analysts.

In this vignette we will expand on these two aims, and show that the first one is a prerequisite for the second. The second is more interesting though and we start there. If you just want to get started using orderly2, you might prefer vignette("introduction") and if you are already familiar with version 1 you might prefer vignette("migrating").

Collaborative analysis

Many analyses only involve a single person and a single machine; in this case there are any number of workflow tools that will make orchestrating this analysis easy. In a workflow model you have a graph of dependencies over your analysis, over which data flows. So for example you might have

[raw data] -> [processed data] -> [model fits] -> [forecasts] -> [report]

If you update the data, the whole pipeline should rerun. But if you update the code for the forecast, then only the forecasts and report should rerun.

In our experience, this model works well for a single-user setting but falls over in a collaborative setting, especially where the analysis is partitioned by person; so Alice is handling the data pipeline, Bob is running fits, while Carol is organising forecasts and the final report. In this context, changes upstream affect downstream analysis, and require the same sort of care around integration as you might be used to with version controlling source code.

For example, if Alice is dealing with a change in the incoming data format which is going to break the analysis at the same time that Bob is trying to get the model fits working, Bob should not be trying to integrate both his code changes and Alice’s new data. We typically deal with this for source code by using branches within git; for the code Bob would work on a branch that is isolated from Alice’s changes. But in most contexts like this you will not have (and should not have) the data and analysis products in git. What is needed is a way of versioning the outputs of each step of analysis and controlling when these are integrated into subsequent analyses.

Another way of looking at the problem is that we seek a way of making analysis composable in the same way that functions and OOP achieve for programs, or the way that docker and containerisation have achieved for deploying software. To do this we need a way of putting interfaces around pieces of analysis and to allow people to refer to them and fetch them from somewhere where they have been run.

The conceptual pieces that are needed here are:

  • some way of referring unambiguously (and globally) to a piece of analysis so that it can be depended upon, and so that everyone can agree they’re talking about the same piece of analysis
  • a system of storage so that results of running analysis can be shared among a group
  • a system to control how integration of these pieces of analysis takes place

We refer to a transportable unit of analysis as a “packet”. This conceptually is a directory of files created by running some code, and is our atomic unit of work from the point of view of orderly. Each packet has an underlying source form, which anyone can run. However, most of the time people will use pre-run packets that they or their collaborators have run as inputs to onward analyses (see vignette("dependencies") and vignette("collaboration") for more details).

Reproducible analyses

Any degree of collaboration in the style above requires reproducibility, but there are several aspects of this.

With the system we describe here, even though everyone can typically run any step of an analysis, they typically don’t. This differs from workflow tools, which users may be familiar with.

Difference from a workflow system

Workflow systems have been hugely influential in scientific computing, from people co-opting build systems like make through to sophisticated systems designed for parallel running of large and complex workflows such as nextflow. The general approach is to define interdependencies among parts of an analysis, forming a graph over parts of an analysis and track inputs and outputs through the workflow.

This model of computation has lots of good points:

  • it defines an interface over an analysis and allows (and encourages) breaking a monolithic analysis into component pieces which can be reasoned about
  • it allows high-level parallelism by making obvious parts of the workflow that can be run concurrently
  • by tracking the way data flows through an analysis, it allows the minimum amount of recalculation to be done on change, with only downstream parts triggered
  • with a shared workspace or online runner, allows a degree of collaboration so long as everyone is happy to be working with a constantly changing set of code and analysis artefacts

We have designed orderly2 for working patterns that do not suit the above. Some motivating reasons include:

  • Some nodes in the computational graph are very expensive to compute or require exotic hardware.
  • Workflows where the upstream data never settle, but we need to know which version of the data ends up used in a particular analysis
  • Workflows where upstream analyses are used in many downstream analyses, and where the upstream developer may not know much about the downstream use
  • Nondeterministic analyses, e.g., those involving stochastic simulations, where rerunning a node is not expected to return the exact same numerical results, and so we can’t rely on different users recovering the same results on different occasions[*]

In all these cases the missing piece we need is a way of versioning the nodes within the computational graph, and shifting the emphasis from automatically rerunning portions of the graph to tracking how data has flowed through the graph. This in turn shifts the reproducibility emphasis from “everyone will run the same code and get the same results” to “everyone could run the same code, but will instead work with the results”.

For those familiar with docker, our approach is similar to working with pre-built docker images, whereas the workflow approach is more similar to working directly with Dockerfiles; in many situations the end result is the same, but the approaches differ in guarantees, in where the computation happens, and in how users refer to versions.

[*] We discourage trying to force determinism by manually setting seeds, as this has the potential to violate the statistical properties of random number streams, and is fragile at best.

What is reproducibility anyway?

Reproducibility means different things to different people, even within the narrow sense of “rerun an analysis and retrieve the same results”. In the last decade, the idea that one should be able to rerun a piece of analysis and retrieve the same results has slightly morphed into one must rerun a piece of analysis. Similarly, the emphasis on the utility of reproducibility has shifted from authors being able to rerun their own work (or have confidence that they could rerun it) to some hypothetical third party wanting to rerun an analysis.

Our approach flips the perspective around a bit, based on our experiences with collaborative research projects, and draws from an (overly) ambitious aim we had:

Can we prove that a given set of inputs produced a given set of outputs?

We quickly found that this was impossible, but provided a few systems were in place one could be satisfied with this statement to a given level of trust in a system. So if a piece of analysis comes from a server where the primary way people run analyses is through our web front-end (currently OrderlyWeb, soon to be Packit) we know that the analysis was run end-to-end with no modification and that orderly2 preserves inputs alongside outputs so the files that are present in the final packet were the files that went into the analysis, and the recorded R and package versions were the full set that were used.

Because this system naturally involves running on multiple machines (typically we will have the analysts’ laptops, a server and perhaps an HPC environment), and because of the way that orderly2 treats paths, practically there is very little problem getting analyses working in multiple places, trivially satisfying the typical reproducibility aim, even though it is not what people are typically focussed on.

This shift in focus has proved valuable. In any analysis that is run on more than one occasion (e.g., regular reporting, or simply updating a figure for a final submission of a manuscript after revision), the outputs may change. Understanding why these changes have happened is important. Because orderly2 automatically saves a lot of metadata about what was run it is easy to find out why things might have changed. Further, you can start interrogating the graph among packets to find out what effect that change has had; so find all the previously run packets that pulled in the old version of a data set, or that used the previous release of a package.