This vignette provides a how-to style introduction to orderly2, an overview of key ingredients to writing orderly reports, and a summary of key features and ideas. It may be useful to look at vignette("orderly2") for a more roundabout discussion of what orderly2 is trying to achieve, or vignette("migrating") if you are familiar with version 1 of orderly, as this explains concepts in terms of differences from the previous version.
The first step is to initialise an empty orderly2 repository. An orderly2 repository is a directory with the file orderly_config.yml within it, and since version 2 also a directory .outpack/. Files within the .outpack/ directory should never be directly modified by users and this directory should be excluded from version control (see orderly2::orderly_gitignore_update).
Create an orderly2 repository by calling orderly2::orderly_init():
path <- tempfile() # we'll use a temporary directory here - see note below
orderly2::orderly_init(path)
## ✔ Created orderly root at '/tmp/RtmpAzfPqu/filefd262172815'
which creates a few files:
## .
## ├── .outpack
## │ ├── config.json
## │ ├── location
## │ └── metadata
## └── orderly_config.yml
This step should be performed on a completely empty directory, otherwise an error will be thrown. Later, you will re-initialise an orderly2 repository when cloning to a new machine, such as when working with others; this is discussed in vignette("collaboration").
The orderly_config.yml file contains very little by default:
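One way to look at the file yourself is to print it from R. This is a minimal sketch that assumes `path` still holds the root created by orderly2::orderly_init() above; it uses only base R and makes no claim about what the file contains in your version of orderly2.

```r
# Assumes `path` is the orderly root created with orderly2::orderly_init(path)
config_file <- file.path(path, "orderly_config.yml")

# Print the raw contents of the configuration file to the console
writeLines(readLines(config_file))
```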
For this vignette, the created orderly root is in R’s per-session
temporary directory, which will be deleted once R exits. If you want to
use a directory that will persist across restarting R (which you would
certainly want when using orderly2
on a real project!) you
should replace this with a path within your home directory, or other
location that you control.
For the rest of the vignette we will evaluate commands from within this directory, by changing the directory to the path we’ve created:
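A minimal sketch of that step, assuming `path` is the root created above:

```r
# Assumes `path` holds the orderly root created with orderly2::orderly_init()
# Later commands that omit a root argument are then resolved against it
setwd(path)
```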
An orderly report is a directory src/<name> containing an orderly file <name>.R. That file may have special commands in it, but for now we’ll create one that is as simple as possible; we’ll create some random data and save it to disk. This seems silly, but imagine this standing in for something more substantial.
Our directory structure (ignoring .outpack) looks like:
## .
## ├── orderly_config.yml
## └── src
## └── incoming_data
## ├── data.csv
## └── incoming_data.R
and src/incoming_data/incoming_data.R
contains:
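Judging by the commands echoed in the run log further down, the script amounts to these three lines (a reconstruction from that log, not a verbatim copy of the file):

```r
# Read the raw input that sits next to the script
d <- read.csv("data.csv")
# Add the residuals from a simple linear fit of y on x as a new column
d$z <- resid(lm(y ~ x, d))
# Save the processed data; this is the file later packets will depend on
saveRDS(d, "data.rds")
```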
To run the report and create a new packet, use orderly2::orderly_run():
id <- orderly2::orderly_run("incoming_data")
## ℹ Starting packet 'incoming_data' `20241106-132352-6b446405` at 2024-11-06 13:23:52.424068
## > d <- read.csv("data.csv")
## > d$z <- resid(lm(y ~ x, d))
## > saveRDS(d, "data.rds")
## ✔ Finished running 'incoming_data.R'
## ℹ Finished 20241106-132352-6b446405 at 2024-11-06 13:23:52.449614 (0.0255456 secs)
id
## [1] "20241106-132352-6b446405"
The id that is created is a new identifier for the packet that will be both unique among all packets (within reason) and chronologically sortable. A packet that has an id that sorts after another packet’s id was started after that packet.
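As a small base-R sketch of what "chronologically sortable" means in practice, the two ids below are taken from runs of incoming_data in this vignette. Plain string sorting puts the packet that was started first at the front.

```r
# Two ids from this vignette, deliberately given in the "wrong" order
ids <- c("20241106-132352-b404086e", "20241106-132352-6b446405")

# method = "radix" sorts in the C locale, so the result does not depend on
# the collation settings of the current session
sorted <- sort(ids, method = "radix")

sorted[[1]]  # the packet that was started first: "20241106-132352-6b446405"
```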
Having run the report, our directory structure looks like:
## .
## ├── archive
## │ └── incoming_data
## │ └── 20241106-132352-6b446405
## │ ├── data.csv
## │ ├── data.rds
## │ └── incoming_data.R
## ├── draft
## │ └── incoming_data
## ├── orderly_config.yml
## └── src
## └── incoming_data
## ├── data.csv
## └── incoming_data.R
A few things have changed here:
* we have a new directory archive/incoming_data/20241106-132352-6b446405, which contains the file that was created when we ran the report (data.rds; see the script above) together with incoming_data.R and data.csv, the original inputs that have come from our source tree
* there is an empty directory draft/incoming_data, which was created when orderly ran the report in the first place
In addition, quite a few files have changed within the .outpack directory, but these are not covered here.
That’s it! Notice that the initial script is just a plain R script, and you can develop it interactively from within the src/incoming_data directory. Note, however, that any paths referred to within will be relative to src/incoming_data and not the orderly repository root. This is important as all reports only see the world relative to their orderly file (here, incoming_data.R).
Once created, you can then refer to this report by id and pull its files wherever you need them, either in the context of another orderly report or just to copy to your desktop to email someone. For example, to copy the file data.rds that we created to some location outside of orderly’s control you could do:
dest <- tempfile()
fs::dir_create(dest)
orderly2::orderly_copy_files(id, files = c("final.rds" = "data.rds"),
dest = dest)
which copies data.rds to some new temporary directory dest with name final.rds. This uses orderly2’s outpack_ functions, which are designed to interact with outpack archives regardless of how they were created (orderly2 is a program that creates outpack archives). Typically these are lower-level than orderly_ functions.
Creating a new dataset is mostly useful if someone else can use it. To do this we introduce the first of the special orderly commands that you can use from an orderly file.
The src/ directory now looks like:
## src
## ├── analysis
## │ └── analysis.R
## └── incoming_data
## ├── data.csv
## └── incoming_data.R
and src/analysis/analysis.R
contains:
orderly2::orderly_dependency("incoming_data", "latest()",
c("incoming.rds" = "data.rds"))
d <- readRDS("incoming.rds")
png("analysis.png")
plot(y ~ x, d)
dev.off()
Here, we’ve used orderly2::orderly_dependency() to pull in the file data.rds from the most recent version (latest()) of the incoming_data packet, copying it in under the filename incoming.rds. We’ve then used that file as normal to make a plot, which we’ve saved as analysis.png.
We can run this just as before, using orderly2::orderly_run():
id <- orderly2::orderly_run("analysis")
## ℹ Starting packet 'analysis' `20241106-132352-9352c197` at 2024-11-06 13:23:52.580375
## > orderly2::orderly_dependency("incoming_data", "latest()",
## + c("incoming.rds" = "data.rds"))
## ℹ Depending on incoming_data @ `20241106-132352-6b446405` (via latest(name == "incoming_data"))
## > d <- readRDS("incoming.rds")
## > png("analysis.png")
## > plot(y ~ x, d)
## > dev.off()
## png
## 2
## ✔ Finished running 'analysis.R'
## ℹ Finished 20241106-132352-9352c197 at 2024-11-06 13:23:52.637803 (0.05742741 secs)
For more information on dependencies, see vignette("dependencies").
The function orderly2::orderly_dependency() is one of several functions designed to operate while the packet runs. These functions all act by adding metadata to the final packet, and perhaps by copying files into the directory.
* orderly2::orderly_description(): Provide a longer name and description for your report; this can be reflected in tooling that uses orderly metadata to be much more informative than your short name.
* orderly2::orderly_parameters(): Declares parameters that can be passed in to control the behaviour of the report. Parameters are key-value pairs of simple data (booleans, numbers, strings) which your report can respond to. They can also be used in queries to orderly2::orderly_dependency() to find packets that satisfy some criteria.
* orderly2::orderly_resource(): Declares that a file is a resource; a file that is an input to the report, and which comes from this source directory. By default, orderly treats all files in the directory as a resource, but it can be useful to mark these explicitly, and necessary to do so in “strict mode” (see below). Files that have been marked as a resource are immutable and may not be deleted or modified.
* orderly2::orderly_shared_resource(): Copies a file from the “shared resources” directory shared/, located at the root of the orderly repository; these can be data files or source code. This can be a reasonable way of sharing data or commonly used code among several reports.
* orderly2::orderly_artefact(): Declares that a file (or set of files) will be created by this report, before it is even run. Doing this makes it easier to check that the report behaves as expected and can allow reasoning about what a related set of reports will do without running them. By declaring something as an artefact (especially in conjunction with “strict mode”) it is also easier to clean up src directories that have been used in interactive development (see below).
* orderly2::orderly_dependency(): Copy files from one packet into this packet as it runs, as seen above.
* orderly2::orderly_strict_mode(): Declares that this report will be run in “strict mode” (see below).
In addition, there is also a function orderly2::orderly_run_info() that can be used while running a report that returns information about the currently running report (its id, resolved dependencies etc).
Let’s add some additional annotations to the previous reports:
orderly2::orderly_strict_mode()
orderly2::orderly_resource("data.csv")
orderly2::orderly_artefact("Processed data", "data.rds")
d <- read.csv("data.csv")
d$z <- resid(lm(y ~ x, d))
saveRDS(d, "data.rds")
Here, we’ve added a block of special orderly commands at the top of the file; they could go anywhere, for example directly above the code that uses the files they refer to. If strict mode is enabled, however, orderly2::orderly_resource calls must go before the files are used, as the files will only be made available at that point (see below).
id <- orderly2::orderly_run("incoming_data")
## ℹ Starting packet 'incoming_data' `20241106-132352-b404086e` at 2024-11-06 13:23:52.707717
## > orderly2::orderly_strict_mode()
## > orderly2::orderly_resource("data.csv")
## > orderly2::orderly_artefact("Processed data", "data.rds")
## Warning: Please use a named argument for the description in 'orderly_artefact()'
## In future versions of orderly, we will change the order of the arguments to
## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
## 'description' then you will be compatible when we make this change.
## > d <- read.csv("data.csv")
## > d$z <- resid(lm(y ~ x, d))
## > saveRDS(d, "data.rds")
## ✔ Finished running 'incoming_data.R'
## ! 1 warning found:
## • Please use a named argument for the description in 'orderly_artefact()' In
## future versions of orderly, we will change the order of the arguments to
## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
## 'description' then you will be compatible when we make this change.
## ℹ Finished 20241106-132352-b404086e at 2024-11-06 13:23:52.750877 (0.04315996 secs)
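The warning in this log is triggered by passing the description positionally. A sketch of the same declaration with both arguments named, using the argument names description and files that the warning text itself quotes, would be:

```r
# Same artefact declaration as above, with both arguments named so that the
# call does not depend on the order of the arguments
orderly2::orderly_artefact(description = "Processed data", files = "data.rds")
```

This line belongs inside the orderly file, in place of the positional call.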
Much of the flexibility that comes from the orderly graph comes from using parameterised reports; these are reports that take a set of parameters and then change behaviour based on these parameters. Downstream reports can depend on a parameterised report and filter based on suitable parameters.
For example, consider a simple report where we generate samples based on some parameter:
orderly2::orderly_parameters(n_samples = 10)
x <- seq_len(n_samples)
d <- data.frame(x = x, y = x + rnorm(n_samples))
saveRDS(d, "data.rds")
This creates a report that has a single parameter n_samples with a default value of 10. A parameter can also be defined with no default, and several parameters can be defined in a single call.
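A sketch of both variants follows. It assumes that a value of NULL marks a parameter as having no default, so that a value must be supplied when the report is run. The second parameter, distribution, is a hypothetical name used only for illustration.

```r
# Variant 1: a parameter with no default
# (assumption: NULL means "must be supplied at run time")
orderly2::orderly_parameters(n_samples = NULL)

# Variant 2: several parameters declared in one call
# (`distribution` is a made-up example parameter)
orderly2::orderly_parameters(n_samples = 10, distribution = "normal")
```

Only one of these calls would appear in a given orderly file.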
You can do anything in your report that switches on the value of a parameter:
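As a hedged sketch of such a switch, the snippet below picks a noise level from the parameter value. The threshold and the noise_sd variable are hypothetical, and n_samples is assigned directly here so that the snippet runs outside of orderly.

```r
# Stand-in for orderly2::orderly_parameters(n_samples = 10), so that this
# sketch can be run on its own
n_samples <- 10

# Hypothetical switch: use less noise once the sample is large
noise_sd <- if (n_samples > 100) 0.1 else 1

x <- seq_len(n_samples)
d <- data.frame(x = x, y = x + rnorm(n_samples, sd = noise_sd))
```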
However, you should see parameters as relatively heavyweight things and try to have a consistent set over all packets created from a report. In this report we use it to control the size of the generated data set.
id <- orderly2::orderly_run("random", list(n_samples = 15))
## ℹ Starting packet 'random' `20241106-132352-d2628a7b` at 2024-11-06 13:23:52.826673
## ℹ Parameters:
## • n_samples: 15
## > orderly2::orderly_parameters(n_samples = 10)
## > x <- seq_len(n_samples)
## > d <- data.frame(x = x, y = x + rnorm(n_samples))
## > saveRDS(d, "data.rds")
## ✔ Finished running 'random.R'
## ℹ Finished 20241106-132352-d2628a7b at 2024-11-06 13:23:52.857925 (0.03125191 secs)
Our resulting file has 15 rows, as the parameter we passed in affected the report:
orderly2::orderly_copy_files(id, files = c("random.rds" = "data.rds"),
dest = dest)
readRDS(file.path(dest, "random.rds"))
## x y
## 1 1 0.6035102
## 2 2 3.5660461
## 3 3 2.9115280
## 4 4 3.8792402
## 5 5 5.2213999
## 6 6 6.1335432
## 7 7 7.9798579
## 8 8 8.4627980
## 9 9 9.5190312
## 10 10 10.3391017
## 11 11 11.0803583
## 12 12 10.9912725
## 13 13 12.3441804
## 14 14 15.2351365
## 15 15 13.7726240
You can use these parameters in orderly’s search functions. For example we can find the most recent version of a packet by running:
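A sketch of that query, written by analogy with the parameterised query shown just below and assuming the working directory is still the orderly root:

```r
# Find the id of the most recent packet created by the "random" report
orderly2::orderly_search('latest(name == "random")')
```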
But we can also pass in parameter queries here:
orderly2::orderly_search('latest(name == "random" && parameter:n_samples > 10)')
## [1] "20241106-132352-d2628a7b"
These can be used within orderly2::orderly_dependency() (the name == "random" part is implied by the first name argument), for example:
orderly2::orderly_dependency("random", "latest(parameter:n_samples > 10)",
c("randm.rds" = "data.rds"))
In this case, if the report that you are querying from also has parameters, you can use these within the query, using the this prefix. So suppose our downstream report simply uses n for the number of samples; we might write:
orderly2::orderly_dependency("random", "latest(parameter:n_samples == this:n)",
c("randm.rds" = "data.rds"))
to depend on the most recent packet called random where it has a parameter n_samples which has the same value as the current report’s parameter n.
See the outpack query documentation for much more detail on this.
The previous version of orderly (orderly1; see vignette("migrating")) was very fussy about all input being strictly declared before a report could be run, so that it was clear what was really required in order to run something. From version 2 this is relaxed by default, but you can opt into most of the old behaviours and checks by adding a call to orderly2::orderly_strict_mode() anywhere within your orderly file (conventionally at the top). We may make this more granular in future, but by adding this we only copy files from the source directory (src/<reportname>/) to the draft directory where the report runs (draft/<reportname>/<packet-id>) that were declared with orderly2::orderly_resource; this leaves behind any extra files left over in development.
Using strict mode also helps orderly2 clean up the src/<reportname> directory more effectively after interactive development (see next section).
Set your working directory to src/<reportname> and any orderly script should be fully executable (e.g., source with RStudio’s Source button, or R’s source() function). Dependencies will be copied over as needed.
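A minimal sketch of such an interactive run for the incoming_data report, assuming the working directory is currently the orderly root:

```r
# Move into the report's source directory, remembering where we came from
old <- setwd(file.path("src", "incoming_data"))

# Run the orderly file in place; generated files (here data.rds) are left
# behind in the source directory
source("incoming_data.R")

# Return to the orderly root so that later commands still work
setwd(old)
```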
After doing this, you will have a mix of files within your source directory. We recommend a per-source-directory .gitignore which will keep these files out of version control (see below). Generated files can also be removed from this directory with orderly2::orderly_cleanup (see below).
For example, suppose that we have interactively run our incoming_data/incoming_data.R script; this would leave behind generated files. We can report on this with orderly2::orderly_cleanup_status:
orderly2::orderly_cleanup_status("incoming_data")
## ✖ incoming_data is not clean:
## ℹ 1 file can be deleted by running 'orderly2::orderly_cleanup("incoming_data")':
## • data.rds
If you have files here that are unknown to orderly it will tell you about them and prompt you to tell it about them explicitly.
You can clean up generated files by running (as suggested in the message):
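The call quoted in the status message above is:

```r
# Delete the generated files that orderly_cleanup_status() reported
orderly2::orderly_cleanup("incoming_data")
```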
There is a dry_run = TRUE argument you can pass if you want to see what would be deleted without using the status function.
You can also keep these files out of git by using the orderly2::orderly_gitignore_update function:
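A sketch of that call for the incoming_data report, assuming the function takes the report name as its first argument in the same way as orderly2::orderly_cleanup():

```r
# Create or update src/incoming_data/.gitignore so that generated files are
# ignored by git
orderly2::orderly_gitignore_update("incoming_data")
```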
This creates (or updates) a .gitignore
file within the
report so that generated files will not be included by git. If you have
already accidentally committed them then the gitignore has no real
effect and you should do some git surgery, see the git manuals or this
handy, if profane, guide.
If you delete packets from your archive/ directory then this puts orderly2 into an inconsistent state with its metadata store. Sometimes this does not matter (e.g., if you delete old copies that would never be candidates for inclusion with orderly2::orderly_dependency you will never notice). However, if you delete the most recent copy of a packet and then try to depend on it, you will get an error.
At the moment, we have two copies of the incoming_data task:
orderly2::orderly_metadata_extract(
name = "incoming_data",
extract = c(time = "time.start"))
## id time
## 1 20241106-132352-6b446405 2024-11-06 13:23:52
## 2 20241106-132352-b404086e 2024-11-06 13:23:52
When we run the analysis task, it will pull in the most recent version (20241106-132352-b404086e). However, if you had deleted this manually (e.g., to save space or accidentally) or corrupted it (e.g., by opening some output in Excel and letting it save changes) it will not be able to be included, and running analysis will fail:
orderly2::orderly_run("analysis")
## ℹ Starting packet 'analysis' `20241106-132353-4529e435` at 2024-11-06 13:23:53.275024
## > orderly2::orderly_dependency("incoming_data", "latest()",
## + c("incoming.rds" = "data.rds"))
## ✖ Error running 'analysis.R'
## ℹ Finished 20241106-132353-4529e435 at 2024-11-06 13:23:53.349101 (0.07407689 secs)
## Error in `orderly2::orderly_run()`:
## ! Failed to run report
## Caused by error in `orderly_copy_files()`:
## ! Unable to copy files, due to deleted packet 20241106-132352-b404086e
## ℹ Consider 'orderly2::orderly_validate_archive("20241106-132352-b404086e",
## action = "orphan")' to remove this packet from consideration
## Caused by error:
## ! File not found in archive
## ✖ data.rds
The error here tries to be fairly informative, telling us that we failed because when copying files from 20241106-132352-b404086e we found that the packet was corrupt, because the file data.rds was not found in the archive. It also suggests a fix; we can tell orderly2 that 20241106-132352-b404086e is “orphaned” and should not be considered for inclusion when we look for dependencies.
We can carry out the suggestion and validate just this packet by running orderly2::orderly_validate_archive("20241106-132352-b404086e", action = "orphan"), exactly as printed in the error message, or we can validate all the packets we have:
orderly2::orderly_validate_archive(action = "orphan")
## ✔ 20241106-132352-6b446405 (incoming_data) is valid
## ✔ 20241106-132352-9352c197 (analysis) is valid
## ✖ 20241106-132352-b404086e (incoming_data) is invalid due to its files
## ✔ 20241106-132352-d2628a7b (random) is valid
## ✔ 20241106-132353-01bd0bbc (use_shared) is valid
If we had the option core.require_complete_tree enabled, then this process would also look for any packets that used our now-deleted packet and orphan those too, as we no longer have a complete tree that includes them.
If you want to remove references to the orphaned packets, you can use orderly2::orderly_prune_orphans() to remove them entirely:
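A sketch of that call, assuming the working directory is still the orderly root:

```r
# Remove the metadata of packets that were previously marked as orphaned
orderly2::orderly_prune_orphans()
```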
(To be written)
Some guidelines:
Make sure to exclude some files from git by listing them in .gitignore:
* .outpack/ - nothing in here is suitable for version control
* archive/ - if you have core.archive_path set to a non-null value, this should be excluded. The default is archive
* draft/ - the temporary draft directory
* orderly_envir.yml - used for setting machine-specific configuration
You absolutely should version control some files:
* src/ - the main source of your analyses
* orderly_config.yml - this high level configuration is suitable for sharing
* (... orderly_config.yml) should probably be version controlled
Your source repository will end up on multiple people’s machines, each of which is configured differently. The configuration options set via orderly2::orderly_config_set are designed to be (potentially) different for different users, so this configuration must not be version controlled. It also means that reports/packets can’t directly refer to values set here. This includes the directory used to save archive packets (if enabled) and the names of locations (equivalent to git remotes).
You may find it useful to include scripts that help users set up common locations, but like with git, different users may use different names for the same remote (e.g., one user may have a location called data while for another it is called data-incoming, depending on their perspective about the use of the location).
orderly2 will always try to save information about the current state of the git source repository alongside the packet metadata. This includes the current branch, commit (sha) and remote url. This is to try to create links between the final version of the packet and the upstream source repository.
As alluded to above, the .outpack directory contains lots of information about packets that have been run, but is typically “out of bounds” for normal use. This is effectively the “database” of information about packets that have been run. Understanding how this directory is structured is not required for using orderly, but is included here for the avoidance of mystery! See the outpack documentation (vignette("outpack")) for more details about the ideas here.
After all the work above, our directory structure looks like:
## .outpack
## ├── config.json
## ├── index
## │ └── outpack.rds
## ├── location
## │ ├── local
## │ │ ├── 20241106-132352-6b446405
## │ │ ├── 20241106-132352-9352c197
## │ │ ├── 20241106-132352-d2628a7b
## │ │ └── 20241106-132353-01bd0bbc
## │ └── orphan
## └── metadata
## ├── 20241106-132352-6b446405
## ├── 20241106-132352-9352c197
## ├── 20241106-132352-d2628a7b
## └── 20241106-132353-01bd0bbc
As can perhaps be inferred from the filenames, the files .outpack/metadata/<packet-id> are the metadata for each packet as it has been run. The files .outpack/location/<location-id>/<packet-id> hold information about when the packet was first known about by a location (here the location is the special “local” location).
The default orderly configuration is to store the final files in a directory called archive/, but alternatively (or additionally) you can use a content-addressable file store. With this enabled, the .outpack directory looks like:
## .outpack
## ├── config.json
## ├── files
## │ └── sha256
## │ ├── 08
## │ │ └── 23ecf2c833d220e9cae85552974f1e6cd9ac37d2119a9b1e514ee425a1fc07
## │ ├── 0a
## │ │ └── a82571c21c4e5f1f435e8bef2328dda5ef47e177d78d63d1c4ec647a5a388a
## │ ├── 16
## │ │ └── bc89688963538171f38720fa1c257f6ea968be3c53225426ba5c94334842ce
## │ ├── 1e
## │ │ └── 187423f9509581f9355ae04f0e5e6f11e79d54088db9592fd5820470a7d7dd
## │ ├── 25
## │ │ └── 4947c281b203719c72949745123a1d017e2f9b50c048b1d24a0803d73ba0b8
## │ ├── 47
## │ │ └── efa92dfdb0c2b9b835605ca3d866a9aaf153fbd8e580c25039333827c5aed9
## │ ├── 5f
## │ │ └── 96f49230c2791c05706f24cb2335cd0fad5d3625dc6bca124c44a51857f3f8
## │ ├── 9e
## │ │ └── 2af7c62bc0c489705930add523acf70ccf4aaf9369bdc3bda9e075bf664990
## │ ├── ca
## │ │ └── 94e5fb0ce2a925cbc61998c6008e001007b7844f537696d94bf2d93db5e75c
## │ └── d9
## │ └── 1699ae410cbd811e1f028f8a732e5162b7df854eec08d921141f965851272d
## ├── index
## │ └── outpack.rds
## ├── location
## │ ├── local
## │ │ ├── 20241106-132352-6b446405
## │ │ ├── 20241106-132352-9352c197
## │ │ ├── 20241106-132352-d2628a7b
## │ │ └── 20241106-132353-01bd0bbc
## │ └── orphan
## └── metadata
## ├── 20241106-132352-6b446405
## ├── 20241106-132352-9352c197
## ├── 20241106-132352-d2628a7b
## └── 20241106-132353-01bd0bbc
The files under .outpack/files/
should never be modified
or deleted. This approach to storage naturally deduplicates the file
archive, so that a large file used in many places is only ever stored
once.
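To illustrate why content addressing deduplicates, here is a small base-R sketch. It is not orderly2’s own code: md5 stands in for sha256 purely to stay within base R, and the file names are hypothetical. Two files with identical contents hash to the same value, so they map to a single path in a layout like the one shown above.

```r
# Two files with identical content, written under different names
dir <- tempfile()
dir.create(dir)
a <- file.path(dir, "a.csv")
b <- file.path(dir, "b.csv")
writeLines(c("x,y", "1,2"), a)
writeLines(c("x,y", "1,2"), b)

# Hash the contents (md5 here is a stand-in for the sha256 used by the store)
hash <- unname(tools::md5sum(c(a, b)))

# Mimic the <first two characters>/<remaining characters> layout seen above
store_path <- file.path(substr(hash, 1, 2), substr(hash, 3, nchar(hash)))

# Both files map to the same stored object, so only one copy is needed
unique(store_path)
```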
orderly and outpack
The orderly2 package is built on a metadata and file storage system called outpack; we will be implementing support for working with these metadata archives in other languages (see outpack_server for our server implementation in Rust and outpack-py in Python). The metadata is discussed in more detail in vignette("metadata") and we will document the general ideas more fully at mrc-ide/outpack.