The orderly2 package is currently the reference implementation of the outpack specification: a collection of schemas and directory structures that outpack requires. Once we release (or possibly before), we will split this specification from the package, though the package will continue to bundle a copy.
We make use of JSON schema to document the schemas used.
This vignette outlines the basic structure of files within the .outpack/ directories, and is not itself an overview of how outpack works; the primary audience is people working on outpack itself (though a small introduction is provided below).
Each “packet” is conceptually a directory, corresponding to a particular analysis or data product, though this is not necessarily how it is stored. The internal representation includes:
Every packet is referenced uniquely by a primary key. We use a key format that encodes the current date and time, as well as random data to avoid collisions.
There exists some dependency graph among packets, as one packet depends on another. Each edge of this graph has a hard link (from one packet to another by an id) and also a query (e.g., latest packet with some name) which was used to find the packet. This means that there are many ways of looking at or thinking about the dependency graph.
Not all packets are available locally; some are on other outpack repositories, typically (but not always) on other machines and accessed over an HTTP API. These are conceptually similar to git “remotes”.
We will need to distinguish between packets which are “unpacked” (that is, packets with every file available in the current archive) and packets that are merely known about (those for which we have the metadata but not the files). We will sometimes refer to these unpacked packets as “local” as they are known to the “local” location which is special.
We use the terms “archive” and “repository” fairly interchangeably below and will try and nail that down.
Each packet must have a few things:
A name (e.g., model_fits). This cannot be changed (or rather changes cannot be tracked) and there is not currently a way of namespacing this between different repositories.
A set of files, each with a path (e.g., output/data.csv) and also a hash (e.g., sha256:69f6cf230416cf40828da251a0dad17cbbf078587883e826f3345ff08d1aaa7d).
In addition it may contain information about:
There are a few types of “persona” of outpack user that we imagine exist and which guide some decisions about layout below. At the extremes we have:
This impacts two configuration options and associated parts of the directory structure below:
We expect the first persona to want the human-readable archive and not to hold a full tree, while the second wants the opposite.
This section discusses the files and directories that make outpack work, but not so much how these come to be; see below for that.
A typical .outpack directory layout looks like this:
.outpack/
  config.json
  files/
  location/
  metadata/
archive/
(note that archive/ and .outpack here are at the same level). Not all of these directories will necessarily be present; indeed the only required file is .outpack/config.json.
.outpack/config.json
The outpack configuration schema is defined in config.json.
The configuration format is still subject to change…
.outpack/metadata/
Each file within this directory has a filename that is an outpack id (matching the regular expression ^[0-9]{8}-[0-9]{6}-[0-9a-f]{8}$; see below). Each file is a JSON file conforming to the schema metadata.json.
Being present here means that an outpack implementation can report information back about a packet (when it was created, what files it contains, etc), but packet metadata are not very meaningful on their own; we want to know where they might have come from (a location that is distributing this packet) and if we have a copy of the packet locally.
.outpack/location/
Each directory within this directory has a name matching the regular expression ^[0-9a-f]{8}$ (e.g., 457f4f2a), which is a “location id” (see below) corresponding to a “location”. Each file within such a directory has an outpack id as its name, and contains JSON recording when that location unpacked (or installed) the packet, and the hash of the metadata. This file conforms to the schema location.json.
.outpack/files
If the configuration option core.use_file_store is true, then outpack keeps a content-addressable file store of all files that it knows about. This is much more space efficient than having the entire packet unpacked as it automatically deduplicates shared content among packets (e.g., if a large file is present in two packets it will only be stored once). The file store layout is described below.
This storage format is not human-readable (and indeed is present only within the hidden directory .outpack). It can be enabled on either server or user
archive/ (by default)
If the configuration option core.path_archive is non-null then there will be a directory with that path containing unpacked packets. Each packet will be available at the path
archive/<name>/<id>/<files...>
with <name> being the “name” of the packet and <id> being its outpack id. There will be several files per packet, possibly themselves in directories. This storage approach is designed to be human readable, and will typically only be enabled where the outpack repository is being used on a laptop where a user wants to interactively work with files.
In order to make a packet available locally, you need to import the metadata and the files, then mark the packet as available. This will be roughly the same if you are creating a packet (i.e., you are the first place where a packet has ever existed) or if you are importing a packet from elsewhere.
Making the packet available allows it to be used as a dependency, allows serving that packet if you are acting as a location (over the http or file protocols), and guarantees that the files are actually present locally.
You can simply copy metadata as the file .outpack/metadata/<packet id> if it does not yet exist. This does not make it available to anything yet as it is not known from any location. Dangling metadata (that is, metadata present in this directory but not known anywhere) is currently mostly ignored.
If the repository uses a file store, you should fill this first, because it is much easier to think about. You can easily get the difference between files used by a packet (the list of files in the packet manifest) and what you already have in the file store by looking up each hash in turn. You should then request any missing files and insert them into the store. This may leave “dangling” files for a while (files referred to by no packet) but that is not a problem.
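The lookup described above can be sketched as follows. This is a minimal illustration rather than the orderly2 implementation: the function names and the manifest shape (a list of entries with "path" and "hash" keys) are assumptions made for the example, and the store layout follows the file store description later in this document.

```python
from pathlib import Path


def store_path(store_root: Path, hash_str: str) -> Path:
    """Map "<algorithm>:<hex digest>" to its location in the file store."""
    algorithm, _, digest = hash_str.partition(":")
    return store_root / algorithm / digest[:2] / digest[2:]


def missing_hashes(manifest: list[dict], store_root: Path) -> list[str]:
    """Return the hashes named in the manifest that the store does not hold.

    `manifest` is assumed (hypothetically) to be a list of
    {"path": ..., "hash": ...} entries.
    """
    # Use a set so that content shared by several paths is checked only once.
    wanted = {entry["hash"] for entry in manifest}
    return sorted(h for h in wanted if not store_path(store_root, h).exists())
```

The returned hashes are the ones that would then be requested from a location and inserted into the store.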
If the repository has a human-readable archive and uses a file store, then after the files are all present in the file store it is easy enough to check them out of the file store to the requested path (the local relative path in the packet manifest). Because you update the file store first, all files are guaranteed to be present.
If the repository only uses a human readable archive, the simplest thing is to request each file from the remote. However, it might be more efficient to check locally for any previously fetched copies of files with the same content, verify that they have not been modified, and then copy those into place rather than re-downloading.
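A rough sketch of the "verify, then copy" idea for an archive-only repository is shown below. The helper names are hypothetical, and how candidate local copies are found is left out; the sketch only shows checking a candidate against the expected hash before reusing it.

```python
import hashlib
import shutil
from pathlib import Path


def file_hash(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in chunks and return "<algorithm>:<hex digest>"."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return f"{algorithm}:{h.hexdigest()}"


def copy_if_unmodified(candidate: Path, expected_hash: str, dest: Path) -> bool:
    """Copy `candidate` to `dest` only if its content still matches.

    Returns False (and copies nothing) if the candidate is absent or has
    been modified, in which case the caller would fetch from the remote.
    """
    algorithm = expected_hash.split(":", 1)[0]
    if not candidate.is_file() or file_hash(candidate, algorithm) != expected_hash:
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(candidate, dest)
    return True
```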
For your local location id, write out a file .outpack/location/<local location id>/<packet id> conforming to the location.json schema, and containing the packet id, the time that it was marked as unpacked and the hash of the metadata.
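As an illustration, marking a packet as unpacked might look like the sketch below. The JSON key names ("packet", "time", "hash") are assumptions made for the example and should be checked against location.json; the function name is hypothetical, the path follows the .outpack/location/ layout described earlier, and hashing the raw bytes of the metadata file is likewise an assumption.

```python
import hashlib
import json
import time
from pathlib import Path


def mark_unpacked(root, location_id: str, packet_id: str) -> Path:
    """Record that `packet_id` is unpacked at the local location."""
    root = Path(root)
    metadata = root / ".outpack" / "metadata" / packet_id
    if not metadata.is_file():
        raise FileNotFoundError(f"no metadata for packet {packet_id}")
    digest = hashlib.sha256(metadata.read_bytes()).hexdigest()
    entry = {
        "packet": packet_id,          # hypothetical key name
        "time": time.time(),          # seconds since the epoch, fractional
        "hash": f"sha256:{digest}",   # hash of the metadata, algorithm-prefixed
    }
    dest = root / ".outpack" / "location" / location_id / packet_id
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(json.dumps(entry))
    return dest
```

Refusing to write the entry when the metadata is absent reflects the rule that a packet should only be marked once its components are present; a fuller version would also check the files.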
We only need both files and metadata once the packet is marked as unpacked; note that some configurations guarantee that every packet is unpacked in a complete tree.
You can import files first, or metadata; there is not a lot of disadvantage to either. You should only mark a packet unpacked and known locally though once both components are present.
Outpack ids match the regular expression ^[0-9]{8}-[0-9]{6}-[0-9a-f]{8}$; they encode the UTC date-time as the prefix YYYYMMDD-HHMMSS- and are followed by 8 hexadecimal digits. In the R implementation, we encode the current second as the first four digits (2 bytes) and append 2 bytes of cryptographically random data.
The id tries to balance a reasonable degree of collision resistance (65536 combinations per millisecond), lexicographic sortability and a reasonable degree of meaningfulness.
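To make the format concrete, here is a small sketch that generates and validates ids of this shape. It is not the R implementation: in particular, deriving the first two bytes from the sub-second part of the clock is an assumption about what "encode the current second" means, and the function names are invented for the example.

```python
import re
import secrets
from datetime import datetime, timezone

ID_PATTERN = re.compile(r"^[0-9]{8}-[0-9]{6}-[0-9a-f]{8}$")


def make_id(now=None) -> str:
    """Build an id: UTC YYYYMMDD-HHMMSS prefix plus 8 hex digits."""
    if now is None:
        now = datetime.now(timezone.utc)
    now = now.astimezone(timezone.utc)
    prefix = now.strftime("%Y%m%d-%H%M%S")
    # Assumption: scale the fraction of the second into 2 bytes (0..65535).
    frac = now.microsecond * 65536 // 1_000_000
    rand = secrets.token_hex(2)  # 2 bytes of cryptographically random data
    return f"{prefix}-{frac:04x}{rand}"


def is_outpack_id(value: str) -> bool:
    """Check a string against the id regular expression."""
    return ID_PATTERN.match(value) is not None
```

Because the timestamp comes first, ids built this way sort lexicographically in creation order, at least down to the resolution of the clock.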
Location ids are meaningless 4-byte (8 character) hex strings. They are immutable once created and are different between different machines even if they point to the same location. This location id is then mapped (via .outpack/config.json) to a location name which is a human-readable name (e.g., production or staging). There is no requirement that this name is the same for different machines.
One of these directories represents the local location; you can find that mapping within the configuration.
Outpack typically uses sha256 hashes, but we want to be able to change this in future. So wherever a hash is presented, the algorithm is included as part of the string. For example
sha256:69f6cf230416cf40828da251a0dad17cbbf078587883e826f3345ff08d1aaa7d
If we had instead used the md5 algorithm we would have written
md5:bd57f7123c6bfb95c3234ff56373b7f4
The schema currently assumes that the hash value is represented as a hex string.
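A minimal sketch of producing and parsing such strings is given below. The function names are hypothetical; only the "<algorithm>:<hex digest>" format comes from the text above.

```python
import hashlib
import re

HEX = re.compile(r"^[0-9a-f]+$")


def hash_data(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hash of `data` with the algorithm name as a prefix."""
    h = hashlib.new(algorithm)
    h.update(data)
    return f"{algorithm}:{h.hexdigest()}"


def parse_hash(value: str) -> tuple[str, str]:
    """Split "<algorithm>:<hex digest>" into its two parts."""
    algorithm, sep, digest = value.partition(":")
    if not sep or not algorithm or not HEX.match(digest):
        raise ValueError(f"not a valid hash string: {value!r}")
    return algorithm, digest
```

Keeping the algorithm in the string means a reader never has to guess how a digest was produced, which is what allows the default to change later.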
We store information about times in a few places (e.g., times that a packet was run, imported, etc). Rather than trying to deal with strings, we always store time in seconds since 1970-01-01 00:00:00 UTC (including fractional seconds, to whatever accuracy your system allows).
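For illustration, converting between this representation and a date-time object could look like the following sketch (the function names are invented for the example):

```python
import time
from datetime import datetime, timezone


def now_seconds() -> float:
    """Current time as fractional seconds since 1970-01-01 00:00:00 UTC."""
    return time.time()


def to_datetime(seconds: float) -> datetime:
    """Convert stored seconds into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def from_datetime(value: datetime) -> float:
    """Convert a timezone-aware datetime back to seconds since the epoch."""
    return value.timestamp()
```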
The file store is designed to be simple, and is not as sophisticated as that in git, whose object store does a similar thing.
The general layout looks like:
.outpack/files
  sha256/
    5d/
      dfaf1f4a2e15e8fe46dbed145bf2f84bba1b3367d0a56f73de08f8585dd153
      ...
    77/
      ...
The structure is hopefully fairly obvious. Paths have the format:
<algorithm>/<first two hex characters>/<remaining hex characters>
The reason for the second level is to prevent performance degradation with directories containing millions of files, again copying git.
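The mapping from a hash string to a path in this layout can be sketched as below. The function name and the default root are assumptions for the example; the split after the first two hex characters follows the layout shown above.

```python
from pathlib import PurePosixPath


def store_path(hash_str: str, root: str = ".outpack/files") -> PurePosixPath:
    """Return the file store path for "<algorithm>:<hex digest>"."""
    algorithm, sep, digest = hash_str.partition(":")
    if not sep or not algorithm or len(digest) <= 2:
        raise ValueError(f"not a valid hash string: {hash_str!r}")
    # The first two hex characters name the directory, the rest the file.
    return PurePosixPath(root) / algorithm / digest[:2] / digest[2:]
```

With two hex characters at the second level there are at most 256 directories per algorithm, which keeps any single directory from growing too large.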
The store is designed to cope with different hashing algorithms, though the R implementation of outpack only supports sha256 for now.
Multiple hashing algorithms could be supported by hard linking content into multiple places within the tree, so we might link
sha256/5d/dfaf1f4a2e15e8fe46dbed145bf2f84bba1b3367d0a56f73de08f8585dd153
as
md5/84/0bc6ad3ae479dccc1c49a1910b37bd
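A sketch of how such linking might work is shown below. This describes a possibility rather than current behaviour: the function name is hypothetical, and it assumes the store lives on a filesystem that supports hard links and that the algorithm name is one accepted by Python's hashlib.

```python
import hashlib
import os
from pathlib import Path


def link_under(store_root: Path, existing: Path, algorithm: str) -> Path:
    """Hard link a stored file under its digest for another algorithm."""
    digest = hashlib.new(algorithm, existing.read_bytes()).hexdigest()
    dest = store_root / algorithm / digest[:2] / digest[2:]
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        # Both names now refer to the same content on disk; no data is copied.
        os.link(existing, dest)
    return dest
```

Because a hard link is only a second name for the same data, supporting an extra algorithm this way costs one directory entry per file rather than a second copy.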