---
title: "Collaborative analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Collaborative analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
source("common.R")
path <- tempfile()
fs::dir_create(path)
path_git <- file.path(path, "git")
path_alice <- file.path(path, "alice")
path_bob <- file.path(path, "bob")
path_server <- file.path(path, "server")
path_sharepoint_alice <- file.path(path, "sharepoint")
path_sharepoint_bob <- path_sharepoint_alice
src <- system.file("example", package = "orderly2", mustWork = TRUE)
fs::dir_copy(src, path_git)
gert::git_init(path_git)
gert::git_add(".", repo = path_git)
user <- "Alice "
sha <- gert::git_commit("initial", author = user, committer = user,
                        repo = path_git)
if (gert::git_branch(repo = path_git) == "master") {
  gert::git_branch_move("master", "main", repo = path_git)
}
gert::git_clone(path_git, path_alice)
gert::git_clone(path_git, path_bob)
gert::git_clone(path_git, path_server)
orderly2::orderly_init(path_server, use_file_store = TRUE, path_archive = NULL)

knitr::opts_hooks$set(as = function(options) {
  options$class.source <- options$as
  options$class.output <- options$as
  options
})
```

```{css, echo=FALSE}
.sourceCode {
  position: relative;
}
.alice {
  /* border: 3px solid red; */
  /* background-color: lightpink; */
  /* font-weight: bold; */
  padding-right: 30px
}
div pre.alice:after {
  content:"Alice";
  position: absolute;
  right: 0px;
  top: 0px;
  font-size: 80%;
  font-weight: normal;
  font-family: sans-serif;
  padding: 0 5px;
  color: hotpink;
}
.bob {
  /* background-color: lightblue; */
  /* border: 3px solid blue; */
  /* font-weight: bold; */
  padding-right: 30px
}
div pre.bob:after {
  content:"Bob";
  position: absolute;
  right: 0px;
  top: 0px;
  font-size: 80%;
  font-weight: normal;
  font-family: sans-serif;
  padding: 0 5px;
  color: blue;
}
```

One of the core purposes of `orderly2` is to allow collaborative workflows.
For this to work we need to be able to find packets that were run by someone else, somewhere else, or to share packets that we have run with others. To do that, `orderly2` has the idea of "locations".

Locations behave in a similar way to "remotes" in `git`, and some of the terminology for interacting with them is similar (notably we can `pull` from and perhaps `push` to a location). However, we have used a different name because every `orderly2` working directory has at least one location - the `local` location - which is not remote at all. We also imagine that almost everyone using `orderly2` will be using `git`, and as `git` remotes and `orderly2` locations refer to totally different things, this slightly different terminology may help keep things clearer.

Packets that are distributed by locations might have been created by `orderly2`, but may have been created by some other system that implements the `outpack` spec (see `vignette("metadata")`). As a result, throughout this documentation you will see references to software or details to do with `outpack`, but if you only use `orderly2` then you can read that as being synonymous with `orderly2`.

# Type of location

`orderly2` supports two types of locations by default:

* `path`: these are any `orderly2` working copy that you can find on your filesystem. Note that this could be a copy that other people can see (for example on a network share, or on a cloud-synced file system such as Dropbox or OneDrive); see [Sharing packets with collaborators using a shared file system] for more details
* `http`: these require running an HTTP API, either via [`outpack_server`](https://github.com/mrc-ide/outpack_server) or [`packit`](https://github.com/mrc-ide/packit)

The location system is somewhat extensible, but the details of this are subject to change.

All the documentation below will behave the same regardless of where the location is stored and the mechanism of transport, so we can focus instead on workflows.

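As a rough illustration of how the two built-in types differ only in the arguments they take, the sketch below builds the `type` and `args` values that `orderly2::orderly_location_add()` is given later in this vignette. The helper `location_spec()` is hypothetical (it is not part of `orderly2`), and the directory `/shared/orderly` is made up.

```r
# Hypothetical helper, not part of orderly2: builds the type and argument
# list for one of the two built-in location types.
location_spec <- function(type, target) {
  type <- match.arg(type, c("path", "http"))
  # A path location is identified by a directory, an http location by a URL.
  args <- if (type == "path") list(path = target) else list(url = target)
  list(type = type, args = args)
}

shared <- location_spec("path", "/shared/orderly")  # made-up directory
server <- location_spec("http", "http://orderly.example.com")

# Registering a location needs an initialised orderly2 root, so this call is
# left commented out:
# orderly2::orderly_location_add("server", server$type, server$args)
```

Everything after the location has been added (querying, pulling, pushing) is then written against the location's name only, which is why the workflows below do not depend on the type.
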
# An example

Suppose we have two researchers, [Alice and Bob](https://en.wikipedia.org/wiki/Alice_and_Bob), who are collaborating on a piece of analysis. For the purposes of this example (and, we find, the most common situation in practice) they share a source tree with their analysis in git. Both Alice and Bob clone this repo onto their machines, using their favourite git client (at the terminal with `git`, via GitHub Desktop, or as here with [`gert`](https://r-lib.github.io/gert)).

At the moment, including hidden files (except `.git`), Alice sees:

```{r, as = "alice", inwd = path_alice}
fs::dir_tree(".", all = TRUE, glob = ".git", invert = TRUE)
```

At this point, just after cloning, Alice does not have the `.outpack` directory (see below), so `orderly2` commands fail:

```{r, as = "alice", inwd = path_alice, error = TRUE}
orderly2::orderly_list_src()
```

Alice needs to run `orderly2::orderly_init()` first:

```{r, as = "alice", inwd = path_alice}
orderly2::orderly_init(".")
orderly2::orderly_list_src()
```

The plan is to work with Bob, sharing results on their shared server, so Alice adds that to her `orderly2` configuration with `orderly2::orderly_location_add()`:

```{r, include = FALSE, inwd = path_alice}
orderly2::orderly_location_add(
  "server", "path", list(path = path_server))
```

```{r, eval = FALSE, as = "alice"}
orderly2::orderly_location_add(
  "server", "http", list(url = "http://orderly.example.com"))
```

Once done, she can run any analysis:

```{r, as = "alice", inwd = path_alice}
id <- orderly2::orderly_run("data")
```

Perhaps it takes several goes for Alice to be happy with the analysis, but at some point she has something ready to share. She can then "push" the final packet up onto their server:

```{r, as = "alice", inwd = path_alice}
orderly2::orderly_location_push(id, "server")
```

Now, consider Bob.
He also needs the source code cloned, orderly initialised, and the server added:

```{r, as = "bob", inwd = path_bob}
orderly2::orderly_init(".")
```

```{r, include = FALSE, inwd = path_bob}
orderly2::orderly_location_add(
  "server", "path", list(path = path_server))
```

```{r, eval = FALSE, as = "bob"}
orderly2::orderly_location_add(
  "server", "http", list(url = "http://orderly.example.com"))
```

(Here, Bob has also used the name `server` to refer to their shared server, but could have used anything; this is similar to git's use of `origin`.)

Bob can now query for packets available on the server:

```{r, as = "bob", inwd = path_bob}
orderly2::orderly_metadata_extract(
  name = "data",
  options = list(allow_remote = TRUE, pull_metadata = TRUE))
```

Having seen there is a new "data" packet here, he can pull this down locally (TODO: mrc-4414 makes this nicer):

```{r, as = "bob", inwd = path_bob}
orderly2::orderly_location_pull_packet(id)
```

Now Bob is in a position to develop against the same packet that Alice ran (`r id`).

# Possible working patterns

We have seen several broad patterns of distributing packets.

**Central server, push allowed**: This is the case as above; this is very flexible but requires that everyone is relaxed about how packets are created, as you are ultimately trusting that your colleagues are exercising good environment hygiene. That the computer does not enforce any rules does not stop the group from agreeing on who can or should push. It may be useful to only push from your HPC system after running computationally demanding tasks there.

**Central server, final copy is run here**: In this case, Alice and Bob can both run things on the server, and these are the "canonical" copies; nothing is pushed to the server.
Reports can be run on the server by `ssh`-ing onto the server and running packets manually (TODO: mrc-4412) or using the web interface (this was possible with `OrderlyWeb` and will be possible again with `packit`). This approach works well where a primary goal is confidence that the packets everyone works with as dependencies are created under a "clean" environment with no unexpected global dependencies. With the web version, you can also enforce things like only running off the default branch. Alice and Bob will then end up with a collection of packets in their local archives that are a mix of canonical ones (from the server) and locally-created ones, but the local ones are always a dead end as they never get shared with anyone. As a result, Alice and Bob may delete their archive directories without any great concern.

When running their own packets, to make sure that the packet will run in a similar way to a packet run on the server, Bob may want to ensure that only dependencies that could be found on the server will be considered. To do that, he can create search options

```{r, as="bob"}
search_options <- orderly2::orderly_search_options(location = "server")
```

and then pass this in to `orderly2::orderly_run` when running a report. In developing, Bob may find passing this into `orderly2::orderly_interactive_set_search_options()` helpful as that will mean that any calls to `orderly2::orderly_dependency()` will use only packets from the server.

**Staging and production servers**: You might set up multiple servers, one or more for staging, and one for production. The staging servers are never the source of truth and you might (or might not) allow people to push to them, or have less strict rules about what gets run on them (perhaps allowing running packets from a feature branch). In this setting, users will end up with packets that were created locally, from staging machines and from the production server in their local archive.
The staging server will probably end up with a mix of its own packets and packets from production. The same approach with specifying search options as above will be useful here for choosing what is included in any final packet.

# Pulling complete trees, or not

When you pull packets from a server (with `orderly2::orderly_location_pull_packet()`), depending on your settings you will either end up with just the packet that you request, or that packet **plus** all of its recursive dependencies. This is controlled by the configuration option `core.require_complete_tree`, which is `FALSE` by default; this is suitable for most users, but servers should be set with `core.require_complete_tree = TRUE`.

In most of the situations above, the users are never holding the canonical source of the packets - so it does not matter if they don't have all of their dependencies. They do hold all the metadata for all packets (again, recursively) and could download the dependencies later if they wanted to inspect them. Typically though, this would just be extra space used up on disk.

When you push a packet to a server, it always pushes the complete tree.

# Moving as little data as possible

When pulling packets, first the user pulls metadata from the server. When `orderly2::orderly_location_pull_metadata` is run, all metadata from the server is updated on the client (mrc-4444 will make this lazier). We now know everything the server holds, and the hashes of all the files contained in each packet.

It is very likely that we already know of some of the files within a packet being requested, so the client first looks to see if these exist within its archive, verifying that they have not been changed. It then requests from the server all the files that it cannot resolve locally. If doing a recursive pull (with `core.require_complete_tree = TRUE`) then we look at the union over all files in the set of packets that are currently missing from our archive.
This is frequently much less than the full set of files within the packets.

The algorithm is similar for a push:

* look at the full dependency chain of our target packet, find all packet ids, and ask the server which of these it does not yet know about
* in the unknown subset we get the union of file hashes and ask the server which subset of files it does not yet know about
* we then push up files - this is done one at a time at present but could be done simultaneously with async requests in future
* once all files are on the server we upload the missing metadata, in chronological order, and the server verifies that for each packet it has metadata for all dependencies and all referenced files

This multi-step process means that we avoid copying data that is already known about, as well as avoiding moving the same data multiple times. It also means that if the process of pushing (or pulling) is interrupted it can be safely resumed as the state is always consistent.

# Interaction between source tree and `.outpack`

It is important that the `.outpack` directory is *not* shared via git; we warn about this now, and you can use `orderly2::orderly_gitignore_update()` to automatically create a suitable `.gitignore` file that will prevent it being accidentally committed.

However, if Alice and Bob were starting on a new machine they would:

* clone their source repository
* run `orderly2::orderly_init()` from within the directory
* run `orderly2` commands as normal

Without running `orderly2::orderly_init()`, they will get an error prompting them to initialise the repo.

There is no requirement for different repositories to share configuration options passed to `orderly2::orderly_init()`; so you can enable or disable the file store or have different sets of locations enabled, depending on your workflows.
This means that there is, in practice, only an association _by convention_ between a set of related `orderly2` locations and their source tree and you can have an `orderly2` repository that points at locations that refer to different source trees!

# Sharing packets with collaborators using a shared file system

One of the simplest ways to share packets with a collaborator is through a shared file system, for example on a network share, or on a cloud-synced file system such as Dropbox or OneDrive. You can do this as follows.

1. Create a new folder for your orderly remote in the shared file system. Here Alice has synced to `path_sharepoint_alice`.

1. Initialise an orderly location on the shared file system

   ```{r as="alice"}
   orderly2::orderly_init(
     root = path_sharepoint_alice,
     path_archive = NULL,
     use_file_store = TRUE,
     require_complete_tree = TRUE
   )
   ```

   This creates an orderly store with a file store and a complete tree. See `orderly2::orderly_init()` for more details.

1. Add this as a location

   ```{r, as="alice", inwd = path_alice}
   orderly2::orderly_location_add(
     "sharepoint", "path", list(path = path_sharepoint_alice))
   ```

1. Push any packets you want to share

   ```{r, as="alice", inwd = path_alice}
   orderly2::orderly_location_push(id, "sharepoint")
   ```

Then these will be available for your collaborator to pull. Note that the data is pushed into a file store in the `.outpack` directory. Files and directories whose names start with a dot are treated as hidden by your operating system, so if you have "show hidden files" off in your file browser you will not see the pushed packet. Your collaborator can nevertheless pull the shared packet. To do so, they will have to:

1. Sync the same drive to a location on their machine. Here Bob has synced to `path_sharepoint_bob`

1. Add the location

   ```{r, as="bob", inwd = path_bob}
   orderly2::orderly_location_add(
     "alices_orderly", "path", list(path = path_sharepoint_bob))
   ```

1. Pull the metadata and use the packets as desired

   ```{r, as="bob", inwd = path_bob}
   orderly2::orderly_location_pull_metadata(
     "alices_orderly"
   )
   ```
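
When Bob goes on to pull packets from this shared location, the file-level bookkeeping described in "Moving as little data as possible" applies here too. The toy sketch below illustrates that idea in base R only: the hashes and object names are made up, and none of it is `orderly2` API or a description of its internals.

```r
# Toy illustration with made-up hashes; these objects are not orderly2 API.
# Files referenced by the packets being pulled, keyed by packet.
wanted <- list(
  packet_a = c("sha256:aaa", "sha256:bbb"),
  packet_b = c("sha256:bbb", "sha256:ccc")
)

# File hashes already present (and verified) in the local archive.
local_files <- c("sha256:bbb")

# Take the union over all requested packets, so that a file shared between
# packets is only considered once.
needed <- unique(unlist(wanted, use.names = FALSE))

# Only the files that cannot be resolved locally are requested.
to_fetch <- setdiff(needed, local_files)
to_fetch
```

Here three distinct files are referenced across the two packets, one is already held locally, and so only two would need to be copied from the shared location.
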