---
title: "hipercow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{hipercow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
source("common.R")
vignette_root <- new_hipercow_root_path()
fs::file_copy("simulation.R", vignette_root)
set_vignette_root(vignette_root)
cleanup <- withr::with_dir(
  vignette_root,
  hipercow::hipercow_example_helper(with_logging = TRUE,
                                    new_directory = FALSE,
                                    initialise = FALSE))
```

Parallel computing on a cluster can be more challenging than running things locally because it's often the first time that you need to package up code to run elsewhere, and when things go wrong it's more difficult to get information on why things failed.

Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can't physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.

The `hipercow` package aims to remove some of this pain, with the aim that running a task on the cluster should be (almost) as straightforward as running things locally, at least once some basic setup is done.

At the moment, this document assumes that we will be using the "Windows" cluster, which implies the existence of some future non-Windows cluster. Stay tuned.

This manual is structured in escalating complexity, following the chain of things that a hypothetical user might encounter as they move from their first steps on the cluster through to running enormous batches of tasks.

# Installing prerequisites

Install the required packages from our "r-universe". Be sure to run this in a fresh session.

```r
install.packages(
  "hipercow",
  repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org"))
```

Once installed you can load the package with

```{r}
library(hipercow)
```

or use the package by prefixing the calls below with `hipercow::`, as you prefer.

Follow any platform-specific instructions in the vignette for the cluster you intend to use:

* Windows: `vignette("windows")`

# Filesystems and paths

We need a concept of a "root": the point in the filesystem that we can think of everything as relative to. This will feel familiar to you if you have used git or orderly, as these all have a root (and this root will be a fine place to put your cluster work). Typically all paths will be *within* this root directory, and paths above it, or absolute paths in general, effectively cease to exist. If your project works this way then it's easy to move around, which is exactly what we need to do in order to run it on the cluster.

If you are using RStudio, then we strongly recommend using an [RStudio project](https://support.posit.co/hc/en-us/articles/200526207-Using-RStudio-Projects).

# Initialising

Run

```{r}
hipercow_init()
```

which will write things to a new path `hipercow/` within your working directory.

After initialisation you will typically want to configure a "driver", which controls how tasks are sent to clusters.
At the moment the only option is the Windows cluster, so for practical work you would write:

```r
hipercow_configure(driver = "windows")
```

However, for this vignette we will use a special "example" driver which simulates what the cluster will do (don't use this for anything yourself, it really won't help):

```{r}
hipercow_configure(driver = "example")
```

You can run initialisation and configuration in one step by running

```r
hipercow_init(driver = "windows")
```

After initialisation and configuration you can see the computed configuration by running `hipercow_configuration()`:

```{r}
hipercow_configuration()
```

Here, you can see versions of important packages, information about where you are working, and information about how you intend to interact with the cluster. See `vignette("windows")` for example output you might expect on the Windows cluster, which includes information about mapping of your paths onto those of the cluster, the version of R you will use, and other information.

If you have issues with `hipercow` we will always want to see the output of `hipercow_configuration()`.

# Running your first task

The first time you use the tools (ever, in a while, or on a new machine) we recommend sending off a tiny task to make sure that everything is working as expected:

```{r create}
id <- task_create_expr(sessionInfo())
```

This creates a new task that will run the expression `sessionInfo()` on the cluster. The `task_create_expr()` function works by so-called ["non-standard evaluation"](https://adv-r.hadley.nz/metaprogramming.html): the expression is not evaluated in your R session, but sent to run on another machine.

The `id` returned is just an ugly hex string:

```{r}
id
```

```{r, include = FALSE}
# Ensure that the task is finished, makes the rest of the doc nicer to read
task_wait(id)
```

Many other functions accept this `id` as an argument.
You can get the status of the task, which will have finished now because it really does not take very long:

```{r status}
task_status(id)
```

Once the task has completed you can inspect the result:

```{r result}
task_result(id)
```

Because we are using the "example" driver here, this is the same as the result that you'd get running `sessionInfo()` directly, just with more steps. See `vignette("windows")` for an example that runs on Windows.

# Using functions you have written

It's unlikely that the code you want to run on the cluster is one of the functions built into R itself; more likely you have written a simulation or similar and you want to run *that* instead. In order to do this, we need to tell the cluster where to find your code. There are two broad places where code that you want to run is likely to be found: **script files** and **packages**. We start with the former here, and deal with packages in much more detail in `vignette("packages")`.

Suppose you have a file `simulation.R` containing some simulation:

```{r, results = "asis", echo = FALSE}
r_output(readLines("simulation.R"))
```

We can't run this on the cluster immediately, because the cluster does not know about the new function:

```{r create_walk1}
id <- task_create_expr(random_walk(0, 10))
task_wait(id)
task_status(id)
task_result(id)
```

(See `vignette("troubleshooting")` for more on failures.)

We need to tell hipercow to `source()` the file `simulation.R` before running the task.
To do this we use `hipercow_environment_create()` to create an "environment" (not to be confused with R's environments) in which to run things:

```{r create_environment}
hipercow_environment_create(sources = "simulation.R")
```

Now we can run our simulation:

```{r create_walk2}
id <- task_create_expr(random_walk(0, 10))
task_wait(id)
task_result(id)
```

We have more to write on environments but briefly:

* You can have multiple environments and each task can be set to run in a different environment
* Each environment can source any number of source files, and load any number of packages
* This will become the mechanism by which environments on parallel workers (via `parallel`, `future` or `rrq`) will set up their environments

# Getting information about tasks

Once you have created (and submitted) tasks, they will be queued by the cluster and eventually run. The hope is that we surface enough information to make it easy for you to see how things are going and what has gone wrong.

## Fetching information with `task_info()`

The primary function for fetching information about a task is `task_info()`:

```{r}
task_info(id)
```

This prints out core information about the task: its identifier (`r inline(id)`) and status (`success`), along with information about what sort of task it was, what expression it had, variables it used, the environment it executed in and the time that key events happened for the task (when it was created, started and finished).

This display is meant to be friendly; if you need to compute on this information, you can access the times by reading the `$times` element of the `task_info()` return value:

```{r}
task_info(id)$times
```

Likewise, the information about the task itself is within `$data`. To work with the underlying data you might just unclass the object to see the structure:

```{r}
unclass(task_info(id))
```

but note that the exact structure is subject to (infrequent) change.
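
As a minimal sketch of computing on these times, the helper below derives how long a task waited in the queue and how long it ran. It assumes that `$times` is a named date-time vector with elements `created`, `started` and `finished` (the names are an assumption based on the events described above). Both `task_durations()` and the stand-in vector `example_times` are hypothetical and are not part of hipercow:

```r
# Hypothetical helper: summarise waiting and running time, in seconds.
# Assumes `times` is a named POSIXct vector with elements "created",
# "started" and "finished"; check the real structure before relying on this.
task_durations <- function(times) {
  c(waiting = as.numeric(difftime(times[["started"]], times[["created"]],
                                  units = "secs")),
    running = as.numeric(difftime(times[["finished"]], times[["started"]],
                                  units = "secs")))
}

# Made-up stand-in for task_info(id)$times
t0 <- as.POSIXct("2024-01-01 10:00:00", tz = "UTC")
example_times <- t0 + c(0, 2, 7)
names(example_times) <- c("created", "started", "finished")

task_durations(example_times)
```

With a real task you would pass `task_info(id)$times` instead of `example_times`, provided the structure matches the assumption above.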

## Fetching logs with `task_log_show`

Every task will produce some logs, and these can be an important part of understanding what they did and why they went wrong. You can view the log with `task_log_show()`:

```{r}
task_log_show(id)
```

This prints the contents of the logs to the screen; you can access the values directly with `task_log_value(id)`.

The format of the logs will be generally the same for all tasks. After the header saying where we are running, some information about the task will be printed (its identifier, the time, details about the task itself), then any logs that come from calls to `message()` and `print()` within the queued function (within the "task logs" section; here that is empty because our task prints nothing). Finally, a summary will be printed with the final status, final time (and elapsed time), then any warnings that were produced will be flushed (see `vignette("troubleshooting")` for more on warnings).

There is a second log too, the "outer" log, which is generally less interesting so it is not the default. These logs come from the cluster scheduler itself and show the startup process that leads up to (and after) the code that hipercow itself runs. It will differ from driver to driver. In addition, this log may not be available forever; the Windows cluster retains it only for a couple of weeks:

```{r}
task_log_show(id, outer = TRUE)
```

The logs returned by `task_log_show(id, outer = FALSE)` are the logs generated by the statement containing `Rscript -e`.

## Watching logs with `task_log_watch`

If your task is still running, you can stream logs to your computer using `task_log_watch()`; this will print new logs line-by-line as they arrive (with a delay of up to 1s by default). This can be useful while debugging something to give the illusion that you're running it locally. Using `Ctrl-C` (or `ESC` in RStudio) to escape will only stop log streaming and not the underlying task.

# Parallel tasks

So far, the tasks we submitted have been run using a single core on the cluster, with no special other requests made. Here is a simple example using two cores; we'll use `hipercow_resources()` to specify we want two cores on the cluster, and `hipercow_parallel()` to say that we want to set up two processes on those cores, using the `parallel` package. (We also support [`future`](https://future.futureverse.org/).)

```{r parallel_task}
resources <- hipercow_resources(cores = 2)
id <- task_create_expr(
  parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5)),
  parallel = hipercow_parallel("parallel"),
  resources = resources)
task_wait(id)
task_info(id)
```

Both of our parallel tasks are to sleep for 5 seconds. We use `task_info()` to report how long it took for those two runs to execute; if they ran one-by-one, we'd expect around 10 seconds, but we are seeing a much shorter time than that, so our pair of processes are running at the same time.

For details on specifying resources and launching different kinds of parallel tasks, see `vignette("parallel")`.

# Understanding where variables come from

Suppose our simulation started not from 0, but from some point that we have computed locally (say `x`, imaginatively):

```{r}
x <- 100
```

You can use this value to start the simulation by running:

```{r create_walk3}
id <- task_create_expr(random_walk(x, 10))
```

Here the `x` value has come from the environment where the expression passed into `task_create_expr()` was found (specifically, we use the [`rlang` "tidy evaluation"](https://rlang.r-lib.org/reference/topic-defuse.html) framework you might be familiar with from `dplyr` and friends).

```{r result3}
task_wait(id)
task_result(id)
```

If you pass in an expression that references a value that does not exist locally, you will get a (hopefully) informative error message when the task is created:

```{r, error = TRUE, eval = FALSE}
id <- task_create_expr(random_walk(starting_point, 10))
#> Error in `rlang::env_get_list()`:
#> ! Can't find `starting_point` in environment.
```

# Cancelling tasks

You can cancel a task if it has been submitted and not completed, using `task_cancel()`.

For example, here's a task that will sleep for 10 seconds, which we submit to the cluster:

```{r}
id <- task_create_expr(Sys.sleep(10))
```

Having decided that this is a silly idea, we can try to cancel it:

```{r}
task_cancel(id)
task_status(id)
task_info(id)
```

You can cancel a task that is submitted (waiting to be picked up by a cluster) or running (though not all drivers will support this; we need to add this to the example driver still, which will improve this example!).

You can cancel many tasks at once by passing a vector of identifiers at the same time. Tasks that have finished (successfully or not) cannot be cancelled.

# Retrying tasks

There are lots of reasons why you might want to retry a task. For example:

* it failed but you think it might work next time
* you updated a package that it used, and want to try again with the new version
* you don't like the output from some stochastic function and want to generate new output
* you cancelled the task but want to try again now

You can retry tasks with `task_retry()`, which is easier than submitting a new task with the same content, and also preserves a link between retried tasks.

Our random walk will give slightly different results each time we use it, so we demonstrate the idea with that:

```{r}
id1 <- task_create_expr(random_walk(0, 10))
task_wait(id1)
task_result(id1)
```

Here we ran a random walk and it got to `r tail(task_result(id1), 1)`, which is clearly not what we were expecting.
Let's try it again:

```{r}
id2 <- task_retry(id1)
```

Running `task_retry()` creates a *new* task, with a new id `r abbrev_id(id2)` compared with `r abbrev_id(id1)`.

Once this task has finished, we get a different result:

```{r}
task_wait(id2)
task_result(id2)
```

Much better!

We get a hint that this is a retried task from `task_info()`:

```{r}
task_info(id2)
```

You can see the full chain of retries here:

```{r}
task_info(id2)$retry_chain
```

Once a task has been retried it affects how you interact with the previous ids; by default they follow through to the most recent element in the chain:

```{r}
task_result(id1)
task_result(id2)
```

You can get the original result back by passing the argument `follow = FALSE`:

```{r}
task_result(id1, follow = FALSE)
task_result(id2)
```

Only tasks that have been completed (`success`, `failure` or `cancelled`) can be retried, and doing so adds a new task to the *end* of the chain; there is no branching. Retrying the `id1` here would create the chain `id1 -> id2 -> id3`, and following would select `id3` for any of the three tasks in the chain.

You cannot currently change any property of a retried task, though we may change this in future.

```{r, include = FALSE}
cleanup()
```
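
The "follow" rule described above can be modelled in a few lines of R. This is a toy sketch of the behaviour, not hipercow's implementation: `follow_chain()` is a hypothetical helper, and the chain is a plain character vector of placeholder ids ordered from the original task to the most recent retry:

```r
# Toy model of following a retry chain; not hipercow's implementation.
# `chain` is a character vector ordered from original task to latest retry.
follow_chain <- function(id, chain, follow = TRUE) {
  if (!id %in% chain) {
    stop(sprintf("Task '%s' is not in this chain", id))
  }
  # Because chains never branch, following always lands on the last element
  if (follow) chain[[length(chain)]] else id
}

chain <- c("id1", "id2", "id3")  # placeholder ids
follow_chain("id1", chain)                  # resolves to "id3"
follow_chain("id1", chain, follow = FALSE)  # stays at "id1"
```

Because retries are only ever appended to the end of the chain, resolving any id with `follow = TRUE` amounts to taking the final element, which is why all three ids in the chain resolve to the same task.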