hipercow

Parallel computing on a cluster can be more challenging than running things locally because it’s often the first time that you need to package up code to run elsewhere, and when things go wrong it’s more difficult to get information on why things failed.

Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can’t physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.

The hipercow package aims to remove some of this pain, with the aim that running a task on the cluster should be (almost) as straightforward as running things locally, at least once some basic setup is done.

At the moment, this document assumes that we will be using the “Windows” cluster, which implies the existence of some future non-Windows cluster. Stay tuned.

This manual is structured in escalating complexity, following the chain of things that a hypothetical user might encounter as they move from their first steps on the cluster through to running enormous batches of tasks.

Installing prerequisites

Install the required packages from our “r-universe”. Be sure to run this in a fresh session.

install.packages(
  "hipercow",
  repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org"))

Once installed you can load the package with

library(hipercow)

or use the package by prefixing the calls below with hipercow::, as you prefer.

Follow any platform-specific instructions in vignettes("<cluster>"); this will depend on the cluster you intend to use:

  • Windows: vignette("windows")

Filesystems and paths

We need a concept of a “root”; the point in the filesystem we can think of everything relative to. This will feel familiar to you if you have used git or orderly, as these all have a root (and this root will be a fine place to put your cluster work). Typically all paths will be within this root directory, and paths above it, or absolute paths in general, effectively cease to exist. If your project works this way then it’s easy to move around, which is exactly what we need to do in order to run it on the cluster.

If you are using RStudio, then we strongly recommend using an RStudio project.

Initialising

Run

hipercow_init()
#> ✔ Initialised hipercow at '.' (/tmp/RtmpIBxDPQ/file12d95b274c97)
#> ℹ Next, call 'hipercow_configure()'

which will write things to a new path hipercow/ within your working directory.

After initialisation you will typically want to configure a “driver”, which controls how tasks are sent to clusters. At the moment the only option is the windows cluster so for practical work you would write:

hipercow_configure(driver = "windows")

however, for this vignette we will use a special “example” driver which simulates what the cluster will do (don’t use this for anything yourself, it really won’t help):

hipercow_configure(driver = "example")
#> ✔ Configured hipercow to use 'example'

You can run initialisation and configuration in one step by running

hipercow_init(driver = "windows")

After initialisation and configuration you can see the computed configuration by running hipercow_configuration():

hipercow_configuration()
#> 
#> ── hipercow root at /tmp/RtmpIBxDPQ/file12d95b274c97 ───────────────────────────
#> ✔ Working directory '.' within root
#> ℹ R version 4.4.2 on Linux (root@0eb07b8b4f99)
#> 
#> ── Packages ──
#> 
#> ℹ This is hipercow 1.0.39
#> ℹ Installed: conan2 (1.9.101), logwatch (0.1.1), rrq (0.7.21)
#> ✖ hipercow.windows is not installed
#> 
#> ── Environments ──
#> 
#> ── default
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#> 
#> ── empty
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#> 
#> ── Drivers ──
#> 
#> ✔ 1 driver configured ('example')
#> 
#> ── example
#> (unconfigurable)

Here, you can see versions of important packages, information about where you are working, and information about how you intend to interact with the cluster. See vignette("windows") for example output you might expect on the Windows cluster, which includes information about mapping of your paths onto those of the cluster, the version of R you will use, and other information.

If you have issues with hipercow we will always want to see the output of hipercow_configuration().

Running your first task

The first time you use the tools (ever, in a while, or on a new machine) we recommend sending off a tiny task to make sure that everything is working as expected:

id <- task_create_expr(sessionInfo())
#> ✔ Submitted task 'babcaba4b51ab381e6c0bfa36a45bf4b' using 'example'

This creates a new task that will run the expression sessionInfo() on the cluster. The task_create_expr() function works by so-called “non standard evaluation” and the expression is not evaluated from your R session, but sent to run on another machine.

The id returned is just an ugly hex string:

id
#> [1] "babcaba4b51ab381e6c0bfa36a45bf4b"

Many other functions accept this id as an argument. You can get the status of the task, which will have finished now because it really does not take very long:

task_status(id)
#> [1] "success"

Once the task has completed you can inspect the result:

task_result(id)
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#> [1] compiler_4.4.2  cli_3.6.3       withr_3.0.2     hipercow_1.0.39
#> [5] rlang_1.1.4

Because we are using the “example” driver here, this is the same as the result that you’d get running sessionInfo() directly, just with more steps. See vignette("windows") for an example that runs on Windows.

Using functions you have written

It’s unlikely that the code you want to run on the cluster is one of the functions built into R itself; more likely you have written a simulation or similar and you want to run that instead. In order to do this, we need to tell the cluster where to find your code. There are two broad places where code that you want to run is likely to be found script files and packages; we start with the former here, and deal with packages in much more detail in vignette("packages").

Suppose you have a file simulation.R containing some simulation:

random_walk <- function(x, n_steps) {
  ret <- numeric(n_steps)
  for (i in seq_len(n_steps)) {
    x <- rnorm(1, x)
    ret[[i]] <- x
  }
  ret
}

We can’t run this on the cluster immediately, because the cluster does not know about the new function:

id <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task 'a4c3bb9a168050ee6274c891a886e8e7' using 'example'
task_wait(id)
#> [1] FALSE
task_status(id)
#> [1] "failure"
task_result(id)
#> <simpleError in random_walk(0, 10): could not find function "random_walk">

(See vignette("troubleshooting") for more on failures.)

We need to tell hipercow to source() the file simulation.R before running the task. To do this we use hipercow_environment_create() to create an “environment” (not to be confused with R’s environments) in which to run things:

hipercow_environment_create(sources = "simulation.R")
#> ✔ Created environment 'default'

Now we can run our simulation:

id <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task 'b0b6ce8cce6bedab8c1845a4d2ff3b3c' using 'example'
task_wait(id)
#> [1] TRUE
task_result(id)
#>  [1] 0.4851598 2.3403751 2.3996791 2.1452466 1.6566564 1.6814119 1.3055185
#>  [8] 3.3737229 3.5368047 3.8364824

We have more to write on environments but briefly:

  • You can have multiple environments and each task can be set to run in a different environment
  • Each environment can source any number of source files, and load any number of packages
  • This will become the mechanism by which environments on parallel workers (via parallel, future or rrq) will set up their environments

Getting information about tasks

Once you have created (and submitted) tasks, they will be queued by the cluster and eventually run. The hope is that we surface enough information to make it easy for you to see how things are going and what has gone wrong.

Fetching information with task_info()

The primary function for fetching information about a task is task_info():

task_info(id)
#> 
#> ── task b0b6ce8cce6bedab8c1845a4d2ff3b3c (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#>   • Expression: random_walk(0, 10)
#>   • Locals: (none)
#>   • Environment: default
#>     R_GC_MEM_GROW: 3
#> ℹ Created at 2024-11-22 08:31:11.477953 (moments ago)
#> ℹ Started at 2024-11-22 08:31:11.711176 (moments ago; waited 234ms)
#> ℹ Finished at 2024-11-22 08:31:11.950536 (moments ago; ran for 240ms)

This prints out core information about the task; its identifier (b0b6ce8cce6bedab8c1845a4d2ff3b3c) and status (success), along with information about what sort of task it was, what expression it had, variables it used, the environment it executed in and the time that key events happened for the task (when it was created, started and finished).

This display is meant to be friendly; if you need to compute on this information, you can access the times by reading the $times element of the task_info() return value:

task_info(id)$times
#>                   created                   started                  finished 
#> "2024-11-22 08:31:11 UTC" "2024-11-22 08:31:11 UTC" "2024-11-22 08:31:11 UTC"

Likewise, the information about the task itself is within $data. To work with the underling data you might just unclass the object to see the structure:

unclass(task_info(id))
#> $id
#> [1] "b0b6ce8cce6bedab8c1845a4d2ff3b3c"
#> 
#> $status
#> [1] "success"
#> 
#> $data
#> $data$type
#> [1] "expression"
#> 
#> $data$id
#> [1] "b0b6ce8cce6bedab8c1845a4d2ff3b3c"
#> 
#> $data$time
#> [1] "2024-11-22 08:31:11 UTC"
#> 
#> $data$path
#> [1] "."
#> 
#> $data$environment
#> [1] "default"
#> 
#> $data$envvars
#>            name value secret
#> 1 R_GC_MEM_GROW     3  FALSE
#> 
#> $data$parallel
#> NULL
#> 
#> $data$expr
#> random_walk(0, 10)
#> 
#> $data$variables
#> NULL
#> 
#> 
#> $driver
#> [1] "example"
#> 
#> $times
#>                   created                   started                  finished 
#> "2024-11-22 08:31:11 UTC" "2024-11-22 08:31:11 UTC" "2024-11-22 08:31:11 UTC" 
#> 
#> $retry_chain
#> NULL

but note that the exact structure is subject to (infrequent) change.

Fetching logs with task_log_show

Every task will produce some logs, and these can be an important part of understanding what they did and why they went wrong.

You can view the log with task_log_show()

task_log_show(id)
#> 
#> ── hipercow 1.0.39 running at '/tmp/RtmpIBxDPQ/file12d95b274c97' ───────────────
#> ℹ library paths:
#> • /tmp/RtmpQVKIDr/Rinst12234d16684f
#> • /github/workspace/pkglib
#> • /usr/local/lib/R/site-library
#> • /usr/lib/R/site-library
#> • /usr/lib/R/library
#> ℹ id: b0b6ce8cce6bedab8c1845a4d2ff3b3c
#> ℹ starting at: 2024-11-22 08:31:11.711176
#> ℹ Task type: expression
#> • Expression: random_walk(0, 10)
#> • Locals: (none)
#> • Environment: default
#>   R_GC_MEM_GROW: 3
#> ℹ Loading environment 'default'...
#> • packages: (none)
#> • sources: simulation.R
#> • globals: (none)
#> ───────────────────────────────────────────────────────────────── task logs ↓ ──
#> 
#> ───────────────────────────────────────────────────────────────── task logs ↑ ──
#> ✔ status: success
#> ℹ finishing at: 2024-11-22 08:31:11.711176 (elapsed: 0.243 secs)

This prints the contents of the logs to the screen; you can access the values directly with task_log_value(id). The format of the logs will be generally the same for all tasks; after the header saying where we are running, some information about the task will be printed (its identifier, the time, details about the task itself), then any logs that come from calls to message() and print() within the queued function (within the “task logs” section; here that is empty because our task prints nothing). Finally, a summary will be printed with the final status, final time (and elapsed time), then any warnings that were produced will be flushed (see vignette("troubleshooting") for more on warnings).

There is a second log too, the “outer” log, which is generally less interesting so it is not the default. These logs come from the cluster scheduler itself and show the startup process that leads up to (and after) the code that hipercow itself runs. It will differ from driver to driver. In addition, this log may not be available forever; the windows cluster retains it only for a couple of weeks:

task_log_show(id, outer = TRUE)
#> Running task b0b6ce8cce6bedab8c1845a4d2ff3b3c
#> Finished task b0b6ce8cce6bedab8c1845a4d2ff3b3c

The logs returned by task_log_show(id, outer = FALSE) are the logs generated by the statement containing Rscript -e.

Watching logs with task_log_watch

If your task is still running, you can stream logs to your computer using task_log_watch(); this will print new logs line-by-line as they arrive (with a delay of up to 1s by default). This can be useful while debugging something to give the illusion that you’re running it locally.

Using Ctrl-C (or ESC in RStudio) to escape will only stop log streaming and not the underlying task.

Parallel tasks

So far, the tasks we submitted have been run using a single core on the cluster, with no special other requests made. Here is a simple example using two cores; we’ll use hipercow_resources() to specify we want two cores on the cluster, and hipercow_parallel() to say that we want to set up two processes on those cores, using the parallel package. (We also support future).

resources <- hipercow_resources(cores = 2)
id <- task_create_expr(
  parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5)),
  parallel = hipercow_parallel("parallel"),
  resources = resources)
#> ✔ Submitted task '81ea6ba90fbb050ed08ab94f57a3110e' using 'example'
task_wait(id)
#> [1] TRUE
task_info(id)
#> 
#> ── task 81ea6ba90fbb050ed08ab94f57a3110e (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#>   • Expression: parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5))
#>   • Locals: (none)
#>   • Environment: default
#>     R_GC_MEM_GROW: 3
#> ℹ Created at 2024-11-22 08:31:12.670446 (moments ago)
#> ℹ Started at 2024-11-22 08:31:12.950037 (moments ago; waited 280ms)
#> ℹ Finished at 2024-11-22 08:31:18.513741 (moments ago; ran for 5.6s)

Both of our parallel tasks are to sleep for 5 seconds. We use task_info() to report how long it took for those two runs to execute; if they ran one-by-one, we’d expect around 10 seconds, but we are seeing a much shorter time than that, so our pair of processes are running at the same time.

For details on specifying resources and launching different kinds of parallel tasks, see vignette("parallel").

Understanding where variables come from

Suppose our simulation started not from 0, but from some point that we have computed locally (say x, imaginatively)

x <- 100

You can use this value to start the simulation by running:

id <- task_create_expr(random_walk(x, 10))
#> ✔ Submitted task '8f58f44d59d8bc8ffc857d719a921ce2' using 'example'

Here the x value has come from the environment where the expression passed into task_create_expr() was found (specifically, we use the rlang “tidy evaluation” framework you might be familiar with from dplyr and friends).

task_wait(id)
#> [1] TRUE
task_result(id)
#>  [1] 101.7804 102.6891 103.0728 102.2522 102.2473 101.9329 101.1864 101.7956
#>  [9] 100.9852 101.7734

If you pass in an expression that references a value that does not exist locally, you will get a (hopefully) informative error message when the task is created:

id <- task_create_expr(random_walk(starting_point, 10))
#> Error in `rlang::env_get_list()`:
#> ! Can't find `starting_point` in environment.

Cancelling tasks

You can cancel a task if it has been submitted and not completed, using task_cancel():

For example, here’s a task that will sleep for 10 seconds, which we submit to the cluster:

id <- task_create_expr(Sys.sleep(10))
#> ✔ Submitted task '934fb3eb03af163329b5d5a92d128556' using 'example'

Having decided that this is a silly idea, we can try and cancel it:

task_cancel(id)
#> ✔ Successfully cancelled '934fb3eb03af163329b5d5a92d128556'
#> [1] TRUE
task_status(id)
#> [1] "cancelled"
task_info(id)
#> 
#> ── task 934fb3eb03af163329b5d5a92d128556 (cancelled) ───────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#>   • Expression: Sys.sleep(10)
#>   • Locals: (none)
#>   • Environment: default
#>     R_GC_MEM_GROW: 3
#> ℹ Created at 2024-11-22 08:31:19.824872 (moments ago)
#> ✖ Start time unknown!
#> ℹ Finished at 2024-11-22 08:31:19.847178 (moments ago; ran for ???)

You can cancel a task that is submitted (waiting to be picked up by a cluster) or running (though not all drivers will support this; we need to add this to the example driver still, which will improve this example!).

You can cancel many tasks at once by passing a vector of identifiers at the same time. Tasks that have finished (successfully or not) cannot be cancelled.

Retrying tasks

There are lots of reasons why you might want to retry a task. For example:

  • it failed but you think it might work next time
  • you updated a package that it used, and want to try again with the new version
  • you don’t like the output from some stochastic function and want to generate new output
  • you cancelled the task but want to try again now

You can retry tasks with task_retry(), which is easier than submitting a new task with the same content, and also preserves a link between retried tasks.

Our random walk will give slightly different results each time we use it, so we demonstrate the idea with that:

id1 <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task 'e39de045a4056c9f534903270fd0367f' using 'example'
task_wait(id1)
#> [1] TRUE
task_result(id1)
#>  [1]  0.46617185  0.07548012  1.37135968 -0.01854800 -0.53296567 -0.40895560
#>  [7] -0.96380711 -0.27986102 -1.87592889 -1.89406334

Here we ran a random walk and it got to -1.8940633, which is clearly not what we were expecting. Let’s try it again:

id2 <- task_retry(id1)
#> ✔ Submitted task 'a50986b5247aa124471c733700dd1350' using 'example'

Running task_retry() creates a new task, with a new id a50986... compared with e39de0....

Once this task has finished, we get a different result:

task_wait(id2)
#> [1] TRUE
task_result(id2)
#>  [1] 1.013508 2.470639 1.827464 2.381489 2.151984 4.229362 5.265307 4.539213
#>  [9] 5.152102 6.056290

Much better!

We get a hint that this is a retried task from the task_info()

task_info(id2)
#> 
#> ── task a50986b5247aa124471c733700dd1350 (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#>   • Expression: random_walk(0, 10)
#>   • Locals: (none)
#>   • Environment: default
#>     R_GC_MEM_GROW: 3
#> ℹ Created at 2024-11-22 08:31:19.914456 (moments ago)
#> ℹ Started at 2024-11-22 08:31:21.218358 (moments ago; waited 1.3s)
#> ℹ Finished at 2024-11-22 08:31:21.456908 (moments ago; ran for 239ms)
#> ℹ Last of a chain of a task retried 1 time

You can see the full chain of retries here:

task_info(id2)$retry_chain
#> [1] "e39de045a4056c9f534903270fd0367f" "a50986b5247aa124471c733700dd1350"

Once a task has been retried it affects how you interact with the previous ids; by default they follow through to the most recent element in the chain:

task_result(id1)
#>  [1] 1.013508 2.470639 1.827464 2.381489 2.151984 4.229362 5.265307 4.539213
#>  [9] 5.152102 6.056290
task_result(id2)
#>  [1] 1.013508 2.470639 1.827464 2.381489 2.151984 4.229362 5.265307 4.539213
#>  [9] 5.152102 6.056290

You can get the original result back by passing the argument follow = FALSE:

task_result(id1, follow = FALSE)
#>  [1]  0.46617185  0.07548012  1.37135968 -0.01854800 -0.53296567 -0.40895560
#>  [7] -0.96380711 -0.27986102 -1.87592889 -1.89406334
task_result(id2)
#>  [1] 1.013508 2.470639 1.827464 2.381489 2.151984 4.229362 5.265307 4.539213
#>  [9] 5.152102 6.056290

Only tasks that have been completed (success, failure or cancelled) can be retried, and doing so adds a new task to the end of the chain; there is no branching. Retrying the id1 here would create the chain id1 -> id2 -> id3, and following would select id3 for any of the three tasks in the chain.

You cannot currently change any property of a retried task, we may change this in future.