Parallel computing on a cluster can be more challenging than running things locally because it’s often the first time that you need to package up code to run elsewhere, and when things go wrong it’s more difficult to get information on why things failed.
Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can’t physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.
The hipercow
package aims to remove some of this pain,
with the aim that running a task on the cluster should be (almost) as
straightforward as running things locally, at least once some basic
setup is done.
At the moment, this document assumes that we will be using the “Windows” cluster, which implies the existence of some future non-Windows cluster. Stay tuned.
This manual is structured in escalating complexity, following the chain of things that a hypothetical user might encounter as they move from their first steps on the cluster through to running enormous batches of tasks.
Install the required packages from our “r-universe”. Be sure to run this in a fresh session.
install.packages(
"hipercow",
repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org"))
Once installed you can load the package with
or use the package by prefixing the calls below with
hipercow::
, as you prefer.
Follow any platform-specific instructions in
vignettes("<cluster>")
; this will depend on the
cluster you intend to use:
vignette("windows")
We need a concept of a “root”; the point in the filesystem we can think of everything relative to. This will feel familiar to you if you have used git or orderly, as these all have a root (and this root will be a fine place to put your cluster work). Typically all paths will be within this root directory, and paths above it, or absolute paths in general, effectively cease to exist. If your project works this way then it’s easy to move around, which is exactly what we need to do in order to run it on the cluster.
If you are using RStudio, then we strongly recommend using an RStudio project.
Run
hipercow_init()
#> ✔ Initialised hipercow at '.' (/tmp/RtmpIBxDPQ/file12d95b274c97)
#> ℹ Next, call 'hipercow_configure()'
which will write things to a new path hipercow/
within
your working directory.
After initialisation you will typically want to configure a “driver”, which controls how tasks are sent to clusters. At the moment the only option is the windows cluster so for practical work you would write:
however, for this vignette we will use a special “example” driver which simulates what the cluster will do (don’t use this for anything yourself, it really won’t help):
You can run initialisation and configuration in one step by running
After initialisation and configuration you can see the computed
configuration by running hipercow_configuration()
:
hipercow_configuration()
#>
#> ── hipercow root at /tmp/RtmpIBxDPQ/file12d95b274c97 ───────────────────────────
#> ✔ Working directory '.' within root
#> ℹ R version 4.4.2 on Linux (root@0eb07b8b4f99)
#>
#> ── Packages ──
#>
#> ℹ This is hipercow 1.0.39
#> ℹ Installed: conan2 (1.9.101), logwatch (0.1.1), rrq (0.7.21)
#> ✖ hipercow.windows is not installed
#>
#> ── Environments ──
#>
#> ── default
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#>
#> ── empty
#> • packages: (none)
#> • sources: (none)
#> • globals: (none)
#>
#> ── Drivers ──
#>
#> ✔ 1 driver configured ('example')
#>
#> ── example
#> (unconfigurable)
Here, you can see versions of important packages, information about
where you are working, and information about how you intend to interact
with the cluster. See vignette("windows")
for example
output you might expect on the Windows cluster, which includes
information about mapping of your paths onto those of the cluster, the
version of R you will use, and other information.
If you have issues with hipercow
we will always want to
see the output of hipercow_configuration()
.
The first time you use the tools (ever, in a while, or on a new machine) we recommend sending off a tiny task to make sure that everything is working as expected:
id <- task_create_expr(sessionInfo())
#> ✔ Submitted task 'babcaba4b51ab381e6c0bfa36a45bf4b' using 'example'
This creates a new task that will run the expression
sessionInfo()
on the cluster. The
task_create_expr()
function works by so-called “non standard
evaluation” and the expression is not evaluated from your R session,
but sent to run on another machine.
The id
returned is just an ugly hex string:
Many other functions accept this id
as an argument. You
can get the status of the task, which will have finished now because it
really does not take very long:
Once the task has completed you can inspect the result:
task_result(id)
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.4.2 cli_3.6.3 withr_3.0.2 hipercow_1.0.39
#> [5] rlang_1.1.4
Because we are using the “example” driver here, this is the same as
the result that you’d get running sessionInfo()
directly,
just with more steps. See vignette("windows")
for an
example that runs on Windows.
It’s unlikely that the code you want to run on the cluster is one of
the functions built into R itself; more likely you have written a
simulation or similar and you want to run that instead. In
order to do this, we need to tell the cluster where to find your code.
There are two broad places where code that you want to run is likely to
be found script files and packages; we
start with the former here, and deal with packages in much more detail
in vignette("packages")
.
Suppose you have a file simulation.R
containing some
simulation:
random_walk <- function(x, n_steps) {
ret <- numeric(n_steps)
for (i in seq_len(n_steps)) {
x <- rnorm(1, x)
ret[[i]] <- x
}
ret
}
We can’t run this on the cluster immediately, because the cluster does not know about the new function:
id <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task 'a4c3bb9a168050ee6274c891a886e8e7' using 'example'
task_wait(id)
#> [1] FALSE
task_status(id)
#> [1] "failure"
task_result(id)
#> <simpleError in random_walk(0, 10): could not find function "random_walk">
(See vignette("troubleshooting")
for more on
failures.)
We need to tell hipercow to source()
the file
simulation.R
before running the task. To do this we use
hipercow_environment_create()
to create an “environment”
(not to be confused with R’s environments) in which to run things:
Now we can run our simulation:
id <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task 'b0b6ce8cce6bedab8c1845a4d2ff3b3c' using 'example'
task_wait(id)
#> [1] TRUE
task_result(id)
#> [1] 0.4851598 2.3403751 2.3996791 2.1452466 1.6566564 1.6814119 1.3055185
#> [8] 3.3737229 3.5368047 3.8364824
We have more to write on environments but briefly:
parallel
, future
or
rrq
) will set up their environmentsOnce you have created (and submitted) tasks, they will be queued by the cluster and eventually run. The hope is that we surface enough information to make it easy for you to see how things are going and what has gone wrong.
task_info()
The primary function for fetching information about a task is
task_info()
:
task_info(id)
#>
#> ── task b0b6ce8cce6bedab8c1845a4d2ff3b3c (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: random_walk(0, 10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-11-22 08:31:11.477953 (moments ago)
#> ℹ Started at 2024-11-22 08:31:11.711176 (moments ago; waited 234ms)
#> ℹ Finished at 2024-11-22 08:31:11.950536 (moments ago; ran for 240ms)
This prints out core information about the task; its identifier
(b0b6ce8cce6bedab8c1845a4d2ff3b3c
) and status
(success
), along with information about what sort of task
it was, what expression it had, variables it used, the environment it
executed in and the time that key events happened for the task (when it
was created, started and finished).
This display is meant to be friendly; if you need to compute on this
information, you can access the times by reading the $times
element of the task_info()
return value:
task_info(id)$times
#> created started finished
#> "2024-11-22 08:31:11 UTC" "2024-11-22 08:31:11 UTC" "2024-11-22 08:31:11 UTC"
Likewise, the information about the task itself is within
$data
. To work with the underling data you might just
unclass the object to see the structure:
unclass(task_info(id))
#> $id
#> [1] "b0b6ce8cce6bedab8c1845a4d2ff3b3c"
#>
#> $status
#> [1] "success"
#>
#> $data
#> $data$type
#> [1] "expression"
#>
#> $data$id
#> [1] "b0b6ce8cce6bedab8c1845a4d2ff3b3c"
#>
#> $data$time
#> [1] "2024-11-22 08:31:11 UTC"
#>
#> $data$path
#> [1] "."
#>
#> $data$environment
#> [1] "default"
#>
#> $data$envvars
#> name value secret
#> 1 R_GC_MEM_GROW 3 FALSE
#>
#> $data$parallel
#> NULL
#>
#> $data$expr
#> random_walk(0, 10)
#>
#> $data$variables
#> NULL
#>
#>
#> $driver
#> [1] "example"
#>
#> $times
#> created started finished
#> "2024-11-22 08:31:11 UTC" "2024-11-22 08:31:11 UTC" "2024-11-22 08:31:11 UTC"
#>
#> $retry_chain
#> NULL
but note that the exact structure is subject to (infrequent) change.
task_log_show
Every task will produce some logs, and these can be an important part of understanding what they did and why they went wrong.
You can view the log with task_log_show()
task_log_show(id)
#>
#> ── hipercow 1.0.39 running at '/tmp/RtmpIBxDPQ/file12d95b274c97' ───────────────
#> ℹ library paths:
#> • /tmp/RtmpQVKIDr/Rinst12234d16684f
#> • /github/workspace/pkglib
#> • /usr/local/lib/R/site-library
#> • /usr/lib/R/site-library
#> • /usr/lib/R/library
#> ℹ id: b0b6ce8cce6bedab8c1845a4d2ff3b3c
#> ℹ starting at: 2024-11-22 08:31:11.711176
#> ℹ Task type: expression
#> • Expression: random_walk(0, 10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Loading environment 'default'...
#> • packages: (none)
#> • sources: simulation.R
#> • globals: (none)
#> ───────────────────────────────────────────────────────────────── task logs ↓ ──
#>
#> ───────────────────────────────────────────────────────────────── task logs ↑ ──
#> ✔ status: success
#> ℹ finishing at: 2024-11-22 08:31:11.711176 (elapsed: 0.243 secs)
This prints the contents of the logs to the screen; you can access
the values directly with task_log_value(id)
. The format of
the logs will be generally the same for all tasks; after the header
saying where we are running, some information about the task will be
printed (its identifier, the time, details about the task itself), then
any logs that come from calls to message()
and
print()
within the queued function (within the “task logs”
section; here that is empty because our task prints nothing). Finally, a
summary will be printed with the final status, final time (and elapsed
time), then any warnings that were produced will be flushed (see
vignette("troubleshooting")
for more on warnings).
There is a second log too, the “outer” log, which is generally less interesting so it is not the default. These logs come from the cluster scheduler itself and show the startup process that leads up to (and after) the code that hipercow itself runs. It will differ from driver to driver. In addition, this log may not be available forever; the windows cluster retains it only for a couple of weeks:
task_log_show(id, outer = TRUE)
#> Running task b0b6ce8cce6bedab8c1845a4d2ff3b3c
#> Finished task b0b6ce8cce6bedab8c1845a4d2ff3b3c
The logs returned by task_log_show(id, outer = FALSE)
are the logs generated by the statement containing
Rscript -e
.
task_log_watch
If your task is still running, you can stream logs to your computer
using task_log_watch()
; this will print new logs
line-by-line as they arrive (with a delay of up to 1s by default). This
can be useful while debugging something to give the illusion that you’re
running it locally.
Using Ctrl-C
(or ESC
in RStudio) to escape
will only stop log streaming and not the underlying task.
So far, the tasks we submitted have been run using a single core on
the cluster, with no special other requests made. Here is a simple
example using two cores; we’ll use hipercow_resources()
to
specify we want two cores on the cluster, and
hipercow_parallel()
to say that we want to set up two
processes on those cores, using the parallel
package. (We
also support future
).
resources <- hipercow_resources(cores = 2)
id <- task_create_expr(
parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5)),
parallel = hipercow_parallel("parallel"),
resources = resources)
#> ✔ Submitted task '81ea6ba90fbb050ed08ab94f57a3110e' using 'example'
task_wait(id)
#> [1] TRUE
task_info(id)
#>
#> ── task 81ea6ba90fbb050ed08ab94f57a3110e (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: parallel::clusterApply(NULL, 1:2, function(x) Sys.sleep(5))
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-11-22 08:31:12.670446 (moments ago)
#> ℹ Started at 2024-11-22 08:31:12.950037 (moments ago; waited 280ms)
#> ℹ Finished at 2024-11-22 08:31:18.513741 (moments ago; ran for 5.6s)
Both of our parallel tasks are to sleep for 5 seconds. We use
task_info()
to report how long it took for those two runs
to execute; if they ran one-by-one, we’d expect around 10 seconds, but
we are seeing a much shorter time than that, so our pair of processes
are running at the same time.
For details on specifying resources and launching different kinds of
parallel tasks, see vignette("parallel")
.
Suppose our simulation started not from 0, but from some point that
we have computed locally (say x
, imaginatively)
You can use this value to start the simulation by running:
id <- task_create_expr(random_walk(x, 10))
#> ✔ Submitted task '8f58f44d59d8bc8ffc857d719a921ce2' using 'example'
Here the x
value has come from the environment where the
expression passed into task_create_expr()
was found
(specifically, we use the rlang
“tidy evaluation” framework you might be familiar with from
dplyr
and friends).
task_wait(id)
#> [1] TRUE
task_result(id)
#> [1] 101.7804 102.6891 103.0728 102.2522 102.2473 101.9329 101.1864 101.7956
#> [9] 100.9852 101.7734
If you pass in an expression that references a value that does not exist locally, you will get a (hopefully) informative error message when the task is created:
You can cancel a task if it has been submitted and not completed,
using task_cancel()
:
For example, here’s a task that will sleep for 10 seconds, which we submit to the cluster:
id <- task_create_expr(Sys.sleep(10))
#> ✔ Submitted task '934fb3eb03af163329b5d5a92d128556' using 'example'
Having decided that this is a silly idea, we can try and cancel it:
task_cancel(id)
#> ✔ Successfully cancelled '934fb3eb03af163329b5d5a92d128556'
#> [1] TRUE
task_status(id)
#> [1] "cancelled"
task_info(id)
#>
#> ── task 934fb3eb03af163329b5d5a92d128556 (cancelled) ───────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: Sys.sleep(10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-11-22 08:31:19.824872 (moments ago)
#> ✖ Start time unknown!
#> ℹ Finished at 2024-11-22 08:31:19.847178 (moments ago; ran for ???)
You can cancel a task that is submitted (waiting to be picked up by a cluster) or running (though not all drivers will support this; we need to add this to the example driver still, which will improve this example!).
You can cancel many tasks at once by passing a vector of identifiers at the same time. Tasks that have finished (successfully or not) cannot be cancelled.
There are lots of reasons why you might want to retry a task. For example:
You can retry tasks with task_retry()
, which is easier
than submitting a new task with the same content, and also preserves a
link between retried tasks.
Our random walk will give slightly different results each time we use it, so we demonstrate the idea with that:
id1 <- task_create_expr(random_walk(0, 10))
#> ✔ Submitted task 'e39de045a4056c9f534903270fd0367f' using 'example'
task_wait(id1)
#> [1] TRUE
task_result(id1)
#> [1] 0.46617185 0.07548012 1.37135968 -0.01854800 -0.53296567 -0.40895560
#> [7] -0.96380711 -0.27986102 -1.87592889 -1.89406334
Here we ran a random walk and it got to -1.8940633, which is clearly not what we were expecting. Let’s try it again:
Running task_retry()
creates a new task, with a
new id a50986...
compared with e39de0...
.
Once this task has finished, we get a different result:
task_wait(id2)
#> [1] TRUE
task_result(id2)
#> [1] 1.013508 2.470639 1.827464 2.381489 2.151984 4.229362 5.265307 4.539213
#> [9] 5.152102 6.056290
Much better!
We get a hint that this is a retried task from the
task_info()
task_info(id2)
#>
#> ── task a50986b5247aa124471c733700dd1350 (success) ─────────────────────────────
#> ℹ Submitted with 'example'
#> ℹ Task type: expression
#> • Expression: random_walk(0, 10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Created at 2024-11-22 08:31:19.914456 (moments ago)
#> ℹ Started at 2024-11-22 08:31:21.218358 (moments ago; waited 1.3s)
#> ℹ Finished at 2024-11-22 08:31:21.456908 (moments ago; ran for 239ms)
#> ℹ Last of a chain of a task retried 1 time
You can see the full chain of retries here:
task_info(id2)$retry_chain
#> [1] "e39de045a4056c9f534903270fd0367f" "a50986b5247aa124471c733700dd1350"
Once a task has been retried it affects how you interact with the previous ids; by default they follow through to the most recent element in the chain:
task_result(id1)
#> [1] 1.013508 2.470639 1.827464 2.381489 2.151984 4.229362 5.265307 4.539213
#> [9] 5.152102 6.056290
task_result(id2)
#> [1] 1.013508 2.470639 1.827464 2.381489 2.151984 4.229362 5.265307 4.539213
#> [9] 5.152102 6.056290
You can get the original result back by passing the argument
follow = FALSE
:
task_result(id1, follow = FALSE)
#> [1] 0.46617185 0.07548012 1.37135968 -0.01854800 -0.53296567 -0.40895560
#> [7] -0.96380711 -0.27986102 -1.87592889 -1.89406334
task_result(id2)
#> [1] 1.013508 2.470639 1.827464 2.381489 2.151984 4.229362 5.265307 4.539213
#> [9] 5.152102 6.056290
Only tasks that have been completed (success
,
failure
or cancelled
) can be retried, and
doing so adds a new task to the end of the chain; there is no
branching. Retrying the id1
here would create the chain
id1 -> id2 -> id3
, and following would select
id3
for any of the three tasks in the chain.
You cannot currently change any property of a retried task, we may change this in future.