---
title: "Using didehpc to run cluster jobs"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Using didehpc to run cluster jobs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Parallel computing on a cluster can be more challenging than running things locally because it's often the first time that you need to package up code to run elsewhere, and when things go wrong it's more difficult to get information on why they failed. Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can't physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.

This set of packages ([`didehpc`](https://github.com/mrc-ide/didehpc), [`queuer`](https://github.com/mrc-ide/queuer) and [`context`](https://github.com/mrc-ide/context), along with the support packages [`conan`](https://github.com/mrc-ide/conan), [`rrq`](https://github.com/mrc-ide/rrq) and [`storr`](https://github.com/richfitz/storr)) aims to remove the pain of getting everything set up, getting cluster tasks running, and retrieving your results. Once everything is set up, running a job on the cluster should be as straightforward as running things locally.

The documentation here runs through a few of the key concepts, then walks through setting this all up. There's also a "quick start" guide that contains much less discussion.

## Functions

The biggest conceptual move is from thinking about running **scripts** that generate *files* to running **functions** that return *objects*. The reason for this is that it gives a well defined interface to build everything else around.

The problem with scripts is that they might do almost anything. They depend on untold files and packages which they load wherever. They produce any number of objects. That's fine, but it becomes hard to reason about them: to plan deploying them elsewhere, to capture the outputs appropriately, or to orchestrate looping over a bunch of parameter values. If you've found yourself writing a number of script files that differ only by text substitution of a few values, you have run into this.

In contrast, functions do (ideally) one thing. They have a well defined set of inputs (their arguments) and outputs (their return value). We can loop over a range of input values by iterating over a set of arguments, as the sketch below shows.

This set of packages tends to work best if you let it look after filenames. Rather than trying to come up with a naming scheme for different files based on parameter values, just return objects and the packages will arrange for them to be saved and reloaded.
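To make this concrete, here is a minimal sketch of the style this document encourages; `simulate_one` is a hypothetical stand-in for your own slow computation:

```r
# A hypothetical simulation: well defined inputs, one returned object
simulate_one <- function(sigma) {
  rnorm(100, sd = sigma)
}

# Looping over parameter values is just iteration over arguments;
# no filenames or text substitution required
results <- lapply(c(0.1, 1, 10), simulate_one)
```

Later sections show how `obj$lapply` applies this same pattern on the cluster.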
## Filesystems

The DIDE cluster needs everything to be available on a filesystem that the cluster can read. Practically this means the filesystems `//fi--didef3.dide.ic.ac.uk/tmp` or `//fi--san03.dide.ic.ac.uk/homes/username` and the like. You probably have access to network shares that are specific to a project, too. For Windows users these are probably mapped to drives (`Q:` or `T:` or similar) already, but for other platforms you will need to do a little extra work to get things set up (see below).

It is simplest if *everything* that is needed for a project is present in a single directory that is visible on the cluster, and for the rest of this document I will assume that everything is in one directory, which is on a network share.

**IMPORTANT**: If you are not sure if you are running on a network share, run `getwd()`; if you are on Windows the drive letter should show something like `Q:` or some other drive that represents a network drive. If it says `C:` or similar, *nothing below here will work*.

# Getting started

The initial setup will feel like a headache at first, but it should ultimately take only a few lines. Once everything is set up, the payback is that job submission becomes a lot simpler.

## Installation

Install the packages using [`drat`](https://cran.rstudio.com/package=drat):

```r
# install.packages("drat") # if you don't have it already
drat:::add("mrc-ide")
install.packages("didehpc")
```

Be sure to run this in a fresh session.

## Configuration

The configuration is handled in a two stage process. First, some bits that are machine specific are set using `options`, with option names prefixed with `didehpc`. Then, when a queue is created, further values can be passed along via the `config` argument, which will use the "global" options as defaults. The reason for this separation is that ideally the machine-specific options will not end up in scripts, because that makes things less portable (for example, we need to get your username, but your username is unlikely to work for your collaborators).

Ideally, in your `~/.Rprofile` file you will add something like:

```r
options(
  didehpc.username = "rfitzjoh",
  didehpc.home = "~/net/home")
```

and then set only options (such as cluster, cores or template) that vary with a project. If you use the "big" cluster, you can add `didehpc.cluster = "fi--didemrchnb"` here. (To set this up, try running `usethis::edit_r_profile()`.)

### Credentials

Windows users will not need to provide anything unless they are on a non-domain machine or they are in the unfortunate situation of juggling multiple usernames across systems. Non-domain machines will need the credentials set as above.

Mac users will need to provide their username, as above.

If you have a Linux system and have configured your smb mounts as described below, you might as well take advantage of this and set `credentials = "~/.smbcredentials"`, and you will never be prompted for your password:

```r
options(didehpc.credentials = "~/.smbcredentials")
```

### Seeing the default configuration

To see the configuration that will be used if you don't do anything (else), run:

```r
didehpc::didehpc_config()
#> <didehpc_config>
#>  - cluster: fi--dideclusthn
#>  - credentials:
#>     - username: rfitzjoh
#>     - password: *******************
#>  - username: rfitzjoh
#>  - resource:
#>     - template: GeneralNodes
#>     - parallel: FALSE
#>     - count: 1
#>     - type: Cores
#>  - shares:
#>     - home: (local) /home/rich/net/home => \\fi--san03.dide.ic.ac.uk\homes\rfitzjoh => Q: (remote)
#>     - temp: (local) /home/rich/net/temp => \\fi--didef3.dide.ic.ac.uk\tmp => T: (remote)
#>  - use_workers: FALSE
#>  - use_rrq: FALSE
#>  - worker_timeout: 600
#>  - conan_bootstrap: TRUE
#>  - r_version: 4.0.3
#>  - use_java: FALSE
#>  - redis_host: fi--dideclusthn.dide.ic.ac.uk
```

In here you can see the cluster (here, `fi--dideclusthn`), credentials and username, the job template (`GeneralNodes`), information about the resources that will be requested (1 core) and information on filesystem mappings. There are a few other bits of information that are explained further down. The possible options are explained further in the help for `didehpc::didehpc_config`.

If you request help, we will almost always want to see this!
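Project-specific settings are best passed to `didehpc::didehpc_config` directly rather than set globally. For example (a sketch; the values are illustrative), to use the "big" cluster and request more cores for one project only:

```r
# Machine-specific options (username etc) still come from ~/.Rprofile;
# only the project-specific bits are given here
config <- didehpc::didehpc_config(cluster = "fi--didemrchnb", cores = 4)
```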
### Additional shares

If you refer to network shares in your functions, e.g., to refer to data, you'll need to map these too. To do that, pass them as the `shares` argument to `didehpc::didehpc_config`. To describe each share, use the `didehpc::path_mapping` function, which takes arguments:

* `name`: a descriptive name for the share
* `path_local`: the point where the share is mounted on your computer
* `path_remote`: the network path that the share refers to (forward slashes are much easier to enter here than backward slashes)
* `drive_remote`: the drive this should be mapped to on the cluster

So to map your "M drive", which points at `\\fi--didef3.dide.ic.ac.uk\malaria`, to `M:` on the cluster you can write:

```r
share <- didehpc::path_mapping("malaria", "M:", "//fi--didef3.dide.ic.ac.uk/malaria", "M:")
config <- didehpc::didehpc_config(shares = share)
```

If you have more than one share to map, pass them through as a list (e.g., `didehpc::didehpc_config(shares = list(share1, share2, ...))`).

For most systems we hope that `didehpc` will do a reasonable job of detecting the shares that you are running on, so this should (hopefully) only be necessary for mapping additional shares. Note that you'll need to use absolute paths to refer to resources on those shares, which complicates things somewhat.

## Contexts

To recreate your work environment on the cluster, we use a package called `context`. This package works on the assumption that most working environments can be recreated by a combination of R packages and sourcing a set of function definitions.

### Root

Every context has a "root"; this is the directory that everything will be saved in. Most of the examples in the help use `contexts`, which is fairly self explanatory, but it can be any directory; generally it will be in the current directory.

```r
root <- "contexts"
```

This directory is going to get large over time and will eventually need to be deleted. Don't treat these as archival storage - more as long-lived temporary directories - and don't be afraid to create a new one and delete old ones when you've collected your results.

### Packages

If you list packages as a character vector then all packages will be installed for you, and they will also be *attached*; this is what happens when you use the function `library()`. So for example, if you need to depend on the `rstan` and `dplyr` packages you could write:

```r
ctx <- context::context_save(root, packages = c("rstan", "dplyr"))
```

Attaching packages is not always what is wanted, especially if you have packages that clobber functions in base packages (e.g., `dplyr`!). An alternative is to list a set of packages that you want installed and split them into packages you would like attached and packages you would only like loaded:

```r
packages <- list(loaded = "rstan", attached = "dplyr")
ctx <- context::context_save(root, packages = packages)
```

In this case, the packages in the `loaded` section will be installed (along with their dependencies) and, before anything runs, we will run `loadNamespace` on them to confirm that they are properly available. Access functions in these packages with the double-colon operator, like `rstan::stanc`; they will not be attached, so will not modify the search path. In contrast, packages listed in `attached` will be loaded with `library`, so they will be available without qualification (e.g., `select` and `dplyr::select` will both work).
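As a concrete (hypothetical) illustration of the difference, a function run against the context above must qualify `rstan` calls but can use `dplyr` functions directly:

```r
# Hypothetical task function for the context above
summarise_data <- function(d) {
  message("Stan version: ", rstan::stan_version()) # rstan is loaded only: qualify with ::
  select(d, where(is.numeric))                     # dplyr is attached: no :: needed
}
```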
### Source files for function definitions

If you define any of your own functions you will need to tell the cluster about them. The easiest way to do this is to save them in a file that contains only function definitions (and does not read data, etc). For example, I have a file `mysources.R` with a very simple simulation in it. Imagine this is some slow function that, given a starting point `x` and an integer `n_steps`, yields (after a bunch of calculation) a random walk of `n_steps` steps:

```r
random_walk <- function(x, n_steps) {
  ret <- numeric(n_steps)
  for (i in seq_len(n_steps)) {
    x <- rnorm(1, x)
    ret[[i]] <- x
  }
  ret
}
```

To set this up, we'd write:

```r
ctx <- context::context_save(root, sources = "mysources.R")
#> [ init:id   ]  6f2a9c5197415536c9916aa99763600b
#> [ init:db   ]  rds
#> [ init:path ]  contexts
#> [ save:id   ]  02a9261fe4e9e6554d76936ecb35cef0
#> [ save:name ]  counterterrorist_xuanhanosaurus
```

`sources` can be a character vector; use `NULL` or `character(0)` if you have no sources, or just omit it.

### Custom packages

If you depend on packages that are not on CRAN (e.g., your personal research code) you'll need to tell `context` where to find them with its `package_sources` argument. If the packages are on GitHub and public you can pass the GitHub username/repo pair, in `devtools` style:

```r
context::context_save(..., package_sources = conan::conan_sources("mrc-ide/dust"))
```

Like with `devtools` you can use subdirectories, specific commits or tags in the specification.
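For example, to pin a specific tag (a sketch; the tag name here is hypothetical):

```r
# devtools-style ref syntax; "v0.9.0" is a made-up tag for illustration
src <- conan::conan_sources("mrc-ide/dust@v0.9.0")
ctx <- context::context_save(root, sources = "mysources.R", package_sources = src)
```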
## Creating the queue

Once a context has been created, we can create a queue with it. This is separate from the actual cluster queue, but will be our interface to it. Running this step takes a while because it installs all the packages that the cluster will need into the context directory.

```r
obj <- didehpc::queue_didehpc(ctx)
#> Loading context 02a9261fe4e9e6554d76936ecb35cef0
#> [ context   ]  02a9261fe4e9e6554d76936ecb35cef0
#> [ library   ]
#> [ namespace ]
#> [ source    ]  mysources.R
#> Running installation script on cluster
#>        ,:\      /:.
#>       // \_()_/ \\
#>      ||   |  |   ||     CONAN THE LIBRARIAN
#>      ||   |  |   ||     Library: Q:\didehpc\20210817-145020\contexts\lib\windows\4.0
#>      ||   |____|  ||    Bootstrap: T:\conan\bootstrap\4.0
#>       \\ /  ||  \ //    Cache: Q:\didehpc\20210817-145020\contexts\conan\cache/pkg
#>        `:/  ||  \;'     Policy: lazy
#>             ||          Repos:
#>             ||            * https://mrc-ide.github.io/didehpc-pkgs
#>             XX            * https://cloud.r-project.org
#>             XX          Packages:
#>             XX            * context
#>             XX
#>             OO
#>             `'
#> i Loading metadata database
#> v Loading metadata database ... done
#>
#> i Getting 9 pkgs (5.29 MB) and 1 pkg with unknown size
#> v Got context 0.3.0 (source) (37.72 kB)
#> v Got ids 1.0.1 (windows) (123.89 kB)
#> v Got R6 2.5.0 (windows) (84.09 kB)
#> v Got askpass 1.1 (windows) (243.58 kB)
#> v Got crayon 1.4.1 (windows) (141.87 kB)
#> v Got digest 0.6.27 (windows) (268.65 kB)
#> v Got uuid 0.1-4 (windows) (33.77 kB)
#> v Got sys 3.4 (windows) (59.83 kB)
#> v Got openssl 1.4.4 (windows) (4.10 MB)
#> v Got storr 1.2.5 (windows) (401.33 kB)
#> v Installed R6 2.5.0 (594ms)
#> v Installed crayon 1.4.1 (766ms)
#> v Installed ids 1.0.1 (860ms)
#> v Installed askpass 1.1 (1s)
#> v Installed sys 3.4 (954ms)
#> v Installed digest 0.6.27 (1.2s)
#> v Installed storr 1.2.5 (1.3s)
#> v Installed uuid 0.1-4 (1s)
#> v Installed openssl 1.4.4 (1.5s)
#> i Building context 0.3.0
#> v Built context 0.3.0 (3.1s)
#> v Installed context 0.3.0 (344ms)
#> v Summary: 10 new in 12.7s
#> Done!
```

If the above command does not throw an error, then you have successfully logged in and the cluster is ready to use.

When you first run `queue_didehpc` it will install Windows versions of all required packages within the context directory (here, `contexts`). This is necessary even when you are on Windows, because the cluster cannot see files that are on your computer.

`obj` is an `R6` object - a bit like a Python or Java class instance, if you've come from those languages. The thing you need to know is that the object is like a list and contains a number of functions that can be run as `obj$functionname()`. These functions all act by *side effect*; they interact with a little database stored in the context root directory, or communicate with the cluster using the web interface that Wes created.

```r
obj
#> <queue_didehpc>
#>   Inherits from: <queue_base>
#>   Public:
#>     client: web_client, R6
#>     cluster_load: function (cluster = NULL, nodes = TRUE)
#>     config: didehpc_config
#>     context: context
#>     dide_id: function (task_ids)
#>     dide_log: function (task_id)
#>     enqueue: function (expr, envir = parent.frame(), submit = TRUE, name = NULL)
#>     enqueue_: function (expr, envir = parent.frame(), submit = TRUE, name = NULL)
#>     enqueue_bulk: function (X, FUN, ..., do_call = TRUE, envir = parent.frame(),
#>     initialize: function (context, config, root, initialise, provision, login,
#>     initialize_context: function ()
#>     install_packages: function (packages, repos = NULL, policy = "lazy", dryrun = FALSE,
#>     lapply: function (X, FUN, ..., envir = parent.frame(), timeout = 0, time_poll = 1,
#>     login: function (refresh = TRUE)
#>     mapply: function (FUN, ..., MoreArgs = NULL, envir = parent.frame(),
#>     provision_context: function (policy = "verylazy", dryrun = FALSE, quiet = FALSE,
#>     reconcile: function (task_ids = NULL)
#>     rrq_controller: function ()
#>     stop_workers: function (worker_ids = NULL)
#>     submit: function (task_ids, names = NULL)
#>     submit_workers: function (n, timeout = 600, progress = NULL)
#>     task_bundle_get: function (name)
#>     task_bundle_info: function ()
#>     task_bundle_list: function ()
#>     task_delete: function (task_ids)
#>     task_get: function (task_id, check_exists = TRUE)
#>     task_list: function ()
#>     task_result: function (task_id)
#>     task_status: function (task_ids = NULL, named = TRUE)
#>     task_times: function (task_ids = NULL, unit_elapsed = "secs", sorted = TRUE)
#>     unsubmit: function (task_ids)
#>   Private:
#>     data: list
#>     db: storr, R6
#>     lib: queue_library, R6
#>     provisioned: TRUE
#>     root: context_root
#>     submit_or_delete: function (task_ids, name = NULL)
```

For documentation about the individual methods, see the help for `didehpc::queue_didehpc` and in `queuer` (much of this needs writing still!). For example, to see the overall cluster load you can run:

```r
obj$cluster_load(TRUE)
#>            name free used total % used
#> --------------- ---- ---- ----- ------
#> fi--dideclusthn  134   82   216    38%
#>  fi--didemrchnb  730 1050  1780    59%
#>     wpia-hpc-hn  352    0   352     0%
#> --------------- ---- ---- ----- ------
#>         didehpc 1216 1132  2348    48%
```

(If you're on an ANSI-compatible terminal this will be in glorious Technicolor.)

## Testing that the queue works correctly

Before running a real job, let's test that everything works correctly by running the `sessionInfo` command on the cluster.
When run locally, `sessionInfo` prints information about the state of your R session:

```r
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#>
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=C
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> loaded via a namespace (and not attached):
#>  [1] zip_2.1.1         pillar_1.4.7      compiler_4.0.3    prettyunits_1.1.1
#>  [5] tools_4.0.3       digest_0.6.27     pkgbuild_1.2.0    uuid_0.1-4
#>  [9] lifecycle_0.2.0   jsonlite_1.7.1    evaluate_0.14     tibble_3.0.4
#> [13] pkgconfig_2.0.3   rlang_0.4.9       cli_2.2.0         filelock_1.0.2
#> [17] curl_4.3          conan_0.1.1       xfun_0.19         withr_2.3.0
#> [21] queuer_0.3.0      storr_1.2.5       httr_1.4.2        stringr_1.4.0
#> [25] knitr_1.30        vctrs_0.3.5       desc_1.2.0        askpass_1.1
#> [29] didehpc_0.3.6     context_0.3.0     rprojroot_2.0.2   glue_1.4.2
#> [33] R6_2.5.0          processx_3.4.5    rematch_1.0.1     fansi_0.4.1
#> [37] callr_3.5.1       magrittr_2.0.1    rematch2_2.1.2    ids_1.0.1
#> [41] pkgdepends_0.1.0  ps_1.4.0          ellipsis_0.3.1    assertthat_0.2.1
#> [45] lpSolve_5.6.15    stringi_1.5.3     openssl_1.4.3     crayon_1.3.4
```

To run this on the cluster, we wrap it in `obj$enqueue`. This prevents the evaluation of the expression and instead organises it to be run on the cluster:

```r
t <- obj$enqueue(sessionInfo())
```

We can then poll the cluster for results until it completes:

```r
t$wait(100)
#> (-) waiting for d826bc6...9e1, giving up in 99.5 s (\) waiting for
#> d826bc6...9e1, giving up in 98.9 s
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows Server 2012 R2 x64 (build 9600)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252
#> [2] LC_CTYPE=English_United Kingdom.1252
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.1252
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.0.3 R6_2.5.0       context_0.3.0  digest_0.6.27  storr_1.2.5
```

(See the next section for more information about this.) The important part to notice here is that the R "Platform" (the second and third lines) is Windows Server, as opposed to the host machine, which is running Linux. If we had added packages to the context they would be shown too.

## Running single jobs

Let's run something more interesting now by running the `random_walk` function defined in the `mysources.R` file. As above, jobs are queued by running:

```r
t <- obj$enqueue(random_walk(0, 10))
```

Like the queue object `obj`, task objects are R6 objects that can be used to get information and results back from the task.
```r
t
#> <queuer_task>
#>   Public:
#>     clone: function (deep = FALSE)
#>     context_id: function ()
#>     expr: function (locals = FALSE)
#>     id: c6cfa739a58b925f4919c4d4ebdead1e
#>     initialize: function (id, root, check_exists = TRUE)
#>     log: function (parse = TRUE)
#>     result: function (allow_incomplete = FALSE)
#>     root: context_root
#>     status: function ()
#>     times: function (unit_elapsed = "secs")
#>     wait: function (timeout, time_poll = 0.5, progress = NULL)
```

You can get the task's status:

```r
t$status()
#> [1] "PENDING"
```

...which will move from `PENDING` to `RUNNING` to `COMPLETE` or `ERROR`. You can get information on submission and running times, and you can try to get the result of running the task:

```r
t$result()
#> Error: task c6cfa739a58b925f4919c4d4ebdead1e is unfetchable: PENDING
```

This errors if the task is not yet complete. The `wait` method, used above, is like `result` but it will repeatedly poll for the task to be completed for up to `timeout` seconds:

```r
t$wait(100)
#> (-) waiting for c6cfa73...d1e, giving up in 99.5 s (\) waiting for
#> c6cfa73...d1e, giving up in 99.0 s
#> [1] -0.09103302 0.60818181 2.22847428 2.28615058 2.27823970 2.18390202
#> [7] 4.32781963 4.53920816 3.29061261 2.64571591
```

Once the task has completed, `t$result()` and `t$wait()` are equivalent:

```r
t$result()
#> [1] -0.09103302 0.60818181 2.22847428 2.28615058 2.27823970 2.18390202
#> [7] 4.32781963 4.53920816 3.29061261 2.64571591
```

You can query the times of your tasks:

```r
t$times()
#>                            task_id           submitted             started
#> 1 c6cfa739a58b925f4919c4d4ebdead1e 2021-08-17 14:50:47 2021-08-17 14:50:48
#>              finished   waiting    running      idle
#> 1 2021-08-17 14:50:48 0.7953935 0.07812715 0.7084134
```

which will show you when the task was submitted, started and stopped. Every task creates a log:

```r
t$log()
#> [ hello     ]  2021-08-17 14:50:47
#> [ wd        ]  Q:/didehpc/20210817-145020
#> [ init      ]  2021-08-17 14:50:47.946
#> [ hostname  ]  FI--DIDECLUST26
#> [ process   ]  3752
#> [ version   ]  0.3.0
#> [ open:db   ]  rds
#> [ context   ]  02a9261fe4e9e6554d76936ecb35cef0
#> [ library   ]
#> [ namespace ]
#> [ source    ]  mysources.R
#> [ parallel  ]  running as single core job
#> [ root      ]  Q:\didehpc\20210817-145020\contexts
#> [ context   ]  02a9261fe4e9e6554d76936ecb35cef0
#> [ task      ]  c6cfa739a58b925f4919c4d4ebdead1e
#> [ expr      ]  random_walk(0, 10)
#> [ start     ]  2021-08-17 14:50:48.102
#> [ ok        ]
#> [ end       ]  2021-08-17 14:50:48.227
```

Warning messages and other output will be printed here, so if you include `message()`, `cat()` or `print()` calls in your task they will appear between `start` and `end`.

There is another bit of log that happens before this and contains information about getting the system started up. You should only need to look at this when a job seems to get stuck with status `PENDING` for ages:

```r
obj$dide_log(t)
#>  [1] "generated on host: kea"
#>  [2] "generated on date: 2021-08-17"
#>  [3] "didehpc version: 0.3.6"
#>  [4] "context version: 0.3.0"
#>  [5] "running on: FI--DIDECLUST26"
#>  [6] "mapping Q: -> \\\\fi--san03.dide.ic.ac.uk\\homes\\rfitzjoh"
#>  [7] "The command completed successfully."
#>  [8] ""
#>  [9] "mapping T: -> \\\\fi--didef3.dide.ic.ac.uk\\tmp"
#> [10] "The command completed successfully."
#> [11] ""
#> [12] "Using Rtools at T:\\Rtools\\Rtools40"
#> [13] "working directory: Q:\\didehpc\\20210817-145020"
#> [14] "this is a single task"
#> [15] "logfile: Q:\\didehpc\\20210817-145020\\contexts\\logs\\c6cfa739a58b925f4919c4d4ebdead1e"
#> [16] ""
#> [17] "Q:\\didehpc\\20210817-145020>Rscript \"Q:\\didehpc\\20210817-145020\\contexts\\bin\\task_run\" \"Q:\\didehpc\\20210817-145020\\contexts\" c6cfa739a58b925f4919c4d4ebdead1e 1>\"Q:\\didehpc\\20210817-145020\\contexts\\logs\\c6cfa739a58b925f4919c4d4ebdead1e\" 2>&1"
#> [18] "Removing mapping Q:"
#> [19] "Q: was deleted successfully."
#> [20] ""
#> [21] "Removing mapping T:"
#> [22] "T: was deleted successfully."
#> [23] ""
#> [24] "Quitting"
```

The queue knows which tasks it has created and you can list them:

```r
obj$task_list()
#> [1] "c6cfa739a58b925f4919c4d4ebdead1e" "d826bc66ff934fbc9a76467e1f4f89e1"
```

The long identifiers are random and are long enough that collisions are unlikely.

Notice that the task ran remotely but we never had to indicate which filename things were written to. There is a small database based on [`storr`](https://richfitz.github.com/storr) that holds all the information within the context root (here, `contexts`). This means you can close down R, regenerate the `ctx` and `obj` objects later on, recreate the task objects, and re-get your results. But at the same time it provides the _illusion_ that the cluster has passed an object directly back to you.

```r
id <- t$id
id
#> [1] "c6cfa739a58b925f4919c4d4ebdead1e"
```

```r
t2 <- obj$task_get(id)
t2$result()
#> [1] -0.09103302 0.60818181 2.22847428 2.28615058 2.27823970 2.18390202
#> [7] 4.32781963 4.53920816 3.29061261 2.64571591
```

## Running many jobs

There are two broad options here:

1. Apply a function to each element of a list, similar to `lapply`, with `$lapply`
2. Apply a function to each row of a data.frame, perhaps using each column as a different argument, with `$enqueue_bulk`

The second approach is more general and `$lapply` is implemented using it.

Suppose we want to run a bunch of random walks of different lengths. This would involve mapping our `random_walk` function over a vector of sizes:

```r
sizes <- 3:8
grp <- obj$lapply(sizes, random_walk, x = 0)
#> Creating bundle: 'sound_nymph'
#> [ bulk      ]  Creating 6 tasks
#> submitting 6 tasks
#> submitting (-) [============>--------------------------] 33% | waited for 0s
#> submitting (\) [===================>-------------------] 50% | waited for 1s
#> submitting (|) [=========================>-------------] 67% | waited for 1s
#> submitting (/) [===============================>-------] 83% | waited for 2s
#> submitting (-) [=======================================] 100% | waited for 2s
```

By default, `$lapply` returns a "task bundle" with an automatically generated name. You can customise the name with the `name` argument. In contrast to `lapply`, this is not blocking (i.e., submitting tasks and collecting the results is done asynchronously), but if you pass a `timeout` argument to `$lapply` then it will poll until the jobs are done, in the same way as `wait()` below.
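For example (a sketch; both arguments work as just described), you can name the bundle yourself and block until the tasks finish:

```r
# "walks" is an arbitrary bundle name; timeout makes the call poll
# for completion (as with wait()) instead of returning immediately
grp <- obj$lapply(sizes, random_walk, x = 0, name = "walks", timeout = 120)
```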
Get the status of all the jobs:

```r
grp$status()
#> 8ca54f49ac82893d91c19a13891cb3ee c80f92cc3f3babc837a8ef18603db4f0
#>                       "COMPLETE"                       "COMPLETE"
#> 85935191f16cb04fd814f6df3297158c d5a841a1f7916c8a2eaf645143fe3359
#>                       "COMPLETE"                       "COMPLETE"
#> 22252a47ced26ab45e2d1387bdf4ecb8 5c4ae85ef913ce4e55010feafbb5c2a1
#>                        "PENDING"                        "PENDING"
```

Wait until they are all complete and get the results:

```r
res <- grp$wait(120)
#> (-) [==============================================] 100% | giving up in 119 s
```

The other bulk interface is where you want to run a function over a combination of parameters. Suppose we wanted to run random walks of a number of lengths from a number of starting positions, in all combinations. We might enumerate the possibilities like:

```r
pars <- expand.grid(x = c(-1, 0, 1), n_steps = c(5, 10))
```

We can submit this as a group of 6 jobs with `enqueue_bulk`. Here we add the `timeout` option, which makes this a blocking operation:

```r
obj$enqueue_bulk(pars, random_walk, timeout = 120)
#> Creating bundle: 'selfish_anemoneshrimp'
#> [ bulk      ]  Creating 6 tasks
#> submitting 6 tasks
#> submitting (-) [============>--------------------------] 33% | waited for 1s
#> submitting (\) [===================>-------------------] 50% | waited for 1s
#> submitting (|) [=========================>-------------] 67% | waited for 2s
#> submitting (/) [===============================>-------] 83% | waited for 3s
#> submitting (-) [=======================================] 100% | waited for 3s
#> (-) [==============================================] 100% | giving up in 119 s
#> [[1]]
#> [1] -0.5092372 -0.4486443 -0.6431714 0.1341441 -1.1818970
#>
#> [[2]]
#> [1] -1.18855584 -0.02861008 0.82356347 1.51119207 0.35416280
#>
#> [[3]]
#> [1] 1.850285 2.423607 1.597023 1.560501 2.187977
#>
#> [[4]]
#> [1] -1.0831528 -4.0599794 -2.9609470 -2.1744547 -0.1222570 -0.8703516
#> [7] -0.9264912 -0.5261700 1.2054075 0.2159083
#>
#> [[5]]
#> [1] 0.72165118 0.36474247 0.55938912 -0.01860092 0.61308581 -0.80447926
#> [7] 0.41558732 0.51387762 0.83248026 0.44125104
#>
#> [[6]]
#> [1] 2.0346518 3.2722265 3.4380578 5.1838941 3.7677278 1.9371159 1.5488215
#> [8] 1.7675202 0.7970567 1.1976480
```

This has applied the function `random_walk` over each row of `pars`.

## Cancelling and stopping jobs

Suppose you fire off a bunch of jobs and realise that you have the wrong data or they're all going to fail - you can stop them fairly easily. Here's a job that will run for an hour and return nothing:

```r
t <- obj$enqueue(Sys.sleep(3600))
```

Wait for the job to start up:

```r
while (t$status() == "PENDING") {
  Sys.sleep(0.5)
}
```

Now that it's started it can be cancelled with the `$unsubmit` method:

```r
obj$unsubmit(t$id)
#> [1] "OK"
```

Unsubmitting multiple times is safe, and will have no effect:

```r
obj$unsubmit(t$id)
#> [1] "NOT_RUNNING"
```

Alternatively you can use `obj$task_delete(t$id)`, which unsubmits the task and then deletes it. Note that unsubmitting alone does not delete the task (see below); you can still get at the expression:

```r
t$expr()
#> Sys.sleep(3600)
```

but you cannot retrieve results:

```r
t$result()
#> Error: task 485a1550fa19bb614d571e368d16164a is unfetchable: CANCELLED
```

The argument to `unsubmit` can be a vector. For example, if you had a task bundle `grp` you could unsubmit all members of the group with:

```r
obj$unsubmit(grp$ids)
```

### Deleting jobs

Deleting tasks is supported but it isn't entirely encouraged.
Not all of the functions behave well with missing tasks, so if you delete things and still have old task handles floating around you might get confusing results. There is a delete method (`obj$task_delete`) that will delete jobs, first unsubmitting them if they have been submitted. It takes a vector of task ids as an argument.

# Misc

## Parallel computation on the cluster

If you are running tasks that can use more than one core, you can request more resources for your task and use process-level parallelism with the `parallel` package. To request 8 cores, you could run:

```r
didehpc::didehpc_config(cores = 8)
```

When your task starts, 8 cores will be allocated to it and a `parallel` cluster will be created. You can use it with things like `parallel::parLapply`, specifying `cl` as `NULL`. So if within your cluster job you needed to apply the function `f` to each element of a list `x`, you could write:

```r
run_f <- function(x) {
  parallel::parLapply(NULL, x, f)
}
obj$enqueue(run_f(x))
```

The parallel bits can be embedded within larger blocks of code. All functions in `parallel` that take `cl` as a first argument can be used. You do not need to (and should not) set up the cluster yourself, as this will happen automatically as the job starts.

Alternatively, if you want to control cluster creation (e.g., you are using software that does this for you), pass `parallel = FALSE` to the config call:

```r
didehpc::didehpc_config(cores = 8, parallel = FALSE)
```

In this case you are responsible for setting up the cluster. As an alternative to requesting cores, you can use a different job template:

```r
didehpc::didehpc_config(template = "16Core")
```

which will reserve you an entire node. Again, a cluster will be started with all available cores unless you also specify `parallel = FALSE`.

## Running heaps of jobs without annoying your colleagues

If you have thousands and thousands of jobs to submit, you may not want to flood the cluster with them all at once. Each job submission is relatively slow (the HPC tools that the web interface has to use are relatively slow). The actual queue that the cluster uses doesn't seem to like processing tens of thousands of jobs, and can slow down. And if you take up the whole cluster someone may come and knock on your office door and complain at you. At the same time, batching your jobs up into little bits and manually sending them off is a pain and work better done by a computer.

An alternative is to submit a set of "workers" to the cluster, and then submit jobs to them. This is done with the [`rrq`](https://github.com/mrc-ide/rrq) package, along with a [`redis`](http://redis.io) server running on the cluster. See the "workers" vignette for details.

# Mapping network drives

For all operating systems, if you are on the wireless network you will need to connect to the VPN. If you can get on a wired network you'll likely have a better time, because the VPN and wireless network seem less stable in general. Instructions for setting up the VPN are [here](https://www1.imperial.ac.uk/publichealth/departments/ide/it/remote).

## Windows

Your network drives are likely already mapped for you. In fact you should not even need to map drives, as fully qualified network names (e.g. `//fi--didef3/tmp`) should work for you.

## Mac OS/X

In Finder, go to `Go -> Connect to Server...` or press `Command-K`. In the address field write the name of the share you want to connect to.
Useful ones are:

* `smb://fi--san03.dide.ic.ac.uk/homes/<username>` -- your home share
* `smb://fi--didef3.dide.ic.ac.uk/tmp` -- the temporary share

At some point in the process you should get prompted for your username and password, but I can't remember what that looks like.

These directories will be mounted at `/Volumes/homes` and `/Volumes/tmp` (so the last bit of the share name will be used as the mountpoint within `Volumes`). There may be a better way of doing this, and the connection will not be re-established automatically, so if anyone has a better way let me know.

## Linux

This is what I have done for my computer and it seems to work, though it's not incredibly fast. Full instructions are [on the Ubuntu community wiki](https://help.ubuntu.com/community/MountWindowsSharesPermanently).

First, install cifs-utils:

```
sudo apt-get install cifs-utils
```

In your `/etc/fstab` file, add:

```
//fi--san03/homes/<dide-username> <mount-point-home> cifs uid=<local-uid>,gid=<local-gid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp <mount-point-temp> cifs uid=<local-uid>,gid=<local-gid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,vers=2.0,sec=ntlmssp,iocharset=utf8 0 0
```

where:

- `<dide-username>` is your DIDE username, without the `DIDE\` bit.
- `<local-username>` is your local username (i.e., `echo $USER`).
- `<local-uid>` is your local numeric user id (i.e., `id -u $USER`)
- `<local-gid>` is your local numeric group id (i.e., `id -g $USER`)
- `<mount-point-home>` is where you want your DIDE home directory mounted
- `<mount-point-temp>` is where you want the DIDE temporary directory mounted

**Please back this file up before editing.**

So for example, I have:

```
//fi--san03/homes/rfitzjoh /home/rich/net/home cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp /home/rich/net/temp cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
```

The file `.smbcredentials` contains:

```
username=<dide-username>
password=<dide-password>
```

Set this file's permissions to 600 (`chmod 600 ~/.smbcredentials`) for a modicum of security, but be aware your password is stored in plaintext; this set up is clearly insecure. I believe if you omit the credentials line you can have the system prompt you for a password interactively, but I'm not sure how that works with automatic mounting.

Finally, run

```
sudo mount -a
```

to mount all drives, and with any luck it will all work and you won't have to do this again until you get a new computer.
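To check the mounts from R (a quick sketch; the paths assume the example mount points above - adjust to your own):

```r
# These paths assume the example fstab entries above
file.exists("~/net/home")  # should be TRUE once the mount succeeds
file.exists("~/net/temp")

# Then point didehpc at the mounted home share, as in the Configuration section
options(didehpc.home = "~/net/home")
```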