---
title: "Using didehpc to run cluster jobs"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Using didehpc to run cluster jobs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Parallel computing on a cluster can be more challenging than running things locally because it's often the first time that you need to package up code to run elsewhere, and when things go wrong it's more difficult to get information on why they failed. Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can't physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.

This set of packages ([`didehpc`](https://github.com/mrc-ide/didehpc), [`queuer`](https://github.com/mrc-ide/queuer) and [`context`](https://github.com/mrc-ide/context), along with the support packages [`conan`](https://github.com/mrc-ide/conan), [`rrq`](https://github.com/mrc-ide/rrq) and [`storr`](https://github.com/richfitz/storr)) aims to remove the pain of getting everything set up, getting cluster tasks running, and retrieving your results. Once everything is set up, running a job on the cluster should be as straightforward as running things locally.

The documentation here runs through a few of the key concepts, then walks through setting this all up. There's also a "quick start" guide that contains much less discussion.

## Functions

The biggest conceptual move is from thinking about running **scripts** that generate *files* to running **functions** that return *objects*. The reason for this is that it gives a well defined interface to build everything else around.

The problem with scripts is that they might do almost anything. They depend on untold files and packages which they load wherever. They produce any number of objects. That's fine, but it becomes hard to reason about them: to plan deploying them elsewhere, to capture the outputs appropriately, or to orchestrate looping over a bunch of parameter values. If you've found yourself writing a number of script files that differ only by text substitution of a few values, you have run into this.

In contrast, functions do (ideally) one thing. They have a well defined set of inputs (their arguments) and outputs (their return value). We can loop over a range of input values by iterating over a set of arguments, as the sketch below shows.

This set of packages tends to work best if you let it look after filenames. Rather than trying to come up with a naming scheme for different files based on parameter values, just return objects and the packages will arrange for them to be saved and reloaded.
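To make this concrete, here is a minimal sketch of the style this document encourages; `simulate_one` is a hypothetical stand-in for your own slow computation:

```r
# A hypothetical simulation: well defined inputs, one returned object
simulate_one <- function(sigma) {
  rnorm(100, sd = sigma)
}

# Looping over parameter values is just iteration over arguments;
# no filenames or text substitution required
results <- lapply(c(0.1, 1, 10), simulate_one)
```

Later sections show how `obj$lapply` applies this same pattern on the cluster.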
## Filesystems

The DIDE cluster needs everything to be available on a filesystem that the cluster can read. Practically this means the filesystems `//fi--didef3.dide.ic.ac.uk/tmp` or `//fi--san03.dide.ic.ac.uk/homes/username` and the like. You probably have access to network shares that are specific to a project, too. For Windows users these are probably mapped to drives (`Q:` or `T:` or similar) already, but for other platforms you will need to do a little extra work to get things set up (see below).

It is simplest if *everything* that is needed for a project is present in a single directory that is visible on the cluster, and for the rest of this document I will assume that everything is in one directory, which is on a network share.

**IMPORTANT**: If you are not sure if you are running on a network share, run `getwd()`; if you are on Windows the drive letter should show something like `Q:` or some other drive that represents a network drive. If it says `C:` or similar, *nothing below here will work*.

# Getting started

The initial setup will feel like a headache at first, but it should ultimately take only a few lines. Once everything is set up, the payback is that job submission becomes a lot simpler.

## Installation

Install the packages using [`drat`](https://cran.rstudio.com/package=drat):

```r
# install.packages("drat") # if you don't have it already
drat:::add("mrc-ide")
install.packages("didehpc")
```

Be sure to run this in a fresh session.

## Configuration

The configuration is handled in a two stage process. First, some bits that are machine specific are set using `options`, with option names prefixed with `didehpc`. Then, when a queue is created, further values can be passed along via the `config` argument, which will use the "global" options as defaults. The reason for this separation is that ideally the machine-specific options will not end up in scripts, because that makes things less portable (for example, we need to get your username, but your username is unlikely to work for your collaborators).

Ideally, in your `~/.Rprofile` file you will add something like:

```r
options(
  didehpc.username = "rfitzjoh",
  didehpc.home = "~/net/home")
```

and then set only options (such as cluster, cores or template) that vary with a project. If you use the "big" cluster, you can add `didehpc.cluster = "fi--didemrchnb"` here. (To set this up, try running `usethis::edit_r_profile()`.)

### Credentials

Windows users will not need to provide anything unless they are on a non-domain machine or they are in the unfortunate situation of juggling multiple usernames across systems. Non-domain machines will need the credentials set as above.

Mac users will need to provide their username, as above.

If you have a Linux system and have configured your smb mounts as described below, you might as well take advantage of this and set `credentials = "~/.smbcredentials"`, and you will never be prompted for your password:

```r
options(didehpc.credentials = "~/.smbcredentials")
```

### Seeing the default configuration

To see the configuration that will be used if you don't do anything (else), run:

```r
didehpc::didehpc_config()
#> <didehpc_config>
#>  - cluster: fi--dideclusthn
#>  - credentials:
#>     - username: rfitzjoh
#>     - password: *******************
#>  - username: rfitzjoh
#>  - resource:
#>     - template: GeneralNodes
#>     - parallel: FALSE
#>     - count: 1
#>     - type: Cores
#>  - shares:
#>     - home: (local) /home/rich/net/home => \\fi--san03.dide.ic.ac.uk\homes\rfitzjoh => Q: (remote)
#>     - temp: (local) /home/rich/net/temp => \\fi--didef3.dide.ic.ac.uk\tmp => T: (remote)
#>  - use_workers: FALSE
#>  - use_rrq: FALSE
#>  - worker_timeout: 600
#>  - conan_bootstrap: TRUE
#>  - r_version: 4.0.3
#>  - use_java: FALSE
#>  - redis_host: fi--dideclusthn.dide.ic.ac.uk
```

In here you can see the cluster (here, `fi--dideclusthn`), credentials and username, the job template (`GeneralNodes`), information about the resources that will be requested (1 core) and information on filesystem mappings. There are a few other bits of information that are explained further down. The possible options are explained further in the help for `didehpc::didehpc_config`.

If you request help, we will almost always want to see this!
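Project-specific settings are best passed to `didehpc::didehpc_config` directly rather than set globally. For example (a sketch; the values are illustrative), to use the "big" cluster and request more cores for one project only:

```r
# Machine-specific options (username etc) still come from ~/.Rprofile;
# only the project-specific bits are given here
config <- didehpc::didehpc_config(cluster = "fi--didemrchnb", cores = 4)
```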
### Additional shares

If you refer to network shares in your functions, e.g., to refer to data, you'll need to map these too. To do that, pass them as the `shares` argument to `didehpc::didehpc_config`. To describe each share, use the `didehpc::path_mapping` function, which takes arguments:

* `name`: a descriptive name for the share
* `path_local`: the point where the share is mounted on your computer
* `path_remote`: the network path that the share refers to (forward slashes are much easier to enter here than backward slashes)
* `drive_remote`: the drive this should be mapped to on the cluster

So to map your "M drive", which points at `\\fi--didef3.dide.ic.ac.uk\malaria`, to `M:` on the cluster you can write:

```r
share <- didehpc::path_mapping("malaria", "M:", "//fi--didef3.dide.ic.ac.uk/malaria", "M:")
config <- didehpc::didehpc_config(shares = share)
```

If you have more than one share to map, pass them through as a list (e.g., `didehpc::didehpc_config(shares = list(share1, share2, ...))`).

For most systems we hope that `didehpc` will do a reasonable job of detecting the shares that you are running on, so this should (hopefully) only be necessary for mapping additional shares. Note that you'll need to use absolute paths to refer to resources on those shares, which complicates things somewhat.

## Contexts

To recreate your work environment on the cluster, we use a package called `context`. This package works on the assumption that most working environments can be recreated by a combination of R packages and sourcing a set of function definitions.

### Root

Every context has a "root"; this is the directory that everything will be saved in. Most of the examples in the help use `contexts`, which is fairly self explanatory, but it can be any directory; generally it will be in the current directory.

```r
root <- "contexts"
```

This directory is going to get large over time and will eventually need to be deleted. Don't treat these as archival storage - more as long-lived temporary directories - and don't be afraid to create a new one and delete old ones when you've collected your results.

### Packages

If you list packages as a character vector then all packages will be installed for you, and they will also be *attached*; this is what happens when you use the function `library()`. So for example, if you need to depend on the `rstan` and `dplyr` packages you could write:

```r
ctx <- context::context_save(root, packages = c("rstan", "dplyr"))
```

Attaching packages is not always what is wanted, especially if you have packages that clobber functions in base packages (e.g., `dplyr`!). An alternative is to list a set of packages that you want installed and split them into packages you would like attached and packages you would only like loaded:

```r
packages <- list(loaded = "rstan", attached = "dplyr")
ctx <- context::context_save(root, packages = packages)
```

In this case, the packages in the `loaded` section will be installed (along with their dependencies) and, before anything runs, we will run `loadNamespace` on them to confirm that they are properly available. Access functions in these packages with the double-colon operator, like `rstan::stanc`; they will not be attached, so will not modify the search path. In contrast, packages listed in `attached` will be loaded with `library`, so they will be available without qualification (e.g., `select` and `dplyr::select` will both work).
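As a concrete (hypothetical) illustration of the difference, a function run against the context above must qualify `rstan` calls but can use `dplyr` functions directly:

```r
# Hypothetical task function for the context above
summarise_data <- function(d) {
  message("Stan version: ", rstan::stan_version()) # rstan is loaded only: qualify with ::
  select(d, where(is.numeric))                     # dplyr is attached: no :: needed
}
```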
### Source files for function definitions

If you define any of your own functions you will need to tell the cluster about them. The easiest way to do this is to save them in a file that contains only function definitions (and does not read data, etc). For example, I have a file `mysources.R` with a very simple simulation in it. Imagine this is some slow function that, given a starting point `x` and an integer `n_steps`, yields (after a bunch of calculation) a random walk of `n_steps` steps:

```r
random_walk <- function(x, n_steps) {
  ret <- numeric(n_steps)
  for (i in seq_len(n_steps)) {
    x <- rnorm(1, x)
    ret[[i]] <- x
  }
  ret
}
```

To set this up, we'd write:

```r
ctx <- context::context_save(root, sources = "mysources.R")
#> [ init:id   ]  6f2a9c5197415536c9916aa99763600b
#> [ init:db   ]  rds
#> [ init:path ]  contexts
#> [ save:id   ]  02a9261fe4e9e6554d76936ecb35cef0
#> [ save:name ]  counterterrorist_xuanhanosaurus
```

`sources` can be a character vector; use `NULL` or `character(0)` if you have no sources, or just omit it.

### Custom packages

If you depend on packages that are not on CRAN (e.g., your personal research code) you'll need to tell `context` where to find them with its `package_sources` argument. If the packages are on GitHub and public you can pass the GitHub username/repo pair, in `devtools` style:

```r
context::context_save(..., package_sources = conan::conan_sources("mrc-ide/dust"))
```

Like with `devtools` you can use subdirectories, specific commits or tags in the specification.
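For example, to pin a specific tag (a sketch; the tag name here is hypothetical):

```r
# devtools-style ref syntax; "v0.9.0" is a made-up tag for illustration
src <- conan::conan_sources("mrc-ide/dust@v0.9.0")
ctx <- context::context_save(root, sources = "mysources.R", package_sources = src)
```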
## Creating the queue

Once a context has been created, we can create a queue with it. This is separate from the actual cluster queue, but will be our interface to it. Running this step takes a while because it installs all the packages that the cluster will need into the context directory.

```r
obj <- didehpc::queue_didehpc(ctx)
#> Loading context 02a9261fe4e9e6554d76936ecb35cef0
#> [ context   ]  02a9261fe4e9e6554d76936ecb35cef0
#> [ library   ]
#> [ namespace ]
#> [ source    ]  mysources.R
#> Running installation script on cluster
#>        ,:\      /:.
#>       // \_()_/ \\
#>      ||   |  |   ||     CONAN THE LIBRARIAN
#>      ||   |  |   ||     Library: Q:\didehpc\20210817-145020\contexts\lib\windows\4.0
#>      ||   |____|  ||    Bootstrap: T:\conan\bootstrap\4.0
#>       \\ /  ||  \ //    Cache: Q:\didehpc\20210817-145020\contexts\conan\cache/pkg
#>        `:/  ||  \;'     Policy: lazy
#>             ||          Repos:
#>             ||            * https://mrc-ide.github.io/didehpc-pkgs
#>             XX            * https://cloud.r-project.org
#>             XX          Packages:
#>             XX            * context
#>             XX
#>             OO
#>             `'
#> i Loading metadata database
#> v Loading metadata database ... done
#>
#> i Getting 9 pkgs (5.29 MB) and 1 pkg with unknown size
#> v Got context 0.3.0 (source) (37.72 kB)
#> v Got ids 1.0.1 (windows) (123.89 kB)
#> v Got R6 2.5.0 (windows) (84.09 kB)
#> v Got askpass 1.1 (windows) (243.58 kB)
#> v Got crayon 1.4.1 (windows) (141.87 kB)
#> v Got digest 0.6.27 (windows) (268.65 kB)
#> v Got uuid 0.1-4 (windows) (33.77 kB)
#> v Got sys 3.4 (windows) (59.83 kB)
#> v Got openssl 1.4.4 (windows) (4.10 MB)
#> v Got storr 1.2.5 (windows) (401.33 kB)
#> v Installed R6 2.5.0 (594ms)
#> v Installed crayon 1.4.1 (766ms)
#> v Installed ids 1.0.1 (860ms)
#> v Installed askpass 1.1 (1s)
#> v Installed sys 3.4 (954ms)
#> v Installed digest 0.6.27 (1.2s)
#> v Installed storr 1.2.5 (1.3s)
#> v Installed uuid 0.1-4 (1s)
#> v Installed openssl 1.4.4 (1.5s)
#> i Building context 0.3.0
#> v Built context 0.3.0 (3.1s)
#> v Installed context 0.3.0 (344ms)
#> v Summary: 10 new in 12.7s
#> Done!
```

If the above command does not throw an error, then you have successfully logged in and the cluster is ready to use.

When you first run `queue_didehpc` it will install Windows versions of all required packages within the context directory (here, `contexts`). This is necessary even when you are on Windows, because the cluster cannot see files that are on your computer.

`obj` is an `R6` object - a bit like a Python or Java class instance, if you've come from those languages. The thing you need to know is that the object is like a list and contains a number of functions that can be run as `obj$functionname()`. These functions all act by *side effect*; they interact with a little database stored in the context root directory, or communicate with the cluster using the web interface that Wes created.

```r
obj
#> <queue_didehpc>
#>   Inherits from: <queue_base>
#>   Public:
#>     client: web_client, R6
#>     cluster_load: function (cluster = NULL, nodes = TRUE)
#>     config: didehpc_config
#>     context: context
#>     dide_id: function (task_ids)
#>     dide_log: function (task_id)
#>     enqueue: function (expr, envir = parent.frame(), submit = TRUE, name = NULL)
#>     enqueue_: function (expr, envir = parent.frame(), submit = TRUE, name = NULL)
#>     enqueue_bulk: function (X, FUN, ..., do_call = TRUE, envir = parent.frame(),
#>     initialize: function (context, config, root, initialise, provision, login,
#>     initialize_context: function ()
#>     install_packages: function (packages, repos = NULL, policy = "lazy", dryrun = FALSE,
#>     lapply: function (X, FUN, ..., envir = parent.frame(), timeout = 0, time_poll = 1,
#>     login: function (refresh = TRUE)
#>     mapply: function (FUN, ..., MoreArgs = NULL, envir = parent.frame(),
#>     provision_context: function (policy = "verylazy", dryrun = FALSE, quiet = FALSE,
#>     reconcile: function (task_ids = NULL)
#>     rrq_controller: function ()
#>     stop_workers: function (worker_ids = NULL)
#>     submit: function (task_ids, names = NULL)
#>     submit_workers: function (n, timeout = 600, progress = NULL)
#>     task_bundle_get: function (name)
#>     task_bundle_info: function ()
#>     task_bundle_list: function ()
#>     task_delete: function (task_ids)
#>     task_get: function (task_id, check_exists = TRUE)
#>     task_list: function ()
#>     task_result: function (task_id)
#>     task_status: function (task_ids = NULL, named = TRUE)
#>     task_times: function (task_ids = NULL, unit_elapsed = "secs", sorted = TRUE)
#>     unsubmit: function (task_ids)
#>   Private:
#>     data: list
#>     db: storr, R6
#>     lib: queue_library, R6
#>     provisioned: TRUE
#>     root: context_root
#>     submit_or_delete: function (task_ids, name = NULL)
```

For documentation about the individual methods, see the help for `didehpc::queue_didehpc` and in `queuer` (much of this needs writing still!). For example, to see the overall cluster load you can run:

```r
obj$cluster_load(TRUE)
#>            name free used total % used
#> --------------- ---- ---- ----- ------
#> fi--dideclusthn  134   82   216    38%
#>  fi--didemrchnb  730 1050  1780    59%
#>     wpia-hpc-hn  352    0   352     0%
#> --------------- ---- ---- ----- ------
#>         didehpc 1216 1132  2348    48%
```

(If you're on an ANSI-compatible terminal this will be in glorious Technicolor.)

## Testing that the queue works correctly

Before running a real job, let's test that everything works correctly by running the `sessionInfo` command on the cluster.
When run locally, `sessionInfo` prints information about the state of your R session:

```r
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#>
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=C
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> loaded via a namespace (and not attached):
#>  [1] zip_2.1.1         pillar_1.4.7      compiler_4.0.3    prettyunits_1.1.1
#>  [5] tools_4.0.3       digest_0.6.27     pkgbuild_1.2.0    uuid_0.1-4
#>  [9] lifecycle_0.2.0   jsonlite_1.7.1    evaluate_0.14     tibble_3.0.4
#> [13] pkgconfig_2.0.3   rlang_0.4.9       cli_2.2.0         filelock_1.0.2
#> [17] curl_4.3          conan_0.1.1       xfun_0.19         withr_2.3.0
#> [21] queuer_0.3.0      storr_1.2.5       httr_1.4.2        stringr_1.4.0
#> [25] knitr_1.30        vctrs_0.3.5       desc_1.2.0        askpass_1.1
#> [29] didehpc_0.3.6     context_0.3.0     rprojroot_2.0.2   glue_1.4.2
#> [33] R6_2.5.0          processx_3.4.5    rematch_1.0.1     fansi_0.4.1
#> [37] callr_3.5.1       magrittr_2.0.1    rematch2_2.1.2    ids_1.0.1
#> [41] pkgdepends_0.1.0  ps_1.4.0          ellipsis_0.3.1    assertthat_0.2.1
#> [45] lpSolve_5.6.15    stringi_1.5.3     openssl_1.4.3     crayon_1.3.4
```

To run this on the cluster, we wrap it in `obj$enqueue`. This prevents the evaluation of the expression and instead organises it to be run on the cluster:

```r
t <- obj$enqueue(sessionInfo())
```

We can then poll the cluster for results until it completes:

```r
t$wait(100)
#> (-) waiting for d826bc6...9e1, giving up in 99.5 s (\) waiting for
#> d826bc6...9e1, giving up in 98.9 s
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows Server 2012 R2 x64 (build 9600)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252
#> [2] LC_CTYPE=English_United Kingdom.1252
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.1252
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.0.3 R6_2.5.0       context_0.3.0  digest_0.6.27  storr_1.2.5
```

(See the next section for more information about this.) The important part to notice here is that the R "Platform" (the second and third lines) is Windows Server, as opposed to the host machine, which is running Linux. If we had added packages to the context they would be shown too.

## Running single jobs

Let's run something more interesting now by running the `random_walk` function defined in the `mysources.R` file. As above, jobs are queued by running:

```r
t <- obj$enqueue(random_walk(0, 10))
```

Like the queue object `obj`, task objects are R6 objects that can be used to get information and results back from the task.
```r
t
#> <queuer_task>
#>   Public:
#>     clone: function (deep = FALSE)
#>     context_id: function ()
#>     expr: function (locals = FALSE)
#>     id: c6cfa739a58b925f4919c4d4ebdead1e
#>     initialize: function (id, root, check_exists = TRUE)
#>     log: function (parse = TRUE)
#>     result: function (allow_incomplete = FALSE)
#>     root: context_root
#>     status: function ()
#>     times: function (unit_elapsed = "secs")
#>     wait: function (timeout, time_poll = 0.5, progress = NULL)
```

You can get the task's status:

```r
t$status()
#> [1] "PENDING"
```

...which will move from `PENDING` to `RUNNING` to `COMPLETE` or `ERROR`. You can get information on submission and running times, and you can try to get the result of running the task:

```r
t$result()
#> Error: task c6cfa739a58b925f4919c4d4ebdead1e is unfetchable: PENDING
```

This errors if the task is not yet complete. The `wait` method, used above, is like `result` but it will repeatedly poll for the task to be completed for up to `timeout` seconds:

```r
t$wait(100)
#> (-) waiting for c6cfa73...d1e, giving up in 99.5 s (\) waiting for
#> c6cfa73...d1e, giving up in 99.0 s
#> [1] -0.09103302 0.60818181 2.22847428 2.28615058 2.27823970 2.18390202
#> [7] 4.32781963 4.53920816 3.29061261 2.64571591
```

Once the task has completed, `t$result()` and `t$wait()` are equivalent:

```r
t$result()
#> [1] -0.09103302 0.60818181 2.22847428 2.28615058 2.27823970 2.18390202
#> [7] 4.32781963 4.53920816 3.29061261 2.64571591
```

You can query the times of your tasks:

```r
t$times()
#>                            task_id           submitted             started
#> 1 c6cfa739a58b925f4919c4d4ebdead1e 2021-08-17 14:50:47 2021-08-17 14:50:48
#>              finished   waiting    running      idle
#> 1 2021-08-17 14:50:48 0.7953935 0.07812715 0.7084134
```

which will show you when the task was submitted, started and stopped. Every task creates a log:

```r
t$log()
#> [ hello     ]  2021-08-17 14:50:47
#> [ wd        ]  Q:/didehpc/20210817-145020
#> [ init      ]  2021-08-17 14:50:47.946
#> [ hostname  ]  FI--DIDECLUST26
#> [ process   ]  3752
#> [ version   ]  0.3.0
#> [ open:db   ]  rds
#> [ context   ]  02a9261fe4e9e6554d76936ecb35cef0
#> [ library   ]
#> [ namespace ]
#> [ source    ]  mysources.R
#> [ parallel  ]  running as single core job
#> [ root      ]  Q:\didehpc\20210817-145020\contexts
#> [ context   ]  02a9261fe4e9e6554d76936ecb35cef0
#> [ task      ]  c6cfa739a58b925f4919c4d4ebdead1e
#> [ expr      ]  random_walk(0, 10)
#> [ start     ]  2021-08-17 14:50:48.102
#> [ ok        ]
#> [ end       ]  2021-08-17 14:50:48.227
```

Warning messages and other output will be printed here, so if you include `message()`, `cat()` or `print()` calls in your task they will appear between `start` and `end`.

There is another bit of log that happens before this and contains information about getting the system started up. You should only need to look at this when a job seems to get stuck with status `PENDING` for ages:

```r
obj$dide_log(t)
#>  [1] "generated on host: kea"
#>  [2] "generated on date: 2021-08-17"
#>  [3] "didehpc version: 0.3.6"
#>  [4] "context version: 0.3.0"
#>  [5] "running on: FI--DIDECLUST26"
#>  [6] "mapping Q: -> \\\\fi--san03.dide.ic.ac.uk\\homes\\rfitzjoh"
#>  [7] "The command completed successfully."
#>  [8] ""
#>  [9] "mapping T: -> \\\\fi--didef3.dide.ic.ac.uk\\tmp"
#> [10] "The command completed successfully."
#> [11] ""
#> [12] "Using Rtools at T:\\Rtools\\Rtools40"
#> [13] "working directory: Q:\\didehpc\\20210817-145020"
#> [14] "this is a single task"
#> [15] "logfile: Q:\\didehpc\\20210817-145020\\contexts\\logs\\c6cfa739a58b925f4919c4d4ebdead1e"
#> [16] ""
#> [17] "Q:\\didehpc\\20210817-145020>Rscript \"Q:\\didehpc\\20210817-145020\\contexts\\bin\\task_run\" \"Q:\\didehpc\\20210817-145020\\contexts\" c6cfa739a58b925f4919c4d4ebdead1e 1>\"Q:\\didehpc\\20210817-145020\\contexts\\logs\\c6cfa739a58b925f4919c4d4ebdead1e\" 2>&1"
#> [18] "Removing mapping Q:"
#> [19] "Q: was deleted successfully."
#> [20] ""
#> [21] "Removing mapping T:"
#> [22] "T: was deleted successfully."
#> [23] ""
#> [24] "Quitting"
```

The queue knows which tasks it has created and you can list them:

```r
obj$task_list()
#> [1] "c6cfa739a58b925f4919c4d4ebdead1e" "d826bc66ff934fbc9a76467e1f4f89e1"
```

The long identifiers are random and are long enough that collisions are unlikely.

Notice that the task ran remotely but we never had to indicate which filename things were written to. There is a small database based on [`storr`](https://richfitz.github.com/storr) that holds all the information within the context root (here, `contexts`). This means you can close down R, regenerate the `ctx` and `obj` objects later on, recreate the task objects, and re-get your results. But at the same time it provides the _illusion_ that the cluster has passed an object directly back to you.

```r
id <- t$id
id
#> [1] "c6cfa739a58b925f4919c4d4ebdead1e"
```

```r
t2 <- obj$task_get(id)
t2$result()
#> [1] -0.09103302 0.60818181 2.22847428 2.28615058 2.27823970 2.18390202
#> [7] 4.32781963 4.53920816 3.29061261 2.64571591
```

## Running many jobs

There are two broad options here:

1. Apply a function to each element of a list, similar to `lapply`, with `$lapply`
2. Apply a function to each row of a data.frame, perhaps using each column as a different argument, with `$enqueue_bulk`

The second approach is more general and `$lapply` is implemented using it.

Suppose we want to run a bunch of random walks of different lengths. This would involve mapping our `random_walk` function over a vector of sizes:

```r
sizes <- 3:8
grp <- obj$lapply(sizes, random_walk, x = 0)
#> Creating bundle: 'sound_nymph'
#> [ bulk      ]  Creating 6 tasks
#> submitting 6 tasks
#> submitting (-) [============>--------------------------] 33% | waited for 0s
#> submitting (\) [===================>-------------------] 50% | waited for 1s
#> submitting (|) [=========================>-------------] 67% | waited for 1s
#> submitting (/) [===============================>-------] 83% | waited for 2s
#> submitting (-) [=======================================] 100% | waited for 2s
```

By default, `$lapply` returns a "task bundle" with an automatically generated name. You can customise the name with the `name` argument. In contrast to `lapply`, this is not blocking (i.e., submitting tasks and collecting the results is done asynchronously), but if you pass a `timeout` argument to `$lapply` then it will poll until the jobs are done, in the same way as `wait()` below.
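For example (a sketch; both arguments work as just described), you can name the bundle yourself and block until the tasks finish:

```r
# "walks" is an arbitrary bundle name; timeout makes the call poll
# for completion (as with wait()) instead of returning immediately
grp <- obj$lapply(sizes, random_walk, x = 0, name = "walks", timeout = 120)
```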
Get the status of all the jobs:

```r
grp$status()
#> 8ca54f49ac82893d91c19a13891cb3ee c80f92cc3f3babc837a8ef18603db4f0
#>                       "COMPLETE"                       "COMPLETE"
#> 85935191f16cb04fd814f6df3297158c d5a841a1f7916c8a2eaf645143fe3359
#>                       "COMPLETE"                       "COMPLETE"
#> 22252a47ced26ab45e2d1387bdf4ecb8 5c4ae85ef913ce4e55010feafbb5c2a1
#>                        "PENDING"                        "PENDING"
```

Wait until they are all complete and get the results:

```r
res <- grp$wait(120)
#> (-) [==============================================] 100% | giving up in 119 s
```

The other bulk interface is where you want to run a function over a combination of parameters. Suppose we wanted to run random walks of a number of lengths from a number of starting positions, in all combinations. We might enumerate the possibilities like:

```r
pars <- expand.grid(x = c(-1, 0, 1), n_steps = c(5, 10))
```

We can submit this as a group of 6 jobs with `enqueue_bulk`. Here we add the `timeout` option, which makes this a blocking operation:

```r
obj$enqueue_bulk(pars, random_walk, timeout = 120)
#> Creating bundle: 'selfish_anemoneshrimp'
#> [ bulk      ]  Creating 6 tasks
#> submitting 6 tasks
#> submitting (-) [============>--------------------------] 33% | waited for 1s
#> submitting (\) [===================>-------------------] 50% | waited for 1s
#> submitting (|) [=========================>-------------] 67% | waited for 2s
#> submitting (/) [===============================>-------] 83% | waited for 3s
#> submitting (-) [=======================================] 100% | waited for 3s
#> (-) [==============================================] 100% | giving up in 119 s
#> [[1]]
#> [1] -0.5092372 -0.4486443 -0.6431714 0.1341441 -1.1818970
#>
#> [[2]]
#> [1] -1.18855584 -0.02861008 0.82356347 1.51119207 0.35416280
#>
#> [[3]]
#> [1] 1.850285 2.423607 1.597023 1.560501 2.187977
#>
#> [[4]]
#> [1] -1.0831528 -4.0599794 -2.9609470 -2.1744547 -0.1222570 -0.8703516
#> [7] -0.9264912 -0.5261700 1.2054075 0.2159083
#>
#> [[5]]
#> [1] 0.72165118 0.36474247 0.55938912 -0.01860092 0.61308581 -0.80447926
#> [7] 0.41558732 0.51387762 0.83248026 0.44125104
#>
#> [[6]]
#> [1] 2.0346518 3.2722265 3.4380578 5.1838941 3.7677278 1.9371159 1.5488215
#> [8] 1.7675202 0.7970567 1.1976480
```

This has applied the function `random_walk` over each row of `pars`.

## Cancelling and stopping jobs

Suppose you fire off a bunch of jobs and realise that you have the wrong data or they're all going to fail - you can stop them fairly easily. Here's a job that will run for an hour and return nothing:

```r
t <- obj$enqueue(Sys.sleep(3600))
```

Wait for the job to start up:

```r
while (t$status() == "PENDING") {
  Sys.sleep(0.5)
}
```

Now that it's started it can be cancelled with the `$unsubmit` method:

```r
obj$unsubmit(t$id)
#> [1] "OK"
```

Unsubmitting multiple times is safe, and will have no effect:

```r
obj$unsubmit(t$id)
#> [1] "NOT_RUNNING"
```

Alternatively you can use `obj$task_delete(t$id)`, which unsubmits the task and then deletes it. Note that unsubmitting alone does not delete the task (see below); you can still get at the expression:

```r
t$expr()
#> Sys.sleep(3600)
```

but you cannot retrieve results:

```r
t$result()
#> Error: task 485a1550fa19bb614d571e368d16164a is unfetchable: CANCELLED
```

The argument to `unsubmit` can be a vector. For example, if you had a task bundle `grp` you could unsubmit all members of the group with:

```r
obj$unsubmit(grp$ids)
```

### Deleting jobs

Deleting tasks is supported but it isn't entirely encouraged.
Not all of the functions behave well with missing tasks, so if you delete things and still have old task handles floating around you might get confusing results. There is a delete method (`obj$task_delete`) that will delete jobs, first unsubmitting them if they have been submitted. It takes a vector of task ids as an argument.

# Misc

## Parallel computation on the cluster

If you are running tasks that can use more than one core, you can request more resources for your task and use process-level parallelism with the `parallel` package. To request 8 cores, you could run:

```r
didehpc::didehpc_config(cores = 8)
```

When your task starts, 8 cores will be allocated to it and a `parallel` cluster will be created. You can use it with things like `parallel::parLapply`, specifying `cl` as `NULL`. So if within your cluster job you needed to apply the function `f` to each element of a list `x`, you could write:

```r
run_f <- function(x) {
  parallel::parLapply(NULL, x, f)
}
obj$enqueue(run_f(x))
```

The parallel bits can be embedded within larger blocks of code. All functions in `parallel` that take `cl` as a first argument can be used. You do not need to (and should not) set up the cluster yourself, as this will happen automatically as the job starts.

Alternatively, if you want to control cluster creation (e.g., you are using software that does this for you), pass `parallel = FALSE` to the config call:

```r
didehpc::didehpc_config(cores = 8, parallel = FALSE)
```

In this case you are responsible for setting up the cluster. As an alternative to requesting cores, you can use a different job template:

```r
didehpc::didehpc_config(template = "16Core")
```

which will reserve you an entire node. Again, a cluster will be started with all available cores unless you also specify `parallel = FALSE`.

## Running heaps of jobs without annoying your colleagues

If you have thousands and thousands of jobs to submit, you may not want to flood the cluster with them all at once. Each job submission is relatively slow (the HPC tools that the web interface has to use are relatively slow). The actual queue that the cluster uses doesn't seem to like processing tens of thousands of jobs, and can slow down. And if you take up the whole cluster someone may come and knock on your office door and complain at you. At the same time, batching your jobs up into little bits and manually sending them off is a pain and work better done by a computer.

An alternative is to submit a set of "workers" to the cluster, and then submit jobs to them. This is done with the [`rrq`](https://github.com/mrc-ide/rrq) package, along with a [`redis`](http://redis.io) server running on the cluster. See the "workers" vignette for details.

# Mapping network drives

For all operating systems, if you are on the wireless network you will need to connect to the VPN. If you can get on a wired network you'll likely have a better time, because the VPN and wireless network seem less stable in general. Instructions for setting up the VPN are [here](https://www1.imperial.ac.uk/publichealth/departments/ide/it/remote).

## Windows

Your network drives are likely already mapped for you. In fact you should not even need to map drives, as fully qualified network names (e.g. `//fi--didef3/tmp`) should work for you.

## Mac OS/X

In Finder, go to `Go -> Connect to Server...` or press `Command-K`. In the address field write the name of the share you want to connect to.
Useful ones are:

* `smb://fi--san03.dide.ic.ac.uk/homes/<username>` -- your home share
* `smb://fi--didef3.dide.ic.ac.uk/tmp` -- the temporary share

At some point in the process you should get prompted for your username and password, but I can't remember what that looks like.

These directories will be mounted at `/Volumes/homes` and `/Volumes/tmp` (so the last bit of the share name will be used as the mountpoint within `Volumes`). There may be a better way of doing this, and the connection will not be re-established automatically, so if anyone has a better way let me know.

## Linux

This is what I have done for my computer and it seems to work, though it's not incredibly fast. Full instructions are [on the Ubuntu community wiki](https://help.ubuntu.com/community/MountWindowsSharesPermanently).

First, install cifs-utils:

```
sudo apt-get install cifs-utils
```

In your `/etc/fstab` file, add:

```
//fi--san03/homes/<dide-username> <mount-point-home> cifs uid=<local-uid>,gid=<local-gid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp <mount-point-temp> cifs uid=<local-uid>,gid=<local-gid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,vers=2.0,sec=ntlmssp,iocharset=utf8 0 0
```

where:

- `<dide-username>` is your DIDE username, without the `DIDE\` bit.
- `<local-username>` is your local username (i.e., `echo $USER`).
- `<local-uid>` is your local numeric user id (i.e., `id -u $USER`)
- `<local-gid>` is your local numeric group id (i.e., `id -g $USER`)
- `<mount-point-home>` is where you want your DIDE home directory mounted
- `<mount-point-temp>` is where you want the DIDE temporary directory mounted

**Please back this file up before editing.**

So for example, I have:

```
//fi--san03/homes/rfitzjoh /home/rich/net/home cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp /home/rich/net/temp cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
```

The file `.smbcredentials` contains:

```
username=<dide-username>
password=<dide-password>
```

Set this file's permissions to 600 (`chmod 600 ~/.smbcredentials`) for a modicum of security, but be aware your password is stored in plaintext; this set up is clearly insecure. I believe if you omit the credentials line you can have the system prompt you for a password interactively, but I'm not sure how that works with automatic mounting.

Finally, run

```
sudo mount -a
```

to mount all drives, and with any luck it will all work and you won't have to do this again until you get a new computer.
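To check the mounts from R (a quick sketch; the paths assume the example mount points above - adjust to your own):

```r
# These paths assume the example fstab entries above
file.exists("~/net/home")  # should be TRUE once the mount succeeds
file.exists("~/net/temp")

# Then point didehpc at the mounted home share, as in the Configuration section
options(didehpc.home = "~/net/home")
```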