Parallel computing on a cluster can be more challenging than running things locally because it’s often the first time that you need to package up code to run elsewhere, and when things go wrong it’s more difficult to get information on why things failed.
Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can’t physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.
This set of packages (didehpc, queuer and context, along with a couple of support packages: conan, rrq and storr) aims to remove the pain of getting everything set up, getting cluster tasks running, and retrieving your results.
Once everything is set up, running a job on the cluster should be as straightforward as running things locally.
The documentation here runs through a few of the key concepts, then walks through setting this all up. There’s also a “quick start” guide that contains much less discussion.
The biggest conceptual move is from thinking about running scripts that generate files to running functions that return objects. The reason for this is that it gives a well defined interface to build everything else around.
The problem with scripts is that they might do almost anything. They depend on untold files and packages which they load wherever. They produce any number of objects. That’s fine, but it becomes hard to reason about them: to plan deploying them elsewhere, to capture the outputs appropriately, or to orchestrate looping over a bunch of parameter values. If you’ve found yourself writing a number of script files, changing values with text substitution, you have run into this.
In contrast, functions do (ideally) one thing. They have a well defined set of inputs (their arguments) and outputs (their return value). We can loop over a range of input values by iterating over a set of arguments.
This set of packages tends to work best if you let it look after filenames. Rather than trying to come up with a naming scheme for different files based on parameter values, just return objects and the packages will arrange for them to be saved and reloaded.
The DIDE cluster needs everything to be available on a filesystem that the cluster can read. Practically this means the filesystems //fi--didef3.dide.ic.ac.uk/tmp or //fi--san03.dide.ic.ac.uk/homes/username and the like. You probably have access to network shares that are specific to a project, too. For Windows users these are probably mapped to drives (Q: or T: or similar) already, but for other platforms you will need to do a little extra work to get things set up (see below).
It is simplest if everything that is needed for a project is present in a single directory that is visible on the cluster. For most of this document I will assume that everything is in one directory, which is on a network share.
IMPORTANT: If you are not sure if you are running on a network share, run getwd(); if you are on Windows the drive letter should show something like Q: or some other drive that represents a network drive. If it says C: or similar, nothing below here will work.
The initial setup will feel like a headache at first, but it should ultimately take only a few lines. Once everything is set up, the payback is that the job submission part becomes a lot simpler.
Install the packages using drat:
# install.packages("drat") # if you don't have it already
drat:::add("mrc-ide")
install.packages("didehpc")
Be sure to run this in a fresh session.
The configuration is handled in a two stage process. First, some bits that are machine specific are set using options, with option names that are prefixed with didehpc. Then when a queue is created, further values can be passed along via the config argument, which will use the “global” options as a default.
The reason for this separation is that ideally the machine-specific options will not end up in scripts, because that makes things less portable (for example, we need to get your username, but your username is unlikely to work for your collaborators).
Ideally in your ~/.Rprofile file, you will add something like:
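# A sketch; didehpc.username follows the "didehpc" option prefix convention
# described above, and "yourusername" is a placeholder for your DIDE username
options(didehpc.username = "yourusername")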
and then set only options (such as cluster and cores or template) that vary with a project.
If you use the “big” cluster, you can add didehpc.cluster = "fi--didemrchnb" here. (To set this up, try running usethis::edit_r_profile().)
Windows users will not need to provide anything unless they are on a non-domain machine or they are in the unfortunate situation of juggling multiple usernames across systems. Non-domain machines will need the credentials set as above.
Mac users will need to provide their username here as above.
If you have a Linux system and have configured your smb mounts as described below, you might as well take advantage of this and set credentials = "~/.smbcredentials" so you will never be prompted for your password:
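# A sketch, again using the "didehpc" option prefix; the path points at the
# credentials file described in the Linux section below
options(didehpc.credentials = "~/.smbcredentials")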
To see the configuration that will be run if you don’t do anything (else), run:
didehpc::didehpc_config()
#> <didehpc_config>
#> - cluster: fi--dideclusthn
#> - credentials:
#> - username: rfitzjoh
#> - password: *******************
#> - username: rfitzjoh
#> - resource:
#> - template: GeneralNodes
#> - parallel: FALSE
#> - count: 1
#> - type: Cores
#> - shares:
#> - home: (local) /home/rich/net/home => \\fi--san03.dide.ic.ac.uk\homes\rfitzjoh => Q: (remote)
#> - temp: (local) /home/rich/net/temp => \\fi--didef3.dide.ic.ac.uk\tmp => T: (remote)
#> - use_workers: FALSE
#> - use_rrq: FALSE
#> - worker_timeout: 600
#> - conan_bootstrap: TRUE
#> - r_version: 4.0.3
#> - use_java: FALSE
#> - redis_host: fi--dideclusthn.dide.ic.ac.uk
In here you can see the cluster (here, fi--dideclusthn), credentials and username, the job template (GeneralNodes), information about the resources that will be requested (1 core) and information on filesystem mappings. There are a few other bits of information that may be explained further down. The possible options are explained further in the help for didehpc::didehpc_config. If you request help, we will almost always want to see this!
To recreate your work environment on the cluster, we use a package called context. This package uses the assumption that most working environments can be recreated by a combination of R packages and sourcing a set of function definitions.
Every context has a “root”; this is the directory that everything will be saved in. Most of the examples in the help use contexts as the root, which is fairly self-explanatory, but it can be any directory. Generally it will be in the current directory.
This directory is going to get large over time and will eventually need to be deleted. Don’t treat these as archival storage - more as long-lived temporary directories and don’t be afraid to create a new one and delete old ones when you’ve collected your results.
If you list packages as a character vector then all packages will be installed for you, and they will also be attached; this is what happens when you use the function library(). So for example, if you need to depend on the rstan and dplyr packages you could write:
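root <- "contexts" # the context root directory, as described above
ctx <- context::context_save(root, packages = c("rstan", "dplyr"))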
Attaching packages is not always what is wanted, especially if you have packages that clobber functions in base packages (e.g., dplyr!). An alternative is to list a set of packages that you want installed and split them into packages you would like attached and packages you would only like loaded:
packages <- list(loaded = "dplyr", attached = "rstan")
ctx <- context::context_save(root, packages = packages)
In this case, the packages in the loaded section will be installed (along with their dependencies) and before anything runs, we will run loadNamespace on them to confirm that they are properly available. Access functions in these packages with the double-colon operator, like dplyr::select. However they will not be attached, so will not modify the search path.
In contrast, packages listed in attached will be loaded with library, so they will be available without qualification (e.g., stanc and rstan::stanc will both work).
If you define any of your own functions you will need to tell the cluster about them. The easiest way to do this is to save them in a file that contains only function definitions (and does not read data, etc).
For example, I have a file mysources.R with a very simple simulation in it. Imagine this is some slow function that, given an integer n_steps, after a bunch of calculation yields a random walk of n_steps steps starting from point x:
random_walk <- function(x, n_steps) {
  ret <- numeric(n_steps)
  for (i in seq_len(n_steps)) {
    x <- rnorm(1, x) # take one normally distributed step away from x
    ret[[i]] <- x
  }
  ret
}
To set this up, we’d write:
ctx <- context::context_save(root, sources = "mysources.R")
#> [ init:id ] 6f2a9c5197415536c9916aa99763600b
#> [ init:db ] rds
#> [ init:path ] contexts
#> [ save:id ] 02a9261fe4e9e6554d76936ecb35cef0
#> [ save:name ] counterterrorist_xuanhanosaurus
sources can be a character vector, or NULL or character(0) if you have no sources (or just omit it).
If you depend on packages that are not on CRAN (e.g., your personal research code) you’ll need to tell context where to find them with its package_sources argument. If the packages are on GitHub and public you can pass the GitHub username/repo pair, in devtools style:
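# A sketch assuming the conan-based interface used by this version of
# context; "myusername/mypackage" is a hypothetical GitHub repo
ctx <- context::context_save(
  root,
  packages = "mypackage",
  package_sources = conan::conan_sources("myusername/mypackage"))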
Like with devtools, you can use subdirectories, specific commits or tags in the specification.
Once a context has been created, we can create a queue with it. This is separate from the actual cluster queue, but will be our interface to it. Running this step takes a while because it installs all the packages that the cluster will need into the context directory.
obj <- didehpc::queue_didehpc(ctx)
#> Loading context 02a9261fe4e9e6554d76936ecb35cef0
#> [ context ] 02a9261fe4e9e6554d76936ecb35cef0
#> [ library ]
#> [ namespace ]
#> [ source ] mysources.R
#> Running installation script on cluster
#> ,:\ /:.
#> // \_()_/ \\
#> || | | || CONAN THE LIBRARIAN
#> || | | || Library: Q:\didehpc\20210817-145020\contexts\lib\windows\4.0
#> || |____| || Bootstrap: T:\conan\bootstrap\4.0
#> \\ / || \ // Cache: Q:\didehpc\20210817-145020\contexts\conan\cache/pkg
#> `:/ || \;' Policy: lazy
#> || Repos:
#> || * https://mrc-ide.github.io/didehpc-pkgs
#> XX * https://cloud.r-project.org
#> XX Packages:
#> XX * context
#> XX
#> OO
#> `'
#> i Loading metadata database
#> v Loading metadata database ... done
#>
#> i Getting 9 pkgs (5.29 MB) and 1 pkg with unknown size
#> v Got context 0.3.0 (source) (37.72 kB)
#> v Got ids 1.0.1 (windows) (123.89 kB)
#> v Got R6 2.5.0 (windows) (84.09 kB)
#> v Got askpass 1.1 (windows) (243.58 kB)
#> v Got crayon 1.4.1 (windows) (141.87 kB)
#> v Got digest 0.6.27 (windows) (268.65 kB)
#> v Got uuid 0.1-4 (windows) (33.77 kB)
#> v Got sys 3.4 (windows) (59.83 kB)
#> v Got openssl 1.4.4 (windows) (4.10 MB)
#> v Got storr 1.2.5 (windows) (401.33 kB)
#> v Installed R6 2.5.0 (594ms)
#> v Installed crayon 1.4.1 (766ms)
#> v Installed ids 1.0.1 (860ms)
#> v Installed askpass 1.1 (1s)
#> v Installed sys 3.4 (954ms)
#> v Installed digest 0.6.27 (1.2s)
#> v Installed storr 1.2.5 (1.3s)
#> v Installed uuid 0.1-4 (1s)
#> v Installed openssl 1.4.4 (1.5s)
#> i Building context 0.3.0
#> v Built context 0.3.0 (3.1s)
#> v Installed context 0.3.0 (344ms)
#> v Summary: 10 new in 12.7s
#> Done!
If the above command does not throw an error, then you have successfully logged in and the cluster is ready to use.
When you first run queue_didehpc it will install Windows versions of all required packages within the context directory (here, “contexts”). This is necessary even when you are on Windows because the cluster cannot see files that are on your computer.
obj is a weird sort of object called an R6 class. It’s a bit like a Python or Java class if you’ve come from those languages. The thing you need to know is that the object is like a list and contains a number of functions that can be run by running obj$functionname(). These functions all act by side effect; they interact with a little database stored in the context root directory or by communicating with the cluster using the web interface that Wes created.
obj
#> <queue_didehpc>
#> Inherits from: <queue_base>
#> Public:
#> client: web_client, R6
#> cluster_load: function (cluster = NULL, nodes = TRUE)
#> config: didehpc_config
#> context: context
#> dide_id: function (task_ids)
#> dide_log: function (task_id)
#> enqueue: function (expr, envir = parent.frame(), submit = TRUE, name = NULL)
#> enqueue_: function (expr, envir = parent.frame(), submit = TRUE, name = NULL)
#> enqueue_bulk: function (X, FUN, ..., do_call = TRUE, envir = parent.frame(),
#> initialize: function (context, config, root, initialise, provision, login,
#> initialize_context: function ()
#> install_packages: function (packages, repos = NULL, policy = "lazy", dryrun = FALSE,
#> lapply: function (X, FUN, ..., envir = parent.frame(), timeout = 0, time_poll = 1,
#> login: function (refresh = TRUE)
#> mapply: function (FUN, ..., MoreArgs = NULL, envir = parent.frame(),
#> provision_context: function (policy = "verylazy", dryrun = FALSE, quiet = FALSE,
#> reconcile: function (task_ids = NULL)
#> rrq_controller: function ()
#> stop_workers: function (worker_ids = NULL)
#> submit: function (task_ids, names = NULL)
#> submit_workers: function (n, timeout = 600, progress = NULL)
#> task_bundle_get: function (name)
#> task_bundle_info: function ()
#> task_bundle_list: function ()
#> task_delete: function (task_ids)
#> task_get: function (task_id, check_exists = TRUE)
#> task_list: function ()
#> task_result: function (task_id)
#> task_status: function (task_ids = NULL, named = TRUE)
#> task_times: function (task_ids = NULL, unit_elapsed = "secs", sorted = TRUE)
#> unsubmit: function (task_ids)
#> Private:
#> data: list
#> db: storr, R6
#> lib: queue_library, R6
#> provisioned: TRUE
#> root: context_root
#> submit_or_delete: function (task_ids, name = NULL)
For documentation about the individual methods, see the help for didehpc::queue_didehpc and in queuer (much of this needs writing still!). For example, to see the overall cluster load you can run:
obj$cluster_load(TRUE)
#> name free used total % used
#> --------------- ---- ---- ----- ------
#> fi--dideclusthn 134 82 216 38%
#> fi--didemrchnb 730 1050 1780 59%
#> wpia-hpc-hn 352 0 352 0%
#> --------------- ---- ---- ----- ------
#> didehpc 1216 1132 2348 48%
(If you’re on an ANSI-compatible terminal this will be in glorious Technicolor.)
Before running a real job, let’s test that everything works correctly
by running the sessionInfo
command on the cluster. When run
locally, sessionInfo
prints information about the state of
your R session:
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] zip_2.1.1 pillar_1.4.7 compiler_4.0.3 prettyunits_1.1.1
#> [5] tools_4.0.3 digest_0.6.27 pkgbuild_1.2.0 uuid_0.1-4
#> [9] lifecycle_0.2.0 jsonlite_1.7.1 evaluate_0.14 tibble_3.0.4
#> [13] pkgconfig_2.0.3 rlang_0.4.9 cli_2.2.0 filelock_1.0.2
#> [17] curl_4.3 conan_0.1.1 xfun_0.19 withr_2.3.0
#> [21] queuer_0.3.0 storr_1.2.5 httr_1.4.2 stringr_1.4.0
#> [25] knitr_1.30 vctrs_0.3.5 desc_1.2.0 askpass_1.1
#> [29] didehpc_0.3.6 context_0.3.0 rprojroot_2.0.2 glue_1.4.2
#> [33] R6_2.5.0 processx_3.4.5 rematch_1.0.1 fansi_0.4.1
#> [37] callr_3.5.1 magrittr_2.0.1 rematch2_2.1.2 ids_1.0.1
#> [41] pkgdepends_0.1.0 ps_1.4.0 ellipsis_0.3.1 assertthat_0.2.1
#> [45] lpSolve_5.6.15 stringi_1.5.3 openssl_1.4.3 crayon_1.3.4
To run this on the cluster, we wrap it in obj$enqueue. This prevents the evaluation of the expression and instead organises it to be run on the cluster:
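t <- obj$enqueue(sessionInfo())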
We can then poll the cluster for results until it completes:
t$wait(100)
#> (-) waiting for d826bc6...9e1, giving up in 99.5 s (\) waiting for
#> d826bc6...9e1, giving up in 98.9 s
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows Server 2012 R2 x64 (build 9600)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252
#> [2] LC_CTYPE=English_United Kingdom.1252
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.0.3 R6_2.5.0 context_0.3.0 digest_0.6.27 storr_1.2.5
(see the next section for more information about this).
The important part to notice here is that the R “Platform” (second and third line) is Windows Server, as opposed to the host machine which is running Linux. If we had added packages to the context they would be shown too.
Let’s run something more interesting now by running the random_walk function defined in the mysources.R file. As above, jobs are queued by running:
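t <- obj$enqueue(random_walk(0, 10))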
Like the queue object obj, task objects are R6 objects that can be used to get information and results back from the task.
t
#> <queuer_task>
#> Public:
#> clone: function (deep = FALSE)
#> context_id: function ()
#> expr: function (locals = FALSE)
#> id: c6cfa739a58b925f4919c4d4ebdead1e
#> initialize: function (id, root, check_exists = TRUE)
#> log: function (parse = TRUE)
#> result: function (allow_incomplete = FALSE)
#> root: context_root
#> status: function ()
#> times: function (unit_elapsed = "secs")
#> wait: function (timeout, time_poll = 0.5, progress = NULL)
You can get the task’s status:
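t$status()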
…which will move from PENDING to RUNNING to COMPLETE or ERROR. You can get information on submission and running times (see t$times() below), and you can try to get the result of running the task:
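t$result()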
This errors if the task is not yet complete.
The wait function, used above, is like result but it will repeatedly poll for the task to be completed, for up to timeout seconds.
t$wait(100)
#> (-) waiting for c6cfa73...d1e, giving up in 99.5 s (\) waiting for
#> c6cfa73...d1e, giving up in 99.0 s
#> [1] -0.09103302 0.60818181 2.22847428 2.28615058 2.27823970 2.18390202
#> [7] 4.32781963 4.53920816 3.29061261 2.64571591
Once the task has completed, t$result() and t$wait are equivalent:
t$result()
#> [1] -0.09103302 0.60818181 2.22847428 2.28615058 2.27823970 2.18390202
#> [7] 4.32781963 4.53920816 3.29061261 2.64571591
You can query the times of your tasks:
t$times()
#> task_id submitted started
#> 1 c6cfa739a58b925f4919c4d4ebdead1e 2021-08-17 14:50:47 2021-08-17 14:50:48
#> finished waiting running idle
#> 1 2021-08-17 14:50:48 0.7953935 0.07812715 0.7084134
which will show you when the task was submitted, started and stopped.
Every task creates a log:
t$log()
#> [ hello ] 2021-08-17 14:50:47
#> [ wd ] Q:/didehpc/20210817-145020
#> [ init ] 2021-08-17 14:50:47.946
#> [ hostname ] FI--DIDECLUST26
#> [ process ] 3752
#> [ version ] 0.3.0
#> [ open:db ] rds
#> [ context ] 02a9261fe4e9e6554d76936ecb35cef0
#> [ library ]
#> [ namespace ]
#> [ source ] mysources.R
#> [ parallel ] running as single core job
#> [ root ] Q:\didehpc\20210817-145020\contexts
#> [ context ] 02a9261fe4e9e6554d76936ecb35cef0
#> [ task ] c6cfa739a58b925f4919c4d4ebdead1e
#> [ expr ] random_walk(0, 10)
#> [ start ] 2021-08-17 14:50:48.102
#> [ ok ]
#> [ end ] 2021-08-17 14:50:48.227
Warning messages and other output will be printed here. So if you include message(), cat() or print() calls in your task they will appear between start and end.
There is another bit of log that happens before this and contains information about getting the system started up. You should only need to look at this when a job seems to get stuck with status PENDING for ages.
obj$dide_log(t)
#> [1] "generated on host: kea"
#> [2] "generated on date: 2021-08-17"
#> [3] "didehpc version: 0.3.6"
#> [4] "context version: 0.3.0"
#> [5] "running on: FI--DIDECLUST26"
#> [6] "mapping Q: -> \\\\fi--san03.dide.ic.ac.uk\\homes\\rfitzjoh"
#> [7] "The command completed successfully."
#> [8] ""
#> [9] "mapping T: -> \\\\fi--didef3.dide.ic.ac.uk\\tmp"
#> [10] "The command completed successfully."
#> [11] ""
#> [12] "Using Rtools at T:\\Rtools\\Rtools40"
#> [13] "working directory: Q:\\didehpc\\20210817-145020"
#> [14] "this is a single task"
#> [15] "logfile: Q:\\didehpc\\20210817-145020\\contexts\\logs\\c6cfa739a58b925f4919c4d4ebdead1e"
#> [16] ""
#> [17] "Q:\\didehpc\\20210817-145020>Rscript \"Q:\\didehpc\\20210817-145020\\contexts\\bin\\task_run\" \"Q:\\didehpc\\20210817-145020\\contexts\" c6cfa739a58b925f4919c4d4ebdead1e 1>\"Q:\\didehpc\\20210817-145020\\contexts\\logs\\c6cfa739a58b925f4919c4d4ebdead1e\" 2>&1"
#> [18] "Removing mapping Q:"
#> [19] "Q: was deleted successfully."
#> [20] ""
#> [21] "Removing mapping T:"
#> [22] "T: was deleted successfully."
#> [23] ""
#> [24] "Quitting"
The queue knows which tasks it has created and you can list them:
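obj$task_list()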
The long identifiers are random and are long enough that collisions are unlikely.
Notice that the task ran remotely but we never had to indicate which filename things were written to. There is a small database based on storr that holds all the information within the context root (here, “contexts”). This means you can close down R and later on regenerate the ctx and obj objects, recreate the task objects, and re-get your results. But at the same time it provides the illusion that the cluster has passed an object directly back to you.
There are two broad options here:
- an lapply-like interface, with $lapply
- $enqueue_bulk
The second approach is more general and $lapply is implemented using it.
Suppose we want to run a bunch of random walks of different lengths. This would involve mapping our random_walk function over a vector of sizes:
sizes <- 3:8
grp <- obj$lapply(sizes, random_walk, x = 0)
#> Creating bundle: 'sound_nymph'
#> [ bulk ] Creating 6 tasks
#> submitting 6 tasks
#> submitting (-) [============>--------------------------] 33% | waited for 0s
#> submitting (\) [===================>-------------------] 50% | waited for 1s
#> submitting (|) [=========================>-------------] 67% | waited for 1s
#> submitting (/) [===============================>-------] 83% | waited for 2s
#> submitting (-) [=======================================] 100% | waited for 2s
By default, $lapply returns a “task_bundle” with an automatically generated name. You can customise the name with the name argument.
In contrast to lapply, this is not blocking (i.e., submitting tasks and collecting the results is done asynchronously), but if you pass a timeout argument to $lapply then it will poll until the jobs are done, in the same way as wait(), below.
Get the status of all the jobs
grp$status()
#> 8ca54f49ac82893d91c19a13891cb3ee c80f92cc3f3babc837a8ef18603db4f0
#> "COMPLETE" "COMPLETE"
#> 85935191f16cb04fd814f6df3297158c d5a841a1f7916c8a2eaf645143fe3359
#> "COMPLETE" "COMPLETE"
#> 22252a47ced26ab45e2d1387bdf4ecb8 5c4ae85ef913ce4e55010feafbb5c2a1
#> "PENDING" "PENDING"
Wait until they are all complete and get the results
res <- grp$wait(120)
#> (-) [==============================================] 100% | giving up in 119 s
The other bulk interface is where you want to run a function over a combination of parameters. Suppose we wanted to run random walks of a number of lengths from a number of starting positions, in all combinations. We might enumerate the possibilities like:
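# Hypothetical values: every combination of starting point and walk length
pars <- expand.grid(x = c(0, 1, 2), n_steps = c(5, 10))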
We can submit this as a group of 6 jobs with enqueue_bulk. Here we add the timeout option, which makes this a blocking operation:
obj$enqueue_bulk(pars, random_walk, timeout = 120)
#> Creating bundle: 'selfish_anemoneshrimp'
#> [ bulk ] Creating 6 tasks
#> submitting 6 tasks
#> submitting (-) [============>--------------------------] 33% | waited for 1s
#> submitting (\) [===================>-------------------] 50% | waited for 1s
#> submitting (|) [=========================>-------------] 67% | waited for 2s
#> submitting (/) [===============================>-------] 83% | waited for 3s
#> submitting (-) [=======================================] 100% | waited for 3s
#> (-) [==============================================] 100% | giving up in 119 s
#> [[1]]
#> [1] -0.5092372 -0.4486443 -0.6431714 0.1341441 -1.1818970
#>
#> [[2]]
#> [1] -1.18855584 -0.02861008 0.82356347 1.51119207 0.35416280
#>
#> [[3]]
#> [1] 1.850285 2.423607 1.597023 1.560501 2.187977
#>
#> [[4]]
#> [1] -1.0831528 -4.0599794 -2.9609470 -2.1744547 -0.1222570 -0.8703516
#> [7] -0.9264912 -0.5261700 1.2054075 0.2159083
#>
#> [[5]]
#> [1] 0.72165118 0.36474247 0.55938912 -0.01860092 0.61308581 -0.80447926
#> [7] 0.41558732 0.51387762 0.83248026 0.44125104
#>
#> [[6]]
#> [1] 2.0346518 3.2722265 3.4380578 5.1838941 3.7677278 1.9371159 1.5488215
#> [8] 1.7675202 0.7970567 1.1976480
This has applied the function random_walk over each row of pars.
Suppose you fire off a bunch of jobs and realise that you have the wrong data or they’re all going to fail - you can stop them fairly easily.
Here’s a job that will run for an hour and return nothing:
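t <- obj$enqueue(Sys.sleep(3600)) # a sketch: sleep for an hour, return nothing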
Wait for the job to start up:
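# A minimal polling sketch: loop until the task leaves the queue
while (t$status() == "PENDING") {
  Sys.sleep(0.5)
}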
Now that it’s started, it can be cancelled with the $unsubmit method:
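obj$unsubmit(t$id)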
Unsubmitting multiple times is safe, and will have no effect.
Alternatively you can use obj$task_delete(t$id), which unsubmits the task and then deletes it.
Note that the task is not actually deleted (see below); you can still get at the expression:
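t$expr()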
but you cannot retrieve results:
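t$result() # this will throw an error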
The argument to unsubmit can be a vector. For example, if you had a task bundle grp you could unsubmit all members of the group with:
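# Assuming the bundle exposes its task identifiers as grp$ids
obj$unsubmit(grp$ids)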
Deleting tasks is supported but it isn’t entirely encouraged. Not all of the functions behave well with missing tasks, so if you delete things and still have old task handles floating around you might get confusing results.
There is a delete method (obj$task_delete) that will delete jobs, first unsubmitting them if they have been submitted. It takes a vector of task ids as an argument.
If you are running tasks that can use more than one core, you can request more resources for your task and use process-level parallelism with the parallel package. To request 8 cores, you could run:
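# A sketch using the cores option mentioned above
config <- didehpc::didehpc_config(cores = 8)
obj <- didehpc::queue_didehpc(ctx, config = config)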
When your task starts, 8 cores will be allocated to it and a parallel cluster will be created. You can use it with things like parallel::parLapply, specifying cl as NULL. So if within your cluster job you needed to apply a function f to each element of a list x, you could write:
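parallel::parLapply(NULL, x, f) # NULL means use the cluster set up for the task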
The parallel bits can be embedded within larger blocks of code. All functions in parallel that take cl as a first argument can be used. You do not need to (and should not) set up the cluster, as this will happen automatically as the job starts. Alternatively, if you want to control cluster creation (e.g., you are using software that does this for you) then pass parallel = FALSE to the config call:
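config <- didehpc::didehpc_config(cores = 8, parallel = FALSE)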
In this case you are responsible for setting up the cluster.
As an alternative to requesting cores, you can use a different job template:
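# "32Core" is an illustrative template name; check which templates are
# available on your cluster
config <- didehpc::didehpc_config(template = "32Core")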
which will reserve you the entire node. Again, a cluster will be started with all available cores unless you also specify parallel = FALSE.
If you have thousands and thousands of jobs to submit at once, you may not want to flood the cluster with them all. Each job submission is relatively slow (the HPC tools that the web interface has to use are relatively slow). The actual queue that the cluster uses doesn’t seem to like processing tens of thousands of jobs, and can slow down. And if you take up the whole cluster, someone may come and knock on your office door and complain at you. At the same time, batching your jobs up into little bits and manually sending them off is a pain, and work better done by a computer.
An alternative is to submit a set of “workers” to the cluster, and then submit jobs to them. This is done with the rrq package, along with a redis server running on the cluster. See the “workers” vignette for details.
For all operating systems, if you are on the wireless network you will need to connect to the VPN. If you can get on a wired network you’ll likely have a better time, because the VPN and wireless network seem less stable in general. Instructions for setting up the VPN are here.
Your network drives are likely already mapped for you. In fact you should not even need to map drives, as fully qualified network names (e.g. //fi--didef3/tmp) should work for you.
In Finder, go to Go -> Connect to Server... or press Command-K. In the address field write the name of the share you want to connect to. Useful ones are:
- smb://fi--san03.dide.ic.ac.uk/homes/<username> – your home share
- smb://fi--didef3.dide.ic.ac.uk/tmp – the temporary share
At some point in the process you should get prompted for your username and password, but I can’t remember what that looks like.
These directories will be mounted at /Volumes/<username> and /Volumes/tmp (so the last bit of the filename will be used as the mountpoint within Volumes). There may be a better way of doing this, and the connection will not be re-established automatically, so if anyone has a better way let me know.
This is what I have done for my computer and it seems to work, though it’s not incredibly fast. Full instructions are on the Ubuntu community wiki.
First, install cifs-utils:
sudo apt-get install cifs-utils
In your /etc/fstab file, add:
//fi--san03/homes/<dide-username> <home-mount-point> cifs uid=<local-userid>,gid=<local-groupid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp <tmp-mount-point> cifs uid=<local-userid>,gid=<local-groupid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,vers=2.0,sec=ntlmssp,iocharset=utf8 0 0
where:
- <dide-username> is your DIDE username, without the DIDE\ bit.
- <local-username> is your local username (i.e., echo $USER).
- <local-userid> is your local numeric user id (i.e., id -u $USER).
- <local-groupid> is your local numeric group id (i.e., id -g $USER).
- <home-mount-point> is where you want your DIDE home directory mounted.
- <tmp-mount-point> is where you want the DIDE temporary directory mounted.
Please back this file up before editing.
So for example, I have:
//fi--san03/homes/rfitzjoh /home/rich/net/home cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp /home/rich/net/temp cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
The file .smbcredentials contains:
username=<dide-username>
password=<dide-password>
Set its permissions with chmod 600 ~/.smbcredentials for a modicum of security, but be aware your password is stored in plaintext.
This set up is clearly insecure. I believe if you omit the credentials line you can have the system prompt you for a password interactively, but I’m not sure how that works with automatic mounting.
Finally, run mount -a to mount all drives; with any luck it will all work and you won’t have to do this again until you get a new computer.