Title: DIDE HPC Support
Description: DIDE HPC support.
Authors: Rich FitzJohn [aut, cre], Wes Hinsley [aut], Imperial College of Science, Technology and Medicine [cph]
Maintainer: Rich FitzJohn <[email protected]>
License: MIT + file LICENSE
Version: 0.3.22
Built: 2024-10-26 02:55:11 UTC
Source: https://github.com/mrc-ide/didehpc
Overall cluster load for all clusters that you have access to.
cluster_load(credentials = NULL)
credentials: Your credentials
Collects configuration information. Unfortunately there's a fairly complicated process of working out what goes where so documentation coming later.
didehpc_config(
  credentials = NULL, home = NULL, temp = NULL, cluster = NULL,
  shares = NULL, template = NULL, cores = NULL, wholenode = NULL,
  parallel = NULL, workdir = NULL, use_workers = NULL, use_rrq = NULL,
  worker_timeout = NULL, worker_resource = NULL, conan_bootstrap = NULL,
  r_version = NULL, use_java = NULL, java_home = NULL
)

didehpc_config_global(..., check = TRUE)
credentials: Either a list with elements username, password, or a path to a file containing lines
home: Path to network home directory, on local system
temp: Path to network temp directory, on local system
cluster: Name of the cluster to use; one of valid_clusters()
shares: Optional additional share mappings. Can either be a single path mapping (as returned by path_mapping) or a list of such mappings
template: A job template. On "fi--dideclusthn" this can be "GeneralNodes" or "8Core". On "fi--didemrchnb" this can be "GeneralNodes", "12Core", "16Core", "12and16Core", "20Core", "24Core", "32Core", or "MEM1024" (for nodes with 1TB of RAM; we have three - two of which have 32 cores, and the other is the AMD EPYC with 64). On the new "wpia-hn" cluster, you should currently use "AllNodes". See the main cluster documentation if you tweak these parameters, as you may not have permission to use all templates (and if you use one that you don't have permission for, the job will fail). For training purposes there is also a "Training" template, but you will only need to use this when instructed to.
cores: The number of cores to request. If specified, then we will request this many cores from the windows queuer. If you request too many cores then your task will queue forever! 24 is the largest this can be on "fi--dideclusthn". On "fi--didemrchnb", the GeneralNodes template has mainly 20 cores or less, with a single 64 core node, and the 32Core template has 32 core nodes. On "wpia-hn", all the nodes are 32 core. If
wholenode: If TRUE, request exclusive access to whichever compute node is allocated to the job. Your code will have access to all the cores and memory on the node.
parallel: Should we set up the parallel cluster? Normally if more than one core is implied (via the cores or wholenode options) then a parallel cluster will be set up (see Details).
workdir: The path to work in on the cluster, if running out of place.
use_workers: Submit jobs to an internal queue, and run them on a set of workers submitted separately? If
use_rrq: Use
worker_timeout: When using workers (via use_workers or use_rrq)
worker_resource: Optionally, an object created by worker_resource()
conan_bootstrap: Logical, indicating if we should use the shared conan "bootstrap" library stored on the temporary directory. Setting this to
r_version: A string, or
use_java: Logical, indicating if the script is going to require Java, for example via the rJava package.
java_home: A string, optionally giving the path of a custom Java Runtime Environment, which will be used if the use_java logical is true. If left blank, then the default cluster Java Runtime Environment will be used.
...: arguments to didehpc_config
check: Logical, indicating if we should check that the configuration object can be created
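As an illustrative sketch (the username, cluster and template below are placeholders, not recommendations), a typical configuration might look like:

# Placeholder values; adjust for your own DIDE account and cluster access.
config <- didehpc::didehpc_config(
  credentials = "yourusername",
  cluster = "fi--didemrchnb",
  template = "GeneralNodes",
  cores = 2
)

# Alternatively, set the same options globally so that a later call to
# didehpc_config() picks them up by default:
didehpc::didehpc_config_global(credentials = "yourusername",
                               cluster = "fi--didemrchnb")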
If you need more than one core per task (i.e., you want each task to do some parallel processing in addition to the parallelism between tasks) you can do that through the configuration options here.

The template option chooses among templates defined on the cluster.

If you specify cores, the HPC will queue your job until an appropriate number of cores appears for the selected template. This can leave your job queuing forever (e.g., selecting 20 cores on a 16Core template) so be careful.

Alternatively, if you specify wholenode as TRUE, then you will have exclusive access to whichever compute node is allocated to your job, reserving all of its cores.

If more than 1 core is requested (either by choosing wholenode, or by specifying a cores value greater than 1), then on startup a parallel cluster will be started using parallel::makePSOCKcluster, and this will be registered as the default cluster. The nodes will all have the appropriate context loaded and you can immediately use them with parallel::clusterApply and related functions by passing NULL as the first argument. The cluster will be shut down politely on exit, and logs will be output to the "workers" directory below your context root.
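A minimal sketch of how this looks in practice (assuming a context object ctx created with context::context_save(); the core count is arbitrary):

# Request 4 cores per task; a PSOCK cluster is then registered as the
# default cluster inside each running task.
config <- didehpc::didehpc_config(cores = 4)
obj <- didehpc::queue_didehpc(ctx, config)

# Within the task, pass NULL to use the pre-registered default cluster:
t <- obj$enqueue(parallel::clusterApply(NULL, 1:4, function(i) Sys.getpid()))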
The options use_workers and use_rrq interact, share some functionality, but are quite different.

With use_workers, jobs are never submitted when you run enqueue or one of the bulk submission commands in queuer. Instead you submit workers using submit_workers and then the submission commands push task ids onto a Redis queue that the workers monitor.

With use_rrq, enqueue etc. still work as before, plus you must submit workers with submit_workers. The difference is that any job may access the rrq_controller and push jobs onto a central pool of tasks.

I'm not sure at this point if it makes any sense for the two approaches to work together so this is disabled for now. If you think you have a use case please let me know.
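As a sketch of the use_rrq workflow (assuming a context object ctx and the Redis setup described in vignette("workers")):

config <- didehpc::didehpc_config(use_rrq = TRUE)
obj <- didehpc::queue_didehpc(ctx, config)

# Start some workers on the cluster, then obtain the lightweight queue
# controller that pushes tasks straight onto the workers' Redis queue:
obj$submit_workers(10)
rrq <- obj$rrq_controller()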
Describe a path mapping for use when setting up jobs on the cluster.
path_mapping(name, path_local, path_remote, drive_remote)
name: Name of this map. Can be anything at all, and is used for information purposes only.
path_local: The point where the drive is attached locally. On Windows this will be something like "Q:/", on Mac something like "/Volumes/mountname", and on Linux it could be anything at all, depending on what you used when you mounted it (or what is written in /etc/fstab)
path_remote: The network path for this drive. It will look something like
drive_remote: The place to mount the drive on the cluster. We're probably going to mount things at Q: and T: already so don't use those. And things like C: are likely to be used. Perhaps there are some guidelines for this somewhere?
Rich FitzJohn
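A sketch of how a mapping is typically constructed and passed into the configuration; the server name, share and drive letter below are placeholders:

share <- didehpc::path_mapping(
  name = "project data",                         # informational only
  path_local = "Q:/my_project",                  # where the share is mounted locally
  path_remote = "\\\\server\\share\\my_project", # placeholder UNC path
  drive_remote = "V:"                            # avoid Q: and T:
)
config <- didehpc::didehpc_config(shares = share)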
Create a queue object. This is an R6::R6Class object which you interact with by calling "methods" which are described below, and on the help page for queuer::queue_base, from which this derives.
queue_didehpc( context, config = didehpc_config(), root = NULL, initialise = TRUE, provision = NULL, login = NULL )
context: A context
config: Optional dide configuration information.
root: A root directory, not usually needed
initialise: Passed through to the base queue. If you set this to
provision: A provisioning strategy to use. Options are
login: Logical, indicating if we should immediately login. If
queuer::queue_base -> queue_didehpc

config: Your didehpc_config() for this queue. Do not change this after queue creation as changes may not take effect as expected.

client: A web_client object used to communicate with the web portal. See the help page for its documentation, but you will typically not need to interact with this.
queuer::queue_base$enqueue()
queuer::queue_base$enqueue_()
queuer::queue_base$enqueue_bulk()
queuer::queue_base$initialize_context()
queuer::queue_base$lapply()
queuer::queue_base$mapply()
queuer::queue_base$task_bundle_get()
queuer::queue_base$task_bundle_info()
queuer::queue_base$task_bundle_list()
queuer::queue_base$task_bundle_retry_failed()
queuer::queue_base$task_delete()
queuer::queue_base$task_get()
queuer::queue_base$task_list()
queuer::queue_base$task_result()
queuer::queue_base$task_retry_failed()
queuer::queue_base$task_status()
queuer::queue_base$task_times()
new()
Constructor
queue_didehpc_$new( context, config, root, initialise, provision, login, client = NULL )
context, config, root, initialise, provision, login: See above
client: A web_client object, primarily useful for testing the package
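A sketch of the usual workflow (the package name and expression are arbitrary examples; the task methods status() and result() come from queuer):

# Save a context describing what your tasks need, then build the queue:
ctx <- context::context_save("contexts", packages = c("ape"))
obj <- didehpc::queue_didehpc(ctx)

# Queue a task, then poll it:
t <- obj$enqueue(sqrt(2))
t$status()
t$result()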
login()
Log onto the web portal. This will be called automatically either when creating the object (by default) or when you make your first request to the portal. However, you can call this to refresh the session too.
queue_didehpc_$login(refresh = TRUE)
refresh: Logical, indicating if we should try logging on again, even if it looks like we already have. This will refresh the session, which is typically what you want to do.
cluster_load()
Report on the overall cluster usage
queue_didehpc_$cluster_load(cluster = NULL, nodes = TRUE)
cluster: Cluster to show; if TRUE show the entire cluster (via load_overall), if NULL defaults to the value of config$cluster
nodes: Show the individual nodes when printing
reconcile()
Attempt to reconcile any differences in task state between our database and the HPC queue. This is needed when tasks have crashed, or something otherwise bad has happened, and you have tasks stuck in PENDING or RUNNING that are clearly not happy. This function does not play well with workers and you should not use it if using them.
queue_didehpc_$reconcile(task_ids = NULL)
task_ids: A vector of tasks to check
submit()
Submit a task to the queue. Ordinarily you do not call this directly; it is called by the $enqueue() method of queuer::queue_base when you create a task. However, you can use this to resubmit a task that has failed if you think it will run successfully a second time (e.g., because you cancelled it the first time around).
queue_didehpc_$submit(task_ids, names = NULL, depends_on = NULL)
task_ids: A vector of task identifiers to submit.
names: Optional names for the tasks.
depends_on: Optional vector of dependencies, named by task id
submit_workers()
Submit workers to the queue. This only works if use_rrq or use_workers is TRUE in your configuration. See vignette("workers") for more information.
queue_didehpc_$submit_workers(n, timeout = 600, progress = NULL)
n: The number of workers to submit
timeout: The time to wait, in seconds, for all workers to come online. An error will be thrown if this time is exceeded.
progress: Logical, indicating if a progress bar should be printed while waiting for workers.
stop_workers()
Stop workers running on the cluster. See vignette("workers") for more information. By default workers will timeout after 10 minutes of inactivity.
queue_didehpc_$stop_workers(worker_ids = NULL)
worker_ids: Vector of worker names to try and stop. By default all workers are stopped.
rrq_controller()
Return an rrq::rrq_controller object, if you have set use_rrq or use_workers in your configuration. This is a lightweight queue using your workers which is typically much faster than submitting via $enqueue(). See vignette("workers") for more information.
queue_didehpc_$rrq_controller()
unsubmit()
Unsubmit tasks from the cluster. This removes the tasks from the queue if they have not been started yet, and stops them if currently running. It will have no effect if the tasks are completed (successfully or errored).
queue_didehpc_$unsubmit(task_ids)
task_ids: Can be a task id (string), a vector of task ids, a task, a list of tasks, a bundle returned by enqueue_bulk, or a list of bundles.
dide_id()
Find the DIDE task id of your task. This is the number displayed in the web portal.
queue_didehpc_$dide_id(task_ids)
task_ids: Vector of task identifiers to look up
dide_log()
Return the pre-context log of a task. Use this to find out what has happened to a task that has unexpectedly failed, but for which $log() is uninformative.
queue_didehpc_$dide_log(task_id)
task_id: A single task id to check
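A brief sketch of how these two methods are typically used together when debugging a failed task (assuming a queue obj created by queue_didehpc and a task object t returned by $enqueue()):

# Look up the numeric id shown in the web portal, then fetch the
# scheduler-level log for that task:
obj$dide_id(t$id)
obj$dide_log(t$id)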
provision_context()
Provision your context for running on the cluster. This sets up the remote set of packages that your tasks will use. See vignette("packages") for more information.
queue_didehpc_$provision_context( policy = "verylazy", dryrun = FALSE, quiet = FALSE, show_progress = NULL, show_log = TRUE )
policy: The installation policy to use, as interpreted by pkgdepends::pkg_solution - so this should be verylazy/lazy (install missing packages but don't upgrade unless needed) or upgrade (upgrade packages where possible). In addition you can also use later, which does nothing, or fake, which pretends that it ran the provisioning. See vignette("packages") for details on these options.
dryrun: Do a dry run installation locally - this just checks that your requested set of packages is plausible, but does this without submitting a cluster job so it may be faster.
quiet: Logical, controls printing of informative messages
show_progress: Logical, controls printing of a spinning progress bar
show_log: Logical, controls printing of the log from the cluster
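A short sketch of typical use, immediately after creating the queue (assuming a queue object obj):

# Install the context's packages into the cluster library:
obj$provision_context()

# Or first check the proposed installation locally, without submitting
# a cluster job:
obj$provision_context(dryrun = TRUE)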
install_packages()
Install packages on the cluster. This can be used to more directly install packages on the cluster than the $provision_context method that you would typically use. See vignette("packages") for more information.
queue_didehpc_$install_packages( packages, repos = NULL, policy = "lazy", dryrun = FALSE, show_progress = NULL, show_log = TRUE )
packages: A character vector of packages to install. These can be names of CRAN packages or GitHub references etc; see pkgdepends::new_pkg_installation_proposal() and vignette("packages") for more details
repos: A character vector of repositories to use when installing. A suitable CRAN repo will be added if not detected.
policy: The installation policy to use, as interpreted by pkgdepends::pkg_solution - so this should be lazy (install missing packages but don't upgrade unless needed) or upgrade (upgrade packages where possible). In addition you can also use later, which does nothing, or fake, which pretends that it ran the provisioning. See vignette("packages") for details on these options.
dryrun: Do a dry run installation locally - this just checks that your requested set of packages is plausible, but does this without submitting a cluster job so it may be faster.
show_progress: Logical, controls printing of a spinning progress bar
show_log: Logical, controls printing of the log from the cluster
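A sketch of installing extra packages onto the cluster library without changing the context (the package names here are placeholders; assumes a queue object obj):

# A CRAN package plus a GitHub reference, resolved via pkgdepends:
obj$install_packages(c("data.table", "someuser/somepackage"))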
DIDE cluster web client
Client for the DIDE cluster web interface.
new()
Create an API client for the DIDE cluster
web_client$new( credentials = NULL, cluster_default = "fi--dideclusthn", login = FALSE, client = NULL )
credentials: Either your username, or a list with at least the element username and possibly the name 'password'. If not given, your password will be prompted for at login.
cluster_default: The default cluster to use; this can be overridden in any command.
login: Logical, indicating if we should immediately login
client: Optional API client object - if given then we prefer this object rather than trying to create a new client with the given credentials.
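A sketch of direct use of the client (the username is a placeholder; most users will go through queue_didehpc instead):

cl <- didehpc::web_client$new(credentials = "yourusername", login = TRUE)
cl$logged_in()                      # TRUE once the session is established
cl$check_access("fi--didemrchnb")   # validate access to this cluster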
login()
Log in to the cluster
web_client$login(refresh = TRUE)
refresh: Logical, indicating if we should login even if it looks like we are already (useful if login has expired)
logout()
Log the client out
web_client$logout()
logged_in()
Test whether the client is logged in, returning TRUE or FALSE.
web_client$logged_in()
check_access()
Validate that we have access to a given cluster
web_client$check_access(cluster = NULL)
cluster: The name of the cluster to check, defaulting to the value given when creating the client.
submit()
Submit a job to the cluster
web_client$submit( path, name, template, cluster = NULL, resource_type = "Cores", resource_count = 1, depends_on = NULL )
path: The path to the job to submit. This must be a windows (UNC) network path, starting with two backslashes, and must be somewhere that the cluster can see.
name: The name of the job (will be displayed in the web interface).
template: The name of the template to use.
cluster: The cluster to submit to, defaulting to the value given when creating the client.
resource_type: The type of resource to request (either Cores or Nodes)
resource_count: The number of resources to request
depends_on: Optional. A vector of dide ids that this job depends on.
cancel()
Cancel a cluster task
web_client$cancel(dide_id, cluster = NULL)
dide_id: The DIDE task id for the task
cluster: The cluster that the task is running on, defaulting to the value given when creating the client.

A named character vector with a status reported by the cluster head node. Names will be the values of dide_id and values one of OK, NOT_FOUND, WRONG_USER, WRONG_STATE, ID_ERROR
log()
Get log from job
web_client$log(dide_id, cluster = NULL)
dide_id: The DIDE task id for the task
cluster: The cluster that the task is running on, defaulting to the value given when creating the client.
status_user()
Return status of all your jobs
web_client$status_user(state = "*", cluster = NULL)
state: The state the job is in. Can be one of Running, Finished, Queued, Failed or Cancelled. Or give * for all states (this is the default).
cluster: The cluster to query, defaulting to the value given when creating the client.
status_job()
Return status of a single job
web_client$status_job(dide_id, cluster = NULL)
dide_id: The id of the job - this will be an integer
cluster: The cluster to query, defaulting to the value given when creating the client.
load_node()
Return an overall measure of cluster use, one entry per node within a cluster.
web_client$load_node(cluster = NULL)
cluster: The cluster to query, defaulting to the value given when creating the client.
load_overall()
Return an overall measure of cluster use, one entry per cluster that you have access to.
web_client$load_overall()
load_show()
Helper function; wraps around load_overall and load_node and always shows the output.
web_client$load_show(cluster = NULL, nodes = TRUE)
cluster: Cluster to show; if TRUE show the entire cluster (via load_overall), if NULL defaults to the value given when the client was created.
nodes: Show the nodes when printing
headnodes()
Return a vector of known cluster headnodes. Typically valid_clusters() will be faster. This endpoint can be used as a relatively fast "ping" to check that you are logged in and that the client and server are talking properly.
web_client$headnodes(forget = FALSE)
forget: Logical, indicating we should re-fetch the value from the server where we have previously fetched it.
r_versions()
Return a vector of all available R versions
web_client$r_versions()
api_client()
Returns the low-level API client for debugging
web_client$api_client()
Test cluster login
web_login(credentials = NULL)
credentials: Your credentials
Specify resources for worker processes. If given, the values here will override those in didehpc_config(). See vignette("workers") for more details.
worker_resource( template = NULL, cores = NULL, wholenode = NULL, parallel = NULL )
template: A job template. On "fi--dideclusthn" this can be "GeneralNodes" or "8Core". On "fi--didemrchnb" this can be "GeneralNodes", "12Core", "16Core", "12and16Core", "20Core", "24Core", "32Core", or "MEM1024" (for nodes with 1TB of RAM; we have three - two of which have 32 cores, and the other is the AMD EPYC with 64). On the new "wpia-hn" cluster, you should currently use "AllNodes". See the main cluster documentation if you tweak these parameters, as you may not have permission to use all templates (and if you use one that you don't have permission for, the job will fail). For training purposes there is also a "Training" template, but you will only need to use this when instructed to.
cores: The number of cores to request. If specified, then we will request this many cores from the windows queuer. If you request too many cores then your task will queue forever! 24 is the largest this can be on "fi--dideclusthn". On "fi--didemrchnb", the GeneralNodes template has mainly 20 cores or less, with a single 64 core node, and the 32Core template has 32 core nodes. On "wpia-hn", all the nodes are 32 core. If
wholenode: If TRUE, request exclusive access to whichever compute node is allocated to the job. Your code will have access to all the cores and memory on the node.
parallel: Should we set up the parallel cluster? Normally if more than one core is implied (via the cores or wholenode options) then a parallel cluster will be set up.
A list with class worker_resource which can be passed into didehpc_config.
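A sketch of how this is intended to be used (the template and core count are placeholders), giving rrq workers different resources from the tasks themselves:

wr <- didehpc::worker_resource(template = "GeneralNodes", cores = 4)
config <- didehpc::didehpc_config(use_rrq = TRUE, worker_resource = wr)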