Oh no! If your task status is failure, that probably indicates an error in your code. There are lots of reasons why this might happen, and the first challenge is working out what went wrong.
id <- task_create_expr(mysimulation(10))
#> ✔ Submitted task '8949ad7d502bf88cae562298fe952e19' using 'example'
This task will fail, and task_status() will report "failure".
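For example, a minimal sketch of checking on it once it has run:
task_wait(id)   # returns FALSE, because the task fails
task_status(id) # "failure"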
The first place to look is the result of the task itself. Unlike an error in your console, an error that happens on the cluster can be returned and inspected:
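For instance (a sketch; the exact printing of the error object may vary):
task_result(id)
# expect an error object along the lines of:
# <simpleError in mysimulation(10): could not find function "mysimulation">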
In this case the error is because the function mysimulation does not exist! This is because we’ve forgotten to tell the cluster where to find it.
The other place worth looking is the task log (via task_log_show()), which provides more diagnostic information. We will often ask you to show this to us.
task_log_show(id)
#>
#> ── hipercow 1.0.52 running at '/tmp/Rtmpehsscb/hv-20241209-12d89f422b8' ────────
#> ℹ library paths:
#> • /tmp/RtmpxLCJQj/Rinst12221aef0f75
#> • /github/workspace/pkglib
#> • /usr/local/lib/R/site-library
#> • /usr/lib/R/site-library
#> • /usr/lib/R/library
#> ℹ id: 8949ad7d502bf88cae562298fe952e19
#> ℹ starting at: 2024-12-09 19:22:48.973767
#> ℹ Task type: expression
#> • Expression: mysimulation(10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ───────────────────────────────────────────────────────────────── task logs ↓ ──
#>
#> ───────────────────────────────────────────────────────────────── task logs ↑ ──
#> ✖ status: failure
#> ✖ Error: could not find function "mysimulation"
#> ℹ finishing at: 2024-12-09 19:22:48.973767 (elapsed: 0.3746 secs)
In this case the task log does not have anything very interesting in it.
Here’s another example, something that might work perfectly well on your machine, but fails on the cluster:
id <- task_create_expr(read.csv("c:/myfile.csv"))
#> ✔ Submitted task 'eeaed447445ee75d6847cf89c0478955' using 'example'
Here is the error, which is a bit less informative this time:
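A sketch of fetching it, as before:
task_result(id)
# expect an error object like:
# <simpleError in file(file, "rt"): cannot open the connection>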
The log gives a better idea of what is going on - the file c:/myfile.csv does not exist (because it is not found on the cluster; using relative paths is much preferred to absolute paths).
task_log_show(id)
#>
#> ── hipercow 1.0.52 running at '/tmp/Rtmpehsscb/hv-20241209-12d89f422b8' ────────
#> ℹ library paths:
#> • /tmp/RtmpxLCJQj/Rinst12221aef0f75
#> • /github/workspace/pkglib
#> • /usr/local/lib/R/site-library
#> • /usr/lib/R/site-library
#> • /usr/lib/R/library
#> ℹ id: eeaed447445ee75d6847cf89c0478955
#> ℹ starting at: 2024-12-09 19:22:49.980328
#> ℹ Task type: expression
#> • Expression: read.csv("c:/myfile.csv")
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ───────────────────────────────────────────────────────────────── task logs ↓ ──
#>
#> ───────────────────────────────────────────────────────────────── task logs ↑ ──
#> ✖ status: failure
#> ✖ Error: cannot open the connection
#> ! 1 warning found:
#> • cannot open file 'c:/myfile.csv': No such file or directory
#> ℹ finishing at: 2024-12-09 19:22:49.980328 (elapsed: 0.3354 secs)
The real content of the error message is present in the warning! You can also get the warnings with
task_result(id)$warnings
#> [[1]]
#> <simpleWarning in file(file, "rt"): cannot open file 'c:/myfile.csv': No such file or directory>
This will be a list of all warnings generated during the execution of your task (even if it succeeds). The traceback also shows what happened:
task_result(id)$trace
#> ▆
#> 1. ├─rlang::try_fetch(...)
#> 2. │ ├─base::tryCatch(...)
#> 3. │ │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 4. │ │ └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 5. │ │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#> 6. │ └─base::withCallingHandlers(...)
#> 7. ├─hipercow:::task_eval_expression(data, envir, verbose)
#> 8. │ ├─hipercow:::eval_with_hr(...)
#> 9. │ │ └─base::force(expr)
#> 10. │ └─base::eval(data$expr, envir)
#> 11. │ └─base::eval(data$expr, envir)
#> 12. ├─utils::read.csv("c:/myfile.csv")
#> 13. │ └─utils::read.table(...)
#> 14. │ └─base::file(file, "rt")
#> 15. └─base::.handleSimpleError(...)
#> 16. └─rlang (local) h(simpleError(msg, call))
#> 17. └─handlers[[2L]](cnd)
These are harder to troubleshoot but we can still pull some information out. The example here was a real-world case and illustrates one of the issues with using a shared filesystem in the way that we do here.
Suppose you have some code in mycode.R:
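The file itself is not reproduced here, but given that times2(10) returns 20 below, it presumably contains something like:
times2 <- function(x) {
  2 * x
}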
We can create an environment with this code and use it just fine:
hipercow_environment_create(sources = "mycode.R")
#> ✔ Created environment 'default'
id <- task_create_expr(times2(10))
#> ✔ Submitted task 'a7809611452301408f3aa84299827a05' using 'example'
task_wait(id)
#> [1] TRUE
task_result(id)
#> [1] 20
…but imagine that you then edit the file and save it in a state that is not syntactically correct:
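The edited file is not reproduced here either, but from the parse error in the log below it looks something like this, with an incomplete definition of a new function (newfun) left at the end:
times2 <- function(x) {
  2 * x
}
newfun <- function(x)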
And then you either submit a task, or a task that you have previously submitted gets run (which could happen ages after you submit it if the cluster is busy).
id <- task_create_expr(times2(10))
#> ✔ Submitted task 'ebdf155647c01bd2037f90eb2aae16e1' using 'example'
task_wait(id)
#> [1] FALSE
task_status(id)
#> [1] "failure"
The error here has happened before getting to your code - it is happening when the source files are loaded. The log makes this a bit clearer:
task_log_show(id)
#>
#> ── hipercow 1.0.52 running at '/tmp/Rtmpehsscb/hv-20241209-12d89f422b8' ────────
#> ℹ library paths:
#> • /tmp/RtmpxLCJQj/Rinst12221aef0f75
#> • /github/workspace/pkglib
#> • /usr/local/lib/R/site-library
#> • /usr/lib/R/site-library
#> • /usr/lib/R/library
#> ℹ id: ebdf155647c01bd2037f90eb2aae16e1
#> ℹ starting at: 2024-12-09 19:22:52.122022
#> ℹ Task type: expression
#> • Expression: times2(10)
#> • Locals: (none)
#> • Environment: default
#> R_GC_MEM_GROW: 3
#> ℹ Loading environment 'default'...
#> • packages: (none)
#> • sources: mycode.R
#> • globals: (none)
#> ✖ status: failure
#> ✖ Error: 5:0: unexpected end of input
#> 3: }
#> 4: newfun <- function(x)
#>    ^
#> ℹ finishing at: 2024-12-09 19:22:52.122022 (elapsed: 0.3666 secs)
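Once mycode.R has been fixed so that it parses again, newly started tasks will pick up the corrected file; a sketch, reusing the calls from above:
id <- task_create_expr(times2(10))
task_wait(id)
task_result(id)
# 20, once the source file is valid again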
A task can also appear stuck at submitted (previous users of didehpc may recognise this as being stuck at PENDING).
This is the most annoying one, and can happen for many reasons. You can see via the web interface or the Microsoft cluster tools that your task has failed but hipercow is reporting it as pending. This happens when something has failed during the script that runs before any hipercow code runs on the cluster.
Things that have triggered this situation in the past:
There are doubtless others.
If you suspect your task has become stuck at submitted (but is not actually running any more) you should try one or more of: task_info() with your task id, which will fetch the true status from the cluster and tell you about any discrepancy; and task_log_show(id, outer = TRUE), which will show you the scheduler’s logs for this task, which may be informative.
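A sketch of both checks, using the id of the stuck task:
task_info(id)                    # compares hipercow's recorded status with the scheduler's view
task_log_show(id, outer = TRUE)  # the scheduler's own log for this task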
If your task works on your own computer but fails on the cluster, then something is different between how the cluster sees the world, and how your computer sees it.
Try to work out what differs. Are you referencing files via an absolute path that exists only on your own machine, one starting with C: for instance?
If you see errors like Error allocating a vector... or std::bad_alloc, then try and work out the memory usage of a single task. Perhaps run it locally with Task Manager (Windows) or top/htop (macOS/Linux) running, and watch to see what the memory usage is. If the task is single-core, consider the total memory used if you run 8 or 16 instances on the same cluster machine. If the total memory exceeds what is available, then behaviour will be undefined, and some tasks will likely fail.
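One rough way of doing this locally from R is to use base R's gc() counters (a sketch; mysimulation stands in for whatever your task actually runs):
gc(reset = TRUE)           # reset the "max used" memory counters
result <- mysimulation(10) # run the task locally
gc()                       # the "max used" column shows peak memory since the reset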
There are lots of possible causes of a failure at submission time, and ways that this might manifest as an error message, for example:
Error in client_parse_submit(httr_text(r), 1L) :
Job submission has likely failed; could be a login error
(we will add other error messages here as we catch them).
By the time you get here, we’ve thrown a pretty generic error because for some reason we can’t tell what has happened. Possible reasons that you might see an error like this:
You can check most of these by running
which will work through many common points of failure and report back what does and does not work. If you want help with diagnosing this sort of error, we would expect to see output from this command.
If that does work, but you are still having what looks like connection problems, then try
which will launch a simple job. If this does not work, and you want to ask for help, we would like to see the whole output of this command.
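If you are not sure which command that is, hipercow's hello-world helper is the likely candidate (an assumption on our part; check against the instructions you were originally given):
hipercow::hipercow_hello()
# submits a trivial task to check that everything works end to end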
If that works, but your actual job does not work, then something about what you are submitting is causing the problem. If you are asking for help, we will need to know something about your code; read on for the next section.
If you need help, you can ask in the “Cluster” teams channel. This is better than emailing Rich or Wes directly as they may not have time to respond, or may be on leave.
When asking for help it is really important that you make it as easy as possible for us to help you. This is surprisingly hard to do well, and we would ask that you first take a look at these two short articles:
Things we will need to know include, at a minimum, the output of hipercow::hipercow_configuration().
Too often, we will get requests for help with no information about what was run, what packages or versions are being installed, and so on. This means your message sits there until we see it; we’ll ask for clarification, and that message sits there until you see it; you respond with a little more information, and it may be days until we finally discover the root cause of your problem, by which point we’re both quite fed up. We will never complain if you provide "too much" information in a good effort to outline where your problem is.
Don’t say
Hi, I was running a cluster task, but it seems like it failed. I’m sure it worked the other day though! Do you know what the problem is?
Do say
Since yesterday, my cluster task has stopped working.
My DIDE username is alicebobson and my configuration is:
-- hipercow configuration ----- [etc]
I have set up my cluster task with
# include short script here if you can!
The task 43333cbd79ccbf9ede79556b592473c8 is one that failed with an error, and the log says:
# contents of task_log_show(id) here
With this sort of information the problem may just jump out at us, or we may be able to create the error ourselves - either way we may be able to work on the problem and get back to you with a solution rather than a request for more information.
Other tips, and reasons you may have been directed to this page:
We do want to help, but expect slower responses where we have to do lots of discovery to find out what your problem is; it will take longer until we find the time and energy to start digging. The more information you provide, the more likely it is that we can spot the error.