Draft: Flirting with workflow recovery

Charlotte Hausman requested to merge reacquire_monitor into main

The workflow service doesn't currently recover running workflows when the container is restarted. I'm fairly sure that is due to losing the wf_monitor process when the container goes down. Since the request is already running in HTCondor at that point, it still completes, but Workspaces never finds out because the monitor process is gone.

If this works, then on container restart or deployment it would find any workflow requests that still appear to be in flight and restart a monitor process for each of them.
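As a rough illustration of the idea (not the code in this branch), the startup recovery pass could look something like the sketch below. Everything named here is an assumption made for the example: the `WorkflowRequest` fields, the `TERMINAL_STATES` values, and the `start_monitor` callable that stands in for however wf_monitor actually gets launched.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical terminal states; the real service may use different names.
TERMINAL_STATES = {"Complete", "Error"}


@dataclass
class WorkflowRequest:
    """Minimal stand-in for a workflow request record (hypothetical fields)."""
    request_id: int
    state: str
    log_path: str  # assumed: the HTCondor log a monitor would follow


def find_in_flight(requests: Iterable[WorkflowRequest]) -> List[WorkflowRequest]:
    """Return requests that have not reached a terminal state."""
    return [r for r in requests if r.state not in TERMINAL_STATES]


def reacquire_monitors(
    requests: Iterable[WorkflowRequest],
    start_monitor: Callable[[WorkflowRequest], None],
) -> List[int]:
    """Start a monitor for each in-flight request; return the ids that succeeded.

    `start_monitor` is injected so the sketch does not assume how the real
    monitor process is launched.
    """
    restarted: List[int] = []
    for req in find_in_flight(requests):
        try:
            start_monitor(req)
        except Exception as exc:
            # One bad request should not block recovery of the others.
            print(f"could not restart monitor for request {req.request_id}: {exc}")
            continue
        restarted.append(req.request_id)
    return restarted
```

Calling `reacquire_monitors(...)` once from the service's startup hook, with the full list of stored requests, would cover both plain restarts and deployments. Passing the launcher in as a callable also keeps the selection logic easy to test without HTCondor.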
