Nathan Hertz requested to merge WS-807-state-machine-audit into main Dec 03, 2021

Changes

Added documentation page detailing state machine definitions for the current system
Added state machine definitions to architecture table of contents
Added globbing to all relevant tables of contents for ease of maintenance
Added missing immutable_views requirement
Requesting feedback and advice on the state machines and the state machine templating idea

State machine documentation page

Capability State Machines

With the introduction of the state machine system, all of our capabilities are now defined by a state machine. The state machine for each capability in the system is defined below.

Legend

States
===================
[State Name]


Transitions/Actions
===================
- <pattern> {Action 1} {Action 2} ... {Action N} ->
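To make the legend concrete, here is a minimal sketch of how one line of this notation could be read into a structured record. The `Transition` class, `LINE_RE`, and `parse_transition` are hypothetical names made up for illustration; they are not taken from the codebase, and the regular expression only assumes the line shape shown in the legend.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transition:
    """Hypothetical record for one transition line; not a type from the codebase."""
    source: str               # state the transition leaves
    pattern: str              # event pattern that triggers the transition
    actions: Tuple[str, ...]  # action strings in order, e.g. "QueueWorkflow null"
    target: str               # state the transition enters


# [Source] - <pattern> {Action 1} ... {Action N} -> [Target]
LINE_RE = re.compile(
    r"^\[(?P<source>[^\]]+)\]\s*-\s*<(?P<pattern>[^>]+)>\s*"
    r"(?P<actions>(?:\{[^}]*\}\s*)*)->\s*\[(?P<target>[^\]]+)\]$"
)


def parse_transition(line: str) -> Transition:
    """Parse one transition line, raising ValueError for anything else."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a transition line: {line!r}")
    # Each {...} group is one action; keep them in the order written.
    actions = tuple(re.findall(r"\{([^}]*)\}", match["actions"]))
    return Transition(match["source"], match["pattern"], actions, match["target"])
```

Comment lines that start with `--` do not match the pattern, so a caller would need to skip them before parsing.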

`null`

[Start] - <capability-submitted> {QueueWorkflow null} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow null} -> [Executing]
[Executing] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Complete]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]

`null_dag`

[Start] - <capability-submitted> {QueueWorkflow null_dag} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow null_dag} -> [Executing]
[Executing] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Complete]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]

`test_download`

[Start] - <capability-submitted> {QueueWorkflow test_download} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow test_download} -> [Executing]
[Executing] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Complete]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]

`std_calibration`

-- Current
[Start] - <capability-submitted> {QueueWorkflow std_calibration} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow std_calibration} -> [Executing]
[Executing] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Awaiting QA]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]
[Awaiting QA] - <qa-pass> {QaPass} -> [Complete]
[Awaiting QA] - <qa-fail> {no action} -> [Failed]

-- Ideal
[Start] - <capability-submitted> {QueueWorkflow std_calibration} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow std_calibration} -> [Executing]
[Executing] - <workflow-complete> {AnnounceQa} -> [Awaiting QA]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]
[Awaiting QA] - <qa-pass> {QaPass} -> [Ingesting]
-- Should we run QaFail on other versions after QaPassing one? Yes, we should. But we need to be able to execute workflows on specific versions (WS-799)
[Awaiting QA] - <qa-fail> {QaFail} -> [Failed]
-- Make sure the child workflow ID and parent workflow ID are checked and aligned with each other: compare the parent ID with the execution's current workflow ID; this should match them properly
[Ingesting] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Complete]
-- Is the ingestion-complete notification getting sent properly? Test when this is implemented.
[Ingesting] - <workflow-failed> {SendMessage execution_failed} {SendNotification ingest-failed} -> [Failed]
-- New email template for ingest-failed (send to DA list). Figure out what we need to do in this case.

`std_cms_imaging`

-- Current
[Start] - <capability-submitted> {QueueWorkflow std_cms_imaging} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow std_cms_imaging} -> [Executing]
[Executing] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Complete]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]

-- Ideal
Leave these the same until we get stories for them.

QA process should be identical to std_calibration.

`std_restore_imaging`

-- Current
[Start] - <capability-submitted> {QueueWorkflow std_restore_imaging} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow std_restore_imaging} -> [Executing]
[Executing] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Complete]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]

-- Ideal
Leave these the same until we get stories for them.

QA process will need to be figured out. Calibration was already QA'ed, so do we need another QA stage for the images?

`restore_cms`

-- Current
[Start] - <capability-submitted> {QueueWorkflow restore_cms} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow restore_cms} -> [Executing]
[Executing] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Complete]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]

State Machine Templating

As the definitions above show, each of our capabilities' state machines currently falls into one of two categories: a simple run-workflow state machine or a run-workflow-with-QA state machine.

As such, this seems like a perfect opportunity to create in-code templates for these state machine categories. Doing this would let us change state machine definitions much more easily and keep track of those changes. It would also prevent many small mistakes, such as typos in capability names and event types, or in transition IDs and other foreign-key relationship fields, because these values could be computed in code and checked for correctness. Another benefit is easy creation of new capabilities: we can simply assign them a state machine template and the code will do the work for us.

A downside is that we would be strongly encouraged to think of state machines in terms of templates and categories, rather than as distinct and unique constructs in and of themselves. Capabilities with edge cases that need covering may therefore be neglected by this system (or at least not fit well within its bounds). Also, the actions for each transition would be templated as well, further reducing flexibility in edge-case scenarios.
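As a rough illustration of the typo-prevention benefit discussed above, the sketch below checks a set of transitions for two kinds of mistake: an event pattern that is not in a known list, and a source state that can never be reached from `[Start]` (which is what a misspelled state name tends to look like). The `Edge` tuple, `KNOWN_PATTERNS`, and `find_problems` are assumptions for illustration only; the pattern list is simply the set of patterns that appear on this page.

```python
from typing import Iterable, List, Tuple

# (source state, event pattern, target state); actions are left out for brevity.
Edge = Tuple[str, str, str]

# Event patterns that appear in the definitions on this page.
KNOWN_PATTERNS = {
    "capability-submitted", "resume", "workflow-complete",
    "workflow-failed", "qa-pass", "qa-fail",
}


def find_problems(edges: Iterable[Edge], start: str = "Start") -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    edges = list(edges)

    # Collect every state reachable from the start state by following edges
    # until no new state is discovered.
    reachable = {start}
    changed = True
    while changed:
        changed = False
        for source, _, target in edges:
            if source in reachable and target not in reachable:
                reachable.add(target)
                changed = True

    problems = []
    for source, pattern, _ in edges:
        if pattern not in KNOWN_PATTERNS:
            problems.append(f"unknown pattern <{pattern}> on [{source}]")
        if source not in reachable:
            problems.append(f"[{source}] is never reached from [{start}]")
    return problems
```

A template-based approach would make most of these checks unnecessary for generated machines, but a check like this could still be useful for any hand-written edge cases.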

Current Templates (that I can see)

Simple run-workflow

[Start] - <capability-submitted> {QueueWorkflow (workflow_name)} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow (workflow_name)} -> [Executing]
[Executing] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Complete]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]
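A minimal sketch of how the simple run-workflow template above might look as an in-code template. The `Transition` tuple, `simple_run_workflow`, and `render` are hypothetical names used only for illustration; the real schema (transition IDs, foreign keys, and so on) is not modeled here.

```python
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    """Hypothetical record for one line of the notation used on this page."""
    source: str
    pattern: str
    actions: Tuple[str, ...]
    target: str


def simple_run_workflow(workflow_name: str) -> List[Transition]:
    """Build the four transitions of the simple run-workflow template."""
    return [
        Transition("Start", "capability-submitted",
                   (f"QueueWorkflow {workflow_name}",), "Queued"),
        Transition("Queued", "resume",
                   (f"ExecuteWorkflow {workflow_name}",), "Executing"),
        Transition("Executing", "workflow-complete",
                   ("SendMessage execution_complete",
                    "SendNotification workflow-complete"), "Complete"),
        Transition("Executing", "workflow-failed",
                   ("SendMessage execution_failed",
                    "SendNotification workflow-failed"), "Failed"),
    ]


def render(transition: Transition) -> str:
    """Format a transition in the notation from the legend."""
    actions = " ".join("{" + action + "}" for action in transition.actions)
    return (f"[{transition.source}] - <{transition.pattern}> "
            f"{actions} -> [{transition.target}]")


if __name__ == "__main__":
    for transition in simple_run_workflow("restore_cms"):
        print(render(transition))
```

Because the workflow name is passed in once, it cannot drift between the `QueueWorkflow` and `ExecuteWorkflow` actions, which is the kind of typo the templating idea is meant to rule out.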

Run-workflow with single-stage QA (exact design still in flux) (EVLA)

[Start] - <capability-submitted> {QueueWorkflow (workflow_name)} -> [Queued]
[Queued] - <resume> {ExecuteWorkflow (workflow_name)} -> [Executing]
[Executing] - <workflow-complete> {AnnounceQa} -> [Awaiting QA]
[Executing] - <workflow-failed> {SendMessage execution_failed} {SendNotification workflow-failed} -> [Failed]
[Awaiting QA] - <qa-pass> {QaPass} -> [Ingesting]
-- Should we run QaFail on other versions after QaPassing one? Yes, we should. But we need to be able to execute workflows on specific versions (WS-799)
[Awaiting QA] - <qa-fail> {QaFail} -> [Failed]
-- Make sure the child workflow ID and parent workflow ID are checked and aligned with each other: compare the parent ID with the execution's current workflow ID; this should match them properly
[Ingesting] - <workflow-complete> {SendMessage execution_complete} {SendNotification workflow-complete} -> [Complete]
-- Is the ingestion-complete notification getting sent properly? Test when this is implemented.
[Ingesting] - <workflow-failed> {SendMessage execution_failed} {SendNotification ingest-failed} -> [Failed]
-- New email template for ingest-failed (send to DA list). Figure out what we need to do in this case.

QA could be handled with a decorator pattern, since QA will vary between capabilities but a workflow will always need to be run. We could keep the run-workflow steps and simply "attach" QA states and transitions to the end, based on programmed parameters (at the capability level).
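The "attach QA to the end" idea could be sketched roughly as follows: start from the base run-workflow transitions, redirect the successful-execution transition into QA, and append the QA and ingestion transitions from the single-stage QA template. All names here (`Transition`, `run_workflow`, `with_single_stage_qa`) are hypothetical, and since the QA design is still in flux this is only one possible shape.

```python
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    """Hypothetical record for one line of the notation used on this page."""
    source: str
    pattern: str
    actions: Tuple[str, ...]
    target: str


COMPLETE = ("SendMessage execution_complete", "SendNotification workflow-complete")
FAILED = ("SendMessage execution_failed", "SendNotification workflow-failed")
INGEST_FAILED = ("SendMessage execution_failed", "SendNotification ingest-failed")


def run_workflow(workflow_name: str) -> List[Transition]:
    """Base steps shared by every capability: queue, execute, finish or fail."""
    return [
        Transition("Start", "capability-submitted",
                   (f"QueueWorkflow {workflow_name}",), "Queued"),
        Transition("Queued", "resume",
                   (f"ExecuteWorkflow {workflow_name}",), "Executing"),
        Transition("Executing", "workflow-complete", COMPLETE, "Complete"),
        Transition("Executing", "workflow-failed", FAILED, "Failed"),
    ]


def with_single_stage_qa(base: List[Transition]) -> List[Transition]:
    """Return a new list with QA and ingestion attached; base is not modified."""
    result = []
    for transition in base:
        if transition.source == "Executing" and transition.pattern == "workflow-complete":
            # A successful execution now announces QA instead of completing.
            transition = transition._replace(actions=("AnnounceQa",),
                                             target="Awaiting QA")
        result.append(transition)
    result.extend([
        Transition("Awaiting QA", "qa-pass", ("QaPass",), "Ingesting"),
        Transition("Awaiting QA", "qa-fail", ("QaFail",), "Failed"),
        Transition("Ingesting", "workflow-complete", COMPLETE, "Complete"),
        Transition("Ingesting", "workflow-failed", INGEST_FAILED, "Failed"),
    ])
    return result
```

With this shape, a capability such as `std_calibration` would be built as `with_single_stage_qa(run_workflow("std_calibration"))`, while capabilities without QA would use `run_workflow(...)` alone.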

Edited Dec 06, 2021 by Nathan Hertz


WS-807: State machine audit
