Plan for CIPL (WS-215)
======================

This document was created for `WS-256 <https://open-jira.nrao.edu/browse/WS-256>`__, to provide a design and plan for `WS-215 <https://open-jira.nrao.edu/browse/WS-215>`__.

- `CIPL Replacement Stories <#PlanforCIPL(WS215)-CIPLReplacementStori>`__
- `CIPL Replacement Plan <#PlanforCIPL(WS215)-CIPLReplacementPlan>`__
- `Requirements <#PlanforCIPL(WS215)-Requirements>`__
- `Design Changes <#PlanforCIPL(WS215)-DesignChanges>`__
- `Consequences <#PlanforCIPL(WS215)-Consequences>`__
- `New Restrictions <#PlanforCIPL(WS215)-NewRestrications>`__
- `New Benefits <#PlanforCIPL(WS215)-NewBenefits>`__
- `Design issues <#PlanforCIPL(WS215)-Designissues>`__
- `Why do we have these invariants? <#PlanforCIPL(WS215)-Whydowehavetheseinva>`__

CIPL Replacement Stories
------------------------

The steps under each of these stories refer to the steps enumerated in the second section.

- As a DA I want to calibrate a VLA data set (WS-207)

  - Step 1

- As a DA I want a list of calibrations (WS-215)

  - Step 2

- As a DA I want the list of calibrations to be updated automatically by ingestion of new observations

  - Step 3

- As a DA I want to be able to pass and fail versions of a calibration request by clicking a button

  - Steps 4, 5, 10, 12, 13, 19

- As a DA I want the current qaFail work to be done when I click the Fail button

  - Steps 7, 8, 9, 16a, 16d, 16e, 17, 18

- As a DA I want the current qaPass work to be done when I click the Pass button

  - Steps 6, 16c

- As a DA I want the other versions to be failed when I click the Pass button

  - Steps 11, 14, 15, 16b

- As a DA I want a script to appear in the Condor working directory that creates a new version of this calibration, reusing already-fetched files together with the modified files
- As a DA I want to be able to view and edit qa_notes.html through-the-web (a file located in the weblog)

CIPL Replacement Plan
---------------------

1. A new capability is defined: **standard calibration**

   1. The step sequence for this capability is:

      1. PREPARE AND RUN WORKFLOW standard_calibration
      2. AWAIT WORKFLOW

2. A page is created to show a list or table of requests for a given capability

   1. Each request has buttons for **submit** and **cancel**
   2. Each request shows the current state
   3. Each request has a link to the request page

3. Workspaces catches ingestion-complete events from the archive

   1. Workspaces creates, but does not submit, standard calibration capability requests in response to these
   2. DAs can press the submit button to start the request processing, or cancel to remove it
   3. DAs can pause or resume processing, or modify the concurrency limit (question: on the list page or on a separate capability definition page?)
   4. At some future date, Workspaces can be changed to both create and submit these requests

4. Capabilities gain three new fields:

   1. *Requires QA*, a boolean (true if this is a capability that requires QA)
   2. *Pass workflow*, a reference to a workflow to execute on QA Pass
   3. *Fail workflow*, a reference to a workflow to execute on QA Fail

5. The **standard calibration** capability is marked *Requires QA=True*
6. A new workflow is defined for standard-calibration-pass-qa, which does the work currently done by the qaPass utility
7. A new workflow is defined for standard-calibration-fail-qa, which does the work currently done by the qaFail utility
8. The **standard calibration** capability gets these two workflows in its *Pass/Fail workflow* fields
9. Capability requests gain two new states, *Awaiting QA* and *QA Workflow Running*, which come between *Executing* and *Complete* on QA capabilities
10. Capability requests gain a new field: *Accepted version*

    1. On capabilities without QA, this is a reference to the newest version
    2. On capabilities with QA, this is a reference to the version that was QA passed

11. A new table called *qa_workflows* is defined with the following columns (see the first sketch after this list):

    1. capability_request: the request in question
    2. capability_version: the version the workflow was executed on
    3. workflow_request: the ID of the workflow request we launched
    4. role: either passed or failed
    5. submitted_on: a date for when we submitted the workflow

12. The UI displays **QA Pass** and **QA Fail** buttons per version if the capability has the property *Requires QA*
13. Pressing **QA Pass** or **QA Fail** makes a REST call that sends a message with the selected version and the QA event (see the second sketch after this list)
14. The workflow service is changed to allow additional tracking keys to be placed on a request
15. wf_monitor is changed to pass the tracking keys back on the events it generates
16. The capability engine is modified to catch these events and perform the following work, in order (see the third sketch after this list):

    1. If there is a *Fail workflow* defined and the event is QA Fail:

       1. The state is changed to *QA Workflow Running*
       2. The *Fail workflow* is executed and tracked in the *qa_workflows* table
       3. We wait for the fail workflow to complete

    2. If the event is QA Pass:

       1. A "QA Fail" message is sent for all the other versions of this capability request

    3. If the event is QA Pass and there is a *Pass workflow* defined:

       1. Do the same things that are done for the fail workflow, only with the pass workflow

    4. The *Accepted version* is updated to point to the selected version
    5. The state is changed to *Complete*

17. The capability service is modified to send tracking keys of qa: True or qa: False to workflow requests, depending on whether they are initiated on behalf of QA
18. The capability service handles qa: False events as normal; qa: True events are used to determine whether the request is ready to be transitioned to *Complete*
19. If the capability does not have *Requires QA* set, completion of an execution updates the *Accepted version* field to point to the just-completed version.
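To make step 11 concrete, here is a minimal sketch of the *qa_workflows* table as a SQLAlchemy model. The table and column names come from the step itself; the key types, foreign-key targets, and declarative base are assumptions, not the final Workspaces schema.

.. code:: python

   # Hypothetical sketch of the qa_workflows table (plan step 11).
   # Column names follow the plan; foreign-key targets and types are assumed.
   from sqlalchemy import Column, DateTime, Enum, ForeignKey, Integer
   from sqlalchemy.orm import declarative_base

   Base = declarative_base()

   class QAWorkflow(Base):
       __tablename__ = "qa_workflows"

       id = Column(Integer, primary_key=True)
       # the capability request in question
       capability_request = Column(Integer, ForeignKey("capability_requests.id"), nullable=False)
       # the version the workflow was executed on
       capability_version = Column(Integer, nullable=False)
       # the ID of the workflow request we launched
       workflow_request = Column(Integer, ForeignKey("workflow_requests.id"), nullable=False)
       # either "passed" or "failed"
       role = Column(Enum("passed", "failed", name="qa_role"), nullable=False)
       # when we submitted the workflow
       submitted_on = Column(DateTime, nullable=False)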
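For step 13, a sketch of the call a QA button might make. The endpoint path and JSON payload here are hypothetical; only the idea of posting the selected version together with the QA event comes from the plan.

.. code:: python

   # Hypothetical sketch of the REST call behind the QA buttons (plan step 13).
   # The URL and payload shape are assumptions, not the real Workspaces API.
   import requests

   def send_qa_event(base_url: str, request_id: int, version: int, event: str):
       """POST a QA Pass/Fail event for one version of a capability request."""
       assert event in ("qa_pass", "qa_fail")
       response = requests.post(
           f"{base_url}/capability/request/{request_id}/qa",
           json={"version": version, "event": event},
       )
       response.raise_for_status()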
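Finally, a schematic sketch of the event handling in step 16. The helpers (``run_and_track``, ``send_qa_fail_message``, ``submit_workflow``, ``record_qa_workflow``, ``await_completion``) and the attribute names are hypothetical stand-ins for capability engine internals, but the control flow follows steps 16.1 through 16.5.

.. code:: python

   # Hypothetical sketch of how the capability engine reacts to a QA event
   # (plan step 16). All collaborators here are assumed stand-ins.
   def on_qa_event(request, version, event):
       capability = request.capability

       # 16.1: a QA Fail with a Fail workflow runs it and waits for it
       if event == "qa_fail" and capability.fail_workflow is not None:
           request.state = "QA Workflow Running"
           run_and_track(request, version, capability.fail_workflow, role="failed")
           return  # failing one version does not complete the request

       if event == "qa_pass":
           # 16.2: passing one version sends "QA Fail" to all the others
           for other in request.versions:
               if other is not version:
                   send_qa_fail_message(request, other)

           # 16.3: run the Pass workflow, if one is defined, like the Fail one
           if capability.pass_workflow is not None:
               request.state = "QA Workflow Running"
               run_and_track(request, version, capability.pass_workflow, role="passed")

           # 16.4 and 16.5: record the accepted version and finish the request
           request.accepted_version = version
           request.state = "Complete"

   def run_and_track(request, version, workflow, role):
       """Execute a QA workflow, record it in qa_workflows, and wait for it
       to finish (steps 16.1.2 and 16.1.3); tracking keys per step 17."""
       workflow_request = submit_workflow(workflow, tracking_keys={"qa": True})
       record_qa_workflow(request, version, workflow_request, role)
       await_completion(workflow_request)

Note that the early return on QA Fail is what lets a request accumulate several failed versions and still be passed later, which is the behavior the Design issues section below calls for.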
Requirements
------------

These are the big-ticket requirements that necessitate significant changes to the design:

- Permit multiple versions of a request to be awaiting QA
- On passing a version in a request, fail the other versions of that request
- Be able to pass previously failed requests, or fail previously passed requests, an arbitrary number of times

These additional requirements are not as big:

- Be clear about the distinction between HTCondor's working directory and CASA's working/ directory
- Use the HTC facility to keep the local weblogs directory up-to-date during the job
- Provide a simple way to start a new version based on locally-edited files in the Condor working directory (a templated retry.sh file, for instance), reusing the fetched files
- Provide a way to view and edit qa_notes.html through-the-web (a file located in the weblog)

  - Ideally: templates for this file
  - Ideally: standard messages to copy/paste in

    - Possibly: different standard messages per capability (later)

  - Web-based file upload is OK
  - Automatic detection and upload of the handful of files they actually edit might be simpler (PPR.xml, <FSID>.flagtemplate.txt, flux.csv)

Design Changes
--------------

Consequences
~~~~~~~~~~~~

There are some consequences to the winning design, which I enumerate here for the sake of the stakeholders:

New Restrictions
^^^^^^^^^^^^^^^^

- **QA can only happen at the end of a request**

  - This implies that standard imaging will have to be a follow-on request from standard calibration, so that there can be two QA steps
  - This is a consequence of the idea that QA Pass automatically fails the other versions, and that QA Pass and QA Fail can occur multiple times on one version

- Relatedly, **there can only be one flavor of QA per capability**

  - There was at one time discussion of handling a QA process in which DAs do a first pass and then AODs do the final pass. This will have to be handled through another regime (i.e. reassignment of QA duties to another person, after authentication is implemented)

New Benefits
^^^^^^^^^^^^

- New invariant: **only one version of a request can be in a passed state**
- **Each capability can have distinct QA pass and fail workflows**
- **QA state is held on the request, not the execution**
- **Audit trail of pass-fail workflows executed**

Design issues
-------------

The original design for Workspaces was that QA would be a step in the step sequence and only one version could be active at a time. A typical day for a DA doing CIPL includes some things that violate that design. They include:

1. | **Multiple active versions of a request**
   | It is normal for the DA to examine the calibration during QA and postpone making a determination until after generating another version of the calibration, just to see whether it is better with, say, different flagging.

2. | **The accepted version may not be the latest version of a request**
   | In circumstances where several calibrations were generated, there is no reason to assume that the last one is the best one. Often an earlier version is the correct one.

3. | **QA Pass implies QA Fail on the other versions**
   | If the DA does a QA Pass on version 3, versions 1-2 and 4-7 should automatically be QA Failed.

4. | **QA Pass and QA Fail have different, mutually-exclusive side effects**
   | QA Pass causes the weblog and calibration to be ingested, and marks the observation as *calibrated*. QA Fail causes the observation to be marked *do not calibrate*, and may additionally need to ingest just the weblog.
   | Prior to either of these, the observation is in a *calibrating* state.

5. | **Calibrations can be QA Failed after they have been QA Passed**
   | It is not uncommon for a week to go by with a passed calibration, only for feedback from the user or from operations to cause the DAs to mark the previously passed calibration as failed, and thus to need to generate a new version.

It should be clear that several of our design constraints are broken by these facts:

1. Version constraints:

   1. There is **only one** active version
   2. The **latest version** is the active version
   3. There cannot be active executions on inactive versions
   4. There is no mandatory acceptance step for versions

2. Capability constraints:

   1. The step sequence is sequential
   2. QA Fail causes the step sequence to abort early
   3. There is no conditional logic in the step sequence

Why do we have these invariants?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The version constraints 1.a, 1.b, and 1.c are meant to simplify reasoning about capability requests. In the VLASS manager it is very difficult to determine what the correct version of a product is, since there can be many versions, each having many executions, and there can be differences in file structure between executions within a single version. In the Workspaces system as designed until now, the last execution of the last version of a capability request is the correct one.

Constraint 1.d is imposed because we cannot expect external users to return to the system to mark their satisfaction with a request by some positive action. We have to assume that they will come back to complain or get help only if the result is unsatisfactory for some reason, thus necessitating a new version. So we have to assume that a request is complete when the current version has a completed execution, until we are informed otherwise.

The capability constraints under 2 are meant to keep the system simple and prevent the step sequence from turning into a full-blown programming language.