Plan for CIPL (WS-215)
======================

This document was created for `WS-256 <https://open-jira.nrao.edu/browse/WS-256>`__, to provide a design and plan for `WS-215 <https://open-jira.nrao.edu/browse/WS-215>`__.

- `CIPL Replacement Stories <#PlanforCIPL(WS215)-CIPLReplacementStori>`__
- `CIPL Replacement Plan <#PlanforCIPL(WS215)-CIPLReplacementPlan>`__
- `Requirements <#PlanforCIPL(WS215)-Requirements>`__
- `Design Changes <#PlanforCIPL(WS215)-DesignChanges>`__
- `Consequences <#PlanforCIPL(WS215)-Consequences>`__
- `New Restrictions <#PlanforCIPL(WS215)-NewRestrications>`__
- `New Benefits <#PlanforCIPL(WS215)-NewBenefits>`__
- `Design issues <#PlanforCIPL(WS215)-Designissues>`__
- `Why do we have these invariants? <#PlanforCIPL(WS215)-Whydowehavetheseinva>`__

CIPL Replacement Stories
------------------------

The steps under each of these stories refer to the steps enumerated in the second section.

- As a DA I want to calibrate a VLA data set (WS-207)

  - Step 1

- As a DA I want a list of calibrations (WS-215)

  - Step 2

- As a DA I want the list of calibrations to be updated automatically by ingestion of new observations

  - Step 3

- As a DA I want to be able to pass and fail versions of a calibration request by clicking a button

  - Steps 4, 5, 10, 12, 13, 19

- As a DA I want the current qaFail work to be done when I click the Fail button

  - Steps 7, 8, 9, 16a, 16d, 16e, 17, 18

- As a DA I want the current qaPass work to be done when I click the Pass button

  - Steps 6, 16c

- As a DA I want the other versions to be failed when I click the Pass button

  - Steps 11, 14, 15, 16b

- As a DA I want a script to appear in the Condor working directory that creates a new version of this calibration, reusing already-fetched files together with the modified files
- As a DA I want to be able to view and edit qa_notes.html through-the-web (a file located in the weblog)

CIPL Replacement Plan
---------------------

1. A new capability is defined: **standard calibration**

   1. The step sequence for this capability is:

      1. PREPARE AND RUN WORKFLOW standard_calibration
      2. AWAIT WORKFLOW

2. A page is created to show a list or table of requests for a given capability

   1. Each request has buttons for **submit** and **cancel**
   2. Each request shows the current state
   3. Each request has a link to the request page

3. Workspaces catches ingestion-complete events from the archive

   1. Workspaces creates, but does not submit, standard calibration capability requests in response to these
   2. DAs can press the submit button to start the request processing, or cancel to remove it
   3. DAs can pause or resume processing, or modify the concurrency limit (question: on the list page or on a separate capability definition page?)
   4. At some future date, Workspaces can be changed to both create and submit these requests

4. Capabilities gain three new fields:

   1. *Requires QA*, a boolean (true if this is a capability that requires QA)
   2. *Pass workflow*, a reference to a workflow to execute on QA Pass
   3. *Fail workflow*, a reference to a workflow to execute on QA Fail

5. The **standard calibration** capability is marked *Requires QA=True*
6. A new workflow is defined for standard-calibration-pass-qa, which does the work currently done by the qaPass utility
7. A new workflow is defined for standard-calibration-fail-qa, which does the work currently done by the qaFail utility
8. The **standard calibration** capability gets these two workflows in its *Pass/Fail workflow* fields
9. Capability requests gain two new states, *Awaiting QA* and *QA Workflow Running*, which come between *Executing* and *Complete* on QA capabilities
10. Capability requests gain a new field: *Accepted version*

    1. On capabilities without QA, this is a reference to the newest version
    2. On capabilities with QA, this is a reference to the version that was QA passed

11. A new table called *qa_workflows* is defined with the following columns (see the first sketch after this list):

    1. capability_request: the request in question
    2. capability_version: the version the workflow was executed on
    3. workflow_request: the ID of the workflow request we launched
    4. role: either passed or failed
    5. submitted_on: a date for when we submitted the workflow

12. The UI displays **QA Pass** and **QA Fail** buttons per version if the capability has the property *Requires QA*
13. Pressing **QA Pass** or **QA Fail** makes a REST call that sends a message with the selected version and the QA event (see the second sketch after this list)
14. The workflow service is changed to allow additional tracking keys to be placed on a request
15. wf_monitor is changed to pass the tracking keys back on the events it generates
16. The capability engine is modified to catch these events and perform the following work, in order (see the third sketch after this list):

    1. If there is a *Fail workflow* defined and the event is QA Fail:

       1. The state is changed to *QA Workflow Running*
       2. The *Fail workflow* is executed and tracked in the *qa_workflows* table
       3. We wait for the fail workflow to complete

    2. If the event is QA Pass:

       1. A "QA Fail" message is sent for all the other versions of this capability request

    3. If the event is QA Pass and there is a *Pass workflow* defined:

       1. Do the same things that are done for the fail workflow, only with the pass workflow

    4. The *Accepted version* is updated to point to the selected version
    5. The state is changed to *Complete*

17. The capability service is modified to send tracking keys of qa: True or qa: False to workflow requests, depending on whether they are initiated on behalf of QA
18. The capability service handles qa: False events as normal; qa: True events are used to determine whether the request is ready to be transitioned to *Complete*
19. If the capability does not have *Requires QA* set, completion of an execution updates the *Accepted version* field to point to the just-completed version.
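To make step 11 concrete, here is a minimal sketch of the *qa_workflows* table as a SQLAlchemy model. The table and column names come from the step itself; the key types, foreign-key targets, and declarative base are assumptions, not the final Workspaces schema.

.. code:: python

   # Hypothetical sketch of the qa_workflows table (plan step 11).
   # Column names follow the plan; foreign-key targets and types are assumed.
   from sqlalchemy import Column, DateTime, Enum, ForeignKey, Integer
   from sqlalchemy.orm import declarative_base

   Base = declarative_base()

   class QAWorkflow(Base):
       __tablename__ = "qa_workflows"

       id = Column(Integer, primary_key=True)
       # the capability request in question
       capability_request = Column(Integer, ForeignKey("capability_requests.id"), nullable=False)
       # the version the workflow was executed on
       capability_version = Column(Integer, nullable=False)
       # the ID of the workflow request we launched
       workflow_request = Column(Integer, ForeignKey("workflow_requests.id"), nullable=False)
       # either "passed" or "failed"
       role = Column(Enum("passed", "failed", name="qa_role"), nullable=False)
       # when we submitted the workflow
       submitted_on = Column(DateTime, nullable=False)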
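For step 13, a sketch of the call a QA button might make. The endpoint path and JSON payload here are hypothetical; only the idea of posting the selected version together with the QA event comes from the plan.

.. code:: python

   # Hypothetical sketch of the REST call behind the QA buttons (plan step 13).
   # The URL and payload shape are assumptions, not the real Workspaces API.
   import requests

   def send_qa_event(base_url: str, request_id: int, version: int, event: str):
       """POST a QA Pass/Fail event for one version of a capability request."""
       assert event in ("qa_pass", "qa_fail")
       response = requests.post(
           f"{base_url}/capability/request/{request_id}/qa",
           json={"version": version, "event": event},
       )
       response.raise_for_status()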
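Finally, a schematic sketch of the event handling in step 16. The helpers (``run_and_track``, ``send_qa_fail_message``, ``submit_workflow``, ``record_qa_workflow``, ``await_completion``) and the attribute names are hypothetical stand-ins for capability engine internals, but the control flow follows steps 16.1 through 16.5.

.. code:: python

   # Hypothetical sketch of how the capability engine reacts to a QA event
   # (plan step 16). All collaborators here are assumed stand-ins.
   def on_qa_event(request, version, event):
       capability = request.capability

       # 16.1: a QA Fail with a Fail workflow runs it and waits for it
       if event == "qa_fail" and capability.fail_workflow is not None:
           request.state = "QA Workflow Running"
           run_and_track(request, version, capability.fail_workflow, role="failed")
           return  # failing one version does not complete the request

       if event == "qa_pass":
           # 16.2: passing one version sends "QA Fail" to all the others
           for other in request.versions:
               if other is not version:
                   send_qa_fail_message(request, other)

           # 16.3: run the Pass workflow, if one is defined, like the Fail one
           if capability.pass_workflow is not None:
               request.state = "QA Workflow Running"
               run_and_track(request, version, capability.pass_workflow, role="passed")

           # 16.4 and 16.5: record the accepted version and finish the request
           request.accepted_version = version
           request.state = "Complete"

   def run_and_track(request, version, workflow, role):
       """Execute a QA workflow, record it in qa_workflows, and wait for it
       to finish (steps 16.1.2 and 16.1.3); tracking keys per step 17."""
       workflow_request = submit_workflow(workflow, tracking_keys={"qa": True})
       record_qa_workflow(request, version, workflow_request, role)
       await_completion(workflow_request)

Note that the early return on QA Fail is what lets a request accumulate several failed versions and still be passed later, which is the behavior the Design issues section below calls for.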
Requirements
------------

These are the big-ticket requirements that necessitate significant changes to the design:

- Permit multiple versions of a request to be awaiting QA
- On passing a version in a request, fail the other versions of that request
- Be able to pass previously failed requests, or fail previously passed requests, an arbitrary number of times

These additional requirements are not as big:

- Be clear about the distinction between HTCondor's working directory and CASA's working/ directory
- Use the HTC facility to keep the local weblogs directory up-to-date during the job
- Provide a simple way to start a new version based on locally-edited files in the Condor working directory (a templated retry.sh file, for instance), reusing the fetched files
- Provide a way to view and edit qa_notes.html through-the-web (a file located in the weblog)

  - Ideally: templates for this file
  - Ideally: standard messages to copy/paste in

    - Possibly: different standard messages per capability (later)

  - Web-based file upload is OK
  - Automatic detection and upload of the handful of files they actually edit might be simpler (PPR.xml, <FSID>.flagtemplate.txt, flux.csv)

Design Changes
--------------

Consequences
~~~~~~~~~~~~

There are some consequences to the winning design, which I enumerate here for the sake of the stakeholders:

New Restrictions
^^^^^^^^^^^^^^^^

- **QA can only happen at the end of a request**

  - This implies that standard imaging will have to be a follow-on request from standard calibration, so that there can be two QA steps
  - This is a consequence of the idea that QA Pass automatically fails the other versions, and that QA Pass and QA Fail can occur multiple times on one version

- Relatedly, **there can only be one flavor of QA per capability**

  - There was at one time discussion of handling a QA process in which DAs do a first pass and then AODs do the final pass. This will have to be handled through another regime (i.e. reassignment of QA duties to another person, after authentication is implemented)

New Benefits
^^^^^^^^^^^^

- New invariant: **only one version of a request can be in a passed state**
- **Each capability can have distinct QA pass and fail workflows**
- **QA state is held on the request, not the execution**
- **Audit trail of pass-fail workflows executed**

Design issues
-------------

The original design for Workspaces was that QA would be a step in the step sequence and only one version could be active at a time. A typical day for a DA doing CIPL includes some things that violate that design. They include:

1. | **Multiple active versions of a request**
   | It is normal for the DA to examine the calibration during QA and postpone making a determination until after generating another version of the calibration, just to see whether it is better with, say, different flagging.

2. | **The accepted version may not be the latest version of a request**
   | In circumstances where several calibrations were generated, there is no reason to assume that the last one is the best one. Often an earlier version is the correct one.

3. | **QA Pass implies QA Fail on the other versions**
   | If the DA does a QA Pass on version 3, versions 1-2 and 4-7 should automatically be QA Failed.

4. | **QA Pass and QA Fail have different, mutually-exclusive side effects**
   | QA Pass causes the weblog and calibration to be ingested, and marks the observation as *calibrated*. QA Fail causes the observation to be marked *do not calibrate*, and may additionally need to ingest just the weblog.
   | Prior to either of these, the observation is in a *calibrating* state.

5. | **Calibrations can be QA Failed after they have been QA Passed**
   | It is not uncommon for a week to go by with a passed calibration, only for feedback from the user or from operations to cause the DAs to mark the previously passed calibration as failed, and thus to need to generate a new version.

It should be clear that several of our design constraints are broken by these facts:

1. Version constraints:

   1. There is **only one** active version
   2. The **latest version** is the active version
   3. There cannot be active executions on inactive versions
   4. There is no mandatory acceptance step for versions

2. Capability constraints:

   1. The step sequence is sequential
   2. QA Fail causes the step sequence to abort early
   3. There is no conditional logic in the step sequence

Why do we have these invariants?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The version constraints 1.a, 1.b, and 1.c are meant to simplify reasoning about capability requests. In the VLASS manager it is very difficult to determine what the correct version of a product is, since there can be many versions, each having many executions, and there can be differences in file structure between executions within a single version. In the Workspaces system as designed until now, the last execution of the last version of a capability request is the correct one.

Constraint 1.d is imposed because we cannot expect external users to return to the system to mark their satisfaction with a request by some positive action. We have to assume that they will come back to complain or get help only if the result is unsatisfactory for some reason, thus necessitating a new version. So we have to assume that a request is complete when the current version has a completed execution, until we are informed otherwise.

The capability constraints under 2 are meant to keep the system simple and prevent the step sequence from turning into a full-blown programming language.