#+TITLE: Workspace Architecture: Overview
#+AUTHOR: Daniel K Lyons and the SSA Team
#+DATE: 2019-12-09
#+SETUPFILE: https://fniessen.github.io/org-html-themes/setup/theme-readtheorg.setup
#+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="extra.css" />
#+OPTIONS: H:5 ':t

* Introduction

A key ingredient in the initiative to deliver science-ready data products is a mechanism to produce those products. The workspace system provides a bulk processing facility for this purpose. The key ideas of this facility are:

- Processing is quality-assured
- Processing is estimable, visible and cancellable
- Processing utilizes large clusters for high throughput, such as the local cluster or the public Open Science Grid
- Processing may be set up prior to the availability of input products, and will kick off when they arrive
- Processing options are edited by users and the provenance is tracked

The architecture presented here was developed using Attribute-Driven Design. The design iterations are available in the [[./Design-Iterations.org][Design Iterations document]].

The overall architecture here is that of a web application atop two services: a capability service and a lower-level workflow service that it uses. 

[[./images/overview.png]]

The workflow service provides lower level access to non-local computing in various clusters. Workflows are somewhat generic; they have their own inner structure of tasks and task dependencies, but they don't explicitly know anything about science products or the other high-level concerns of scientists and NRAO staff. The capability service is higher-level, dealing with products explicitly, handling quality assurance, and handling parameters. 

These two major services form packages, each built from smaller services:

[[./images/packages.png]]

The supporting services each have a simple role: the *notification service* sends email notifications to users, the *help desk service* facilitates creating, locating, closing and linking to science help desk tickets, the *scheduling service* allows us to perform actions routinely on a schedule, and the *estimation service* retains and publishes timing metrics for capabilities.

Alongside these services are several external systems: the archive (meaning the system which manages the NRAO archive), the authentication/authorization (A3) service, the science help desk, the messaging subsystem, HTCondor and the OpenScienceGrid. Processing is realized by making requests for capabilities.

[[./images/externals.png]]

Within the system are several shared components that act as services internally but which are not exposed: the messaging subsystem (which is shared with the archive), the notification system, and the scheduling system. Additionally, there is the estimation system, which passively collects metrics data for analytics but also exposes an API for retrieving estimates of how long work will take.

** Aside about user interfaces

I have left the workspace UI as a mostly blank box. Early on, we decided to leave the workspace UI underspecified for the sake of agility. We have interpreted requirements that explicitly mention user interaction as instead requiring functionality in the service layers which the UI can use to implement the requirements. For this reason, almost no actual work is allocated to the workspace UI. Instead, the workspace UI merely mediates the user's access to that functionality.

It is worth noting that the UI will necessarily break down into several sections based on their primary purpose and intended audience, as described by this diagram.

[[./images/ui-components.png]]

The editor interface will expose the create-edit-delete operations for capabilities and workflows, but is otherwise unmentioned in the rest of this document.

It can be assumed that the workspace UI will eventually decompose into some code running in the browser backed by some code running on a web server. The nature of this breakdown is left unspecified for now, but is likely (leveraging the strengths of the SSA team) to be based on Angular 2.0 and Python. The web developers will iterate directly with stakeholders to build a useful UI using the components designed in this architecture.

* Capability Requests and Their Submission

Let us turn now to the story of a user submitting a capability request.

** By finding data

The story begins with our user selecting some data of interest to her in the archive search interface.

[[./images/choose-data-for-processing.png]]

She then requests processing on the item she selected. She is prompted to log in. After successfully authenticating, she is prompted for some additional settings for this processing request. She then submits the request and waits for it to complete. This is represented in this diagram:

[[./images/archive-request.png]]

*** Authentication and Authorization

There are two reasons a capability request may require authorization. One is proprietary data, the other is restricted capabilities.

Most capabilities can be invoked by anyone, but some are restricted to the group that performs QA. These are called restricted capabilities. System capabilities like standard calibration (formerly CIPL) and eventually standard imaging will be restricted, because their post-QA step includes ingestion. VLASS capabilities will also likely be restricted because of the complexity of keeping track of what has been done.

The more familiar restriction has to do with proprietary data. This is why, when a request comes in, the user making the request and the data they want to operate on must be forwarded to the authorization service to be checked for access. During the proprietary period, only the observer can access the data.

Authentication is the process of confirming the /identity/ of a user. Authorization is the process of confirming that a particular user has a certain right—in our case, access to proprietary data or restricted capabilities. Resource allocation is tied to the identity of a user. There is a longstanding plan to produce a shared "Authentication/Authorization/Allocation" or "A3" service. For the time being, the workspace will have to encompass an A3 service of its own, which in time will become a proxy to the real service, once it exists.
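
To make the check concrete, here is a minimal sketch of the authorization performed when a request comes in. The A3 service methods, the capability attributes and the group name are all invented for illustration, not a commitment to any particular interface:

#+BEGIN_SRC python
# A sketch only: the A3 service does not exist yet, so this interface is invented
# for illustration. The capability and product attributes are also placeholders.

class AuthorizationError(Exception):
    """Raised when a user may not access the requested data or capability."""


def authorize_request(a3_service, user, capability, product_locator):
    """Check both kinds of restriction before a capability request is accepted."""
    # Restricted capabilities (e.g. standard calibration) are limited to QA staff.
    if capability.is_restricted and not a3_service.user_in_group(user, "qa-staff"):
        raise AuthorizationError(f"{user} may not invoke {capability.name}")

    # Proprietary data: during the proprietary period only the observer has access.
    if not a3_service.user_may_access(user, product_locator):
        raise AuthorizationError(f"{user} may not access {product_locator}")
#+END_SRC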

** By the result of another request

There is another way a user could make a request, starting from an earlier capability request. Suppose we have another user looking at some pending processing requests in the workspace itself. The user sees a request for a calibration. He wants to use that calibration to make an image, even though the calibration process isn't complete. The user chooses the request and requests processing on it. He is prompted for some additional parameters for this reprocessing request. He then submits the request and waits for it to complete.

The fact that the user could view the processing request implies that he had access to its results, so no additional authorization check was required.

[[./images/capability-request-request.png]]

In both cases the result was the same, but the starting state was different.

** By internal systems

Standard calibration (and eventually standard imaging) will work by sending capability requests to the system automatically as data is ingested into the archive. The archive has a rules engine component called amygdala, which notices new product ingestions and reacts by dispatching the CIPL ("CASA Integrated PipeLine") workflow to do automatic calibration. This diagram illustrates:

[[./images/amygdala-request.png]]

The architecture presented so far reveals some missing functionality here in the form of an update to this rules engine to make it more flexible. Several use-cases are identified, but as this was a late discovery, the plan for now must be to proceed with a small change to the rules engine to dispatch capabilities rather than workflows. We expect to revisit this later.

#+BEGIN_COMMENT
The use cases here are:

 - handling VLASS
 - handling the weekly stress test
#+END_COMMENT

** By VLASS

VLASS (the Very Large Array Sky Survey) is also a client of the capability service. VLASS processing will be implemented as a suite of extra capabilities, which in turn may or may not rely on extra VLASS workflows. VLASS does quite a bit of custom processing, which I discuss later in this document.

* Capability Structure

Capability requests have several relationships with other objects, which are shown in this diagram:

[[./images/capability-requests.png]]

Communication between end-users and analysts (or the users nominated by large projects) is mediated by the help desk system. Tickets can be created by people on either side, but are always associated with a particular request.

Every request that is ready to be executed will have at least one version and one execution. The purpose of the version is to hold onto different parameter choices. The purpose of the execution is to track a particular attempt to produce that version's result. Only one execution under a given request can be executing at a time.

* Request Processing

Once a capability request is submitted, an initial version is created and an initial execution record is created under that version. The execution record is then placed in the execution pool. The execution pool receives events from the archive about product availability, from the workspace UI about quality assurance and large allocation status, and from the workflow system. The pool routes these events to the appropriate executions, causing them to change state. Once the request reaches a Prepare-And-Run-Workflow step, it is placed in the queue for the relevant capability, where it awaits being selected for execution. Once the Prepare-And-Run-Workflow step is performed, the capability execution is returned to the pool for the Await-Workflow step, where it awaits a "workflow complete" message.

The queue is a priority queue, and executes requests in priority order. The requirements stipulate that triggered observations, target-of-opportunity or director's discretionary time count as high priority. The priority only matters inside a given queue: high priority requests of a certain capability will come before low priority requests of the same capability. If there are multiple requests with the same priority level, the one submitted first will be executed first. There is nothing explicit in this design about priority /across/ queues, for instance to make standard calibration take priority over optimized imaging. But, it would be possible to leverage HTCondor to achieve cross-queue priorities by modifying the workflow's templates (and possibly HTCondor's configuration).
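
As a sketch of the ordering rule only, priority-then-submission-time ordering amounts to a heap of tuples like the following; the priority values and identifiers are made up:

#+BEGIN_SRC python
import heapq
from datetime import datetime

# Each queue entry is (negated priority, submission time, execution id):
# heapq pops the smallest tuple first, so higher priority wins, and ties
# fall back to the earliest submission time.
def enqueue(queue, priority, submitted, execution_id):
    heapq.heappush(queue, (-priority, submitted, execution_id))

queue = []
enqueue(queue, 0, datetime(2019, 12, 1), "routine-imaging")
enqueue(queue, 9, datetime(2019, 12, 2), "target-of-opportunity")
enqueue(queue, 0, datetime(2019, 11, 30), "earlier-routine-imaging")

_, _, first = heapq.heappop(queue)   # -> "target-of-opportunity"
_, _, second = heapq.heappop(queue)  # -> "earlier-routine-imaging"
#+END_SRC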

Queues can be paused to facilitate upgrades of CASA or instrument reconfiguration. Queues may also optionally have a concurrency limit. This prevents built-up requests from flooding the cluster after resuming from a pause.

Once the execution is selected, the capability info is consulted to acquire some information about this capability, namely the step sequence, which is copied into the execution (to prevent strange behavior if the definition is changed while executions are in flight). A capability engine walks the steps of the capability sequence, executing each one in turn. The step sequence will contain some steps for waiting for products, some for waiting for user input parameters, and some for executing workflows. This shows the sequence of events:

[[./images/request-submission.png]]

The entities in play here are shown in this diagram:

[[./images/capability-execution-bdd.png]]

Each of these entities has a job:

- Capability Request :: Represents the request itself, holds all the versions and knows what the final outcome was.
- Request Version :: Represents a particular "take," the options chosen for it, and holds all the attempts to produce a result from those options.
- Capability Execution :: Represents an attempt to execute the capability with this set of options, and knows what its execution state is.
- Capability Execution Pool :: Holds all executions in an AWAIT state.
- Capability Queue :: Holds all the executions for a certain capability and runs them in priority order.
- Capability Engine :: Does the actual execution of a capability by evaluating capability steps. Concurrency is managed by the queue, which holds a number of capability engines corresponding to the concurrency limit.
- Capability Step :: Does one piece of a capability, such as launching a workflow or waiting for products or quality assurance (details below).
- Capability Sequence :: The list of capability steps that implement a capability.

There are five kinds of capability step (a minimal interface sketch follows the list):

- Await product :: broadcasts a need for a certain product and then waits for a signal from the archive or the capability system that it is available
- Prepare and run workflow :: does some work to set up and begin executing a workflow
- Await workflow :: waits for a signal that it is complete
- Await QA :: sends message that QA is needed, waits for QA status change message
- Await large allocation approval :: checks the estimated time of the request; if it's too large, waits for a signal that allocation approval is granted
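
These five kinds could plausibly share a minimal interface along the following lines, with the engine doing nothing more than walking the sequence; class and method names are illustrative, not a design commitment:

#+BEGIN_SRC python
from abc import ABC, abstractmethod

class CapabilityStep(ABC):
    """One step in a capability sequence. Hypothetical interface, for illustration only."""

    @abstractmethod
    def execute(self, execution):
        """Perform (or resume) this step for the given capability execution.

        Returns True if the step completed and the engine may advance to the
        next step, or False if the execution must return to the pool and await
        an external event (product arrival, workflow completion, QA)."""


class CapabilityEngine:
    """Walks a copy of the capability's step sequence for one execution."""

    def execute(self, execution):
        # The sequence was copied onto the execution at submission time, so a
        # later change to the capability definition cannot affect it in flight.
        while execution.current_step_index < len(execution.steps):
            step = execution.steps[execution.current_step_index]
            if not step.execute(execution):
                return  # parked in the pool; resumes when the awaited event arrives
            execution.current_step_index += 1
        execution.mark_complete()
#+END_SRC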

*** Capability execution interruption

There is a potential here for drama, if the power goes out during capability execution. While the capability info will be storing our state so that we can resume execution, we must consider what happens if a step was in some partially executed state when the power went out. What happens if we end up executing a step twice, for instance?

- Await product :: Check for the product; it's either available or not, so there is no harm repeating this step.
- Await workflow :: Check to see if the workflow is actually complete; if it is not, resume waiting. Again, no harm.
- Await QA :: Check for the QA status change; if it hasn't arrived yet, resume waiting. Still no harm.
- Prepare and run workflow :: The dangerous one. This does some calculation and then executes a workflow. If the calculation was interrupted, redoing it is harmless. If the workflow execution was started but not recorded, there is a chance that two workflows will be executing.

There does not appear, to me, to be a way for this design to result in lost work, only a way for extra processing to be executed. We'll have to think about this and how it could be detected in those cases where the capability service is restarted abruptly.

*** Preparing a workflow

Preparing a workflow requires a few steps of its own:

[[./images/prepare-execute-workflow.png]]

Here is a view in terms of the interactions with other objects:

[[./images/prepare-execute-workflow-seq.png]]

** Example: Imaging

Let's take a deeper look at an example capability. Let's say we're imaging; we have defined a workflow that fetches data and runs CASA and we have an ingestion workflow. To provide an imaging capability, we will need a calibration product, we will need to run CASA against it, and we will need to perform QA before delivering it. Here is what the corresponding capability step sequence might look like:

#+BEGIN_SRC
AWAIT PRODUCT cal://alma/...
PREPARE AND RUN WORKFLOW fetch-and-run-casa
AWAIT WORKFLOW COMPLETE
AWAIT QA
PREPARE AND RUN WORKFLOW ingest
AWAIT WORKFLOW COMPLETE
#+END_SRC

The capability engine will process this sequence in order, mostly by sending messages to other systems, as described here (bearing in mind this is an /example/ capability step sequence):

[[./images/generic-sequence.png]]

** Request and Execution States

Most of the time, what a user is interested in is actually the request /status/, which I define to be the state of the request, unless there is a currently executing step associated with one of its versions' executions, in which case, the name of that step. This simplifies the state model for requests to this:

[[./images/request-states.png]]

And executions to this:

[[./images/execution-states.png]]

The request status is thus either the request state or, if the request state is "Executing" and the corresponding execution is in "Executing Step," the currently executing step for the associated execution—for instance, fetching data, running CASA or delivering.
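
A sketch of how that status might be derived, with placeholder attribute names standing in for whatever the real model exposes:

#+BEGIN_SRC python
def request_status(request):
    """Collapse request state and execution state into the status shown to users.

    Illustrative only: the attribute names here are placeholders."""
    if request.state == "Executing":
        execution = request.current_execution()
        if execution is not None and execution.state == "Executing Step":
            # e.g. "Fetching data", "Running CASA", "Delivering"
            return execution.current_step_name
    return request.state
#+END_SRC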

[[./images/execution-status.png]]

* Request Cleanup

There is a daily cleanup task executed by the scheduler which handles the requirements here.

[[./images/daily-cleanup.png]]

The structure of this is pretty simple:

[[./images/daily-cleanup-bdd.png]]

* Workflows

Workflows are the unit of processing used by the capability service. A workflow will encompass several steps, like fetching data, running CASA and performing delivery. These steps are not limited to sequential order; they can actually form a graph. This is to enable advanced concurrency setups like map/reduce or scatter/gather processing where many concurrent jobs perform the same task on a different increment of data, as needed by (for instance) VLASS. These details are hidden behind the abstraction; clients of the workflow have no idea how their workflow is executed or where.

The workflow system itself is ignorant of things like products, versions and provenance, and there are no aggregate collections of workflow executions underneath workflow requests or versions; running a workflow is more like running a program.

Workflows accept an input parameter and files. Workflows are transformed into jobs appropriate for HTCondor and managed by HTCondor DAGMan. The sequence of steps looks like this:

[[./images/workflow-execution-act.png]]
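
To give a flavor of what that transformation produces, a three-step workflow (fetch, run CASA, deliver) might render to a DAGMan input file along these lines; the node names and submit file names are purely illustrative:

#+BEGIN_SRC
# hypothetical DAGMan input for a fetch -> CASA -> deliver workflow
JOB fetch    fetch.condor
JOB casa     run-casa.condor
JOB deliver  deliver.condor

PARENT fetch CHILD casa
PARENT casa  CHILD deliver
#+END_SRC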

Simultaneously, there is a process monitoring the HTCondor logs and generating events as the workflow process evolves. Both of these are illustrated by this sequence diagram:

[[./images/workflow-execution.png]]

In this manner, the message that a workflow is complete is sent back to the capability system, where a capability step is waiting for it, as well as to the estimation service.

** Why are we tightly-coupled to HTCondor?

The workflow system is hidden behind its own service. There is no direct dependency on the implementation of the workflows in the capability service; as far as it is concerned, a workflow is run by sending a name, a parameter and some files to a service and asking it to go.
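
In other words, from the capability service's point of view a workflow submission might look something like the following; the endpoint path and payload shape are illustrative and do not define the real API:

#+BEGIN_SRC python
import json
import requests  # any HTTP client would do; the endpoint and payload are illustrative

# Ask the workflow service to run a named workflow with one parameter blob and
# a supporting file. Nothing about the real API is implied here.
with open("metadata.json", "rb") as metadata:
    response = requests.post(
        "http://workflow-service/workflows/fetch-and-run-casa/submit",
        data={"parameter": json.dumps({"casa_version": "5.6.1"})},
        files={"metadata.json": metadata},
    )
response.raise_for_status()
workflow_id = response.json()["workflow_id"]
#+END_SRC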

Inside the workflow service, we use HTCondor's DAGMan as the implementation of workflows. We have a concrete requirement to support the Open Science Grid, which implies we must support HTCondor. Absent credible alternatives, I consider this sufficient flexibility.

* Estimation, Notification, Help Desk

** Estimation Service

When capabilities are executed, messages are sent via the messaging subsystem. The estimation service listens for these messages, correlating the request parameters with the elapsed time between request and completion, and provides an API for obtaining a rough estimate of how long particular requests might take if they were submitted.

The service API here will be a single endpoint, to which a capability and parameter can be given, returning an estimate of how long it will take to execute.
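
Usage might look roughly like this; the URL, query parameters and response fields are placeholders:

#+BEGIN_SRC python
import requests

# Hypothetical call against the single estimation endpoint described above;
# the URL, query parameters and response fields are placeholders.
response = requests.get(
    "http://estimation-service/estimate",
    params={"capability": "std-imaging", "parameter": "spw=2~8"},
)
response.raise_for_status()
print(response.json()["expected_seconds"])
#+END_SRC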

** Notification Service

There are many situations in this system where a notification may need to be sent, either to the submitting user or to people responsible for doing quality assurance. The notification system will provide a high-level API for sending these sorts of notifications.

Both the capability service and the workflow service utilize templating to generate input files. The same templating system will be used here, so that notifications can be selected and initiated with some set of parameters and proper template rendering will occur.
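
For illustration, a notification template and its rendering might look like the following, using pystache as one possible Mustache implementation for Python; the template text and field names are made up:

#+BEGIN_SRC python
import pystache  # one of several Mustache implementations for Python

# The template text and field names here are invented for illustration.
TEMPLATE = """Dear {{user_name}},

Your {{capability}} request #{{request_id}} has completed. Note that the
products have not yet been quality assured.
"""

body = pystache.render(TEMPLATE, {
    "user_name": "A. Observer",
    "capability": "standard calibration",
    "request_id": 1234,
})
print(body)
#+END_SRC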

** Help Desk Service

Scientists needing help request it using the Kayako science help desk, which mediates the communication between staff and users. The exact functionality of Kayako is unknown to our team, but what matters for our purposes is that tickets can be opened and closed and linked to. The Help Desk Service abstracts the exact nature of the science help desk from us, protecting us from change, but giving us access to two verbs, open and close, and giving us a way to track tickets via links.

Capability requests can be hooked up to one or more help desk tickets. Users and staff will both be able to use a UI in the workspace system to initiate a conversation with the other side, which will take place in the science help desk.

* Persistence with Capability Info and Workflow Info

Architecturally, how capabilities and workflows and whatnot are persisted is not especially significant. What matters in the architecture is knowing that it happens and which components are responsible for it. As shown in earlier sections, there are Capability Info and Workflow Info elements in the design which handle the lookup and persistence of capabilities and workflows respectively. Capability Info has collaborators, Project Settings and Capability Matrix, which handle some details specially.

Without getting deeply involved in the design, we can probably assume that these blocks will be backed by a relational database for the workspaces:

[[./images/database.png]]
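
As an indication of the kind of schema this implies (not a committed design), the request/version/execution relationships from earlier sections might map to SQLAlchemy models roughly as follows, with placeholder table and column names:

#+BEGIN_SRC python
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class CapabilityRequest(Base):
    __tablename__ = "capability_requests"
    id = Column(Integer, primary_key=True)
    capability_name = Column(String, nullable=False)
    state = Column(String, nullable=False)
    versions = relationship("RequestVersion", back_populates="request")

class RequestVersion(Base):
    __tablename__ = "request_versions"
    id = Column(Integer, primary_key=True)
    request_id = Column(Integer, ForeignKey("capability_requests.id"))
    parameters = Column(String)          # serialized parameter choices
    request = relationship("CapabilityRequest", back_populates="versions")
    executions = relationship("CapabilityExecution", back_populates="version")

class CapabilityExecution(Base):
    __tablename__ = "capability_executions"
    id = Column(Integer, primary_key=True)
    version_id = Column(Integer, ForeignKey("request_versions.id"))
    state = Column(String, nullable=False)
    current_step_index = Column(Integer, default=0)
    started = Column(DateTime)
    version = relationship("RequestVersion", back_populates="executions")
#+END_SRC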

* Integration with Existing Systems

This is the design, but how to get from where we are to it is a question that warrants some exploration.

The bulk of the capability system is all-new, along with the UI. These portions can be built directly without affecting the existing systems. Moving the existing workflows into this regime, however, is not straightforward, nor is deprecating the VLASS manager.

** Archive

There are two points of integration with the archive: ingestion and making capability requests.

At product ingestion time, messages are sent over the pub/sub messaging system. This system is shared between the archive and the workspace system. The workspace receives these ingestion events and looks for active requests that are awaiting the newly ingested product; if found, they are notified and move on to the next step in their step sequence.
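
A sketch of that routing, assuming a simple callback-style subscription to the messaging subsystem; the message shape and method names are invented for illustration:

#+BEGIN_SRC python
# Invented message shape and method names, purely to illustrate the routing.
def on_ingestion_event(message, execution_pool):
    """Called by the messaging subsystem when the archive ingests a product."""
    product_locator = message["product_locator"]
    for execution in execution_pool.awaiting_product(product_locator):
        # The Await-Product step for this execution is now satisfied;
        # the pool advances it to the next step in its sequence.
        execution_pool.resume(execution)
#+END_SRC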

As discussed in [[By internal systems][the section on internal systems]], the archive also has a rules engine for dispatching workflows when certain events occur, like ingestion of certain products. This system will have to be updated to send capability requests instead, and may deserve some improvements above and beyond that.

The second point of integration is in the archive's UI for sending requests. This is going to change so as to obtain capabilities from the workspace system and forward the user and requested data to the workspace system.

** Existing workflows

The existing workflows serve many of the same needs as the new ones do, just worse. The plan for migration and integration is this:

1. Refactor several existing workflow jobs into standalone executables

   This mainly pertains to delivery and running CASA. The data fetcher and ingestion are already basically standalone executables.

2. Refactor and migrate some workflow tasks into standalone executables

   There are several current workflow tasks that will probably need to be refactored into standalone utilities. Ingestion, for instance, has a preparatory step that probably needs to be converted to this.

3. Ensure that internal workflows are mapped to the new workflow service

   There are several clients of the existing workflow system, especially VLASS (discussed below) and the archive system. The workflow service will be internally-accessible to support these clients, but they will need to be updated to access the new service.

Apart from these areas, the bulk of the code in the existing workflows is either boilerplate or OODT-related cruft (scaffolding or replacement components). This code goes away, completely replaced by the new workflow service.

** VLASS

The VLASS software system has several major components:

- VLASS Manager :: A UI for handling VLASS processing
- VLASS Workflows :: A suite of workflows, used both explicitly by the VLASS manager and in a triggered fashion by the archive
- Scripts :: A poorly-defined suite of scripts that do various tasks manually for VLASS

Ultimately, it should be the case that:

- VLASS Manager is mostly replaced by the workspace UI
- VLASS workflows are entirely replaced by the capability system and its large-project support
- Reliance on one-off scripts is dramatically reduced

I argue that the end-game scenario is achievable with the design we have now:

1. Workspace UI will support allowing large projects to define their own QA personnel.
2. Large projects can also define their own capabilities
3. QA system is sufficient for large projects including VLASS
4. Request-version-execution regime maps nicely from VLASS Manager's product-version-execution system
5. Capability request composition maps nicely from VLASS Manager's product dependencies

There are some features of the VLASS Manager that do not map onto features of the capability system, which will need to be handled somehow:

- Survey / tile completion tracking ("87% of T17t01 is imaged", "99% of Epoch 1.2 is imaged")
- Generating requests for various products of various minitiles and their components

For this reason I do not think it is possible to completely replace the VLASS Manager with the workspace UI.

*** Paths forward

The way forward is to worry about VLASS post-hoc. As long as the archive continues to generate the messages which the VLASS workflows and VLASS manager expect, it would be safe to bring the entire capability system online without touching the VLASS systems. Assuming the workspace system is in-place, the migration path would then be:

1. Migrate VLASS workflows into VLASS project-specific capabilities
2. Migrate data from VLASS database into capability info
3. Remove jobs/executions/QA tabs from VLASS manager. Alternatively, make it display the same data by retrieving it from the capability service
4. Refactor VLASS scripts for generating products to generate capability requests instead

As VLASS is effectively a client of the workspace system in the new regime, doing this is probably the right approach. We could interleave development here with development on the capability system, at some increased risk but with an accelerated timeline. The safety of this would be a function of how completely the workspace system is built when the integration is attempted.

Obviously this will leave us with a VLASS Manager that is kind of a husk of its former glory. More work will be needed here, but I think solving all of VLASS's problems is probably not in scope for the workspace system.

* On Errors

As with any large system, there are a lot of ways for things to go wrong. The following are addressed by this architecture.

** Hardware failures

A key benefit of tight coupling to HTCondor is that hardware failures of running processes do not cause the work to be lost completely. HTCondor will reschedule the work onto another machine. So the most obvious kinds of hardware error are handled by HTCondor itself.

Of the external mechanisms we use, our critical dependencies are on the HTCondor cluster, the database and the messaging systems. HTCondor's own management systems can fail, in which case our workflows won't be schedulable. This will manifest in our software as capability execution failures, which can be retried later. The database system going offline would be a significant disaster for almost all of our systems, but the database is routinely backed up and has its own disaster recovery mechanisms. The messaging system has gone offline before, which causes dependent systems to block until it comes back online. This can cause availability issues but tends not to lose data.

** New resources

What happens when new HPC systems are brought online? As long as the scientific computing group (SCG) provides access to new resources via HTCondor, the workspace system will be able to utilize them. HTCondor has several features here which are likely to make it a safe bet in the long run. For compatibility with other HPC software, HTCondor provides "glide-ins"—a way of automatically setting up a minimal HTCondor environment on a single machine. This makes it very easy to support HTCondor on top of other software and hardware. Separately, for granular scheduling of work, HTCondor provides a powerful pattern-matching system, based on its "classified ad" (ClassAd) system. We anticipate using this to automatically push workflow executions into clusters that are local to the data they will be operating on.

** Networking

What effect will workspaces have on network utilization? The workspace system itself doesn't significantly affect network usage: the same kind of processing we are doing now will be done under workspaces, against a similar workload, so workspaces by itself generates little additional network traffic over and above the current workflow system.

We anticipate a significant effect from moving some processing into the Open Science Grid, whereupon processing will necessarily be nonlocal to the data. The SCG is working on a partnership with the Center for High Throughput Computing (CHTC), the authors of HTCondor, to figure out exactly how we'll need to address this. One approach would be creating data caches at Open Science Grid sites to reduce the amount of long-distance data transfer. There is probably no avoiding a significant increase in bandwidth utilization from propagating large datasets from here to OSG sites; the best we can probably hope for is to come up with a way to do it intelligently rather than wastefully, but again this will probably fall mostly on the shoulders of the SCG.

** Workflow failures

When individual workflow steps fail, HTCondor eventually cancels the workflow. Whatever work has completed so far isn't lost but is marked as having completed; the workflow can be manually rescheduled and proceed. This requires manual intervention by a human but suggests that workflows are more readily resumed than in the existing workflow system.

A number of interesting related system failures could hide behind a workflow failure: NGAS or Lustre issues, trouble with CASA versions, etc. In any event, HTCondor leaves copious logs and the messages from tasks will be available in the working directory to examine for post-mortem analysis.

** Capability execution failures

Capability steps mostly cannot fail, owing to their simplicity. The only step that can fail is one that runs a workflow, which is addressed above. The capability execution will then enter an Error state; either someone fixes the capability and restarts it (causing it to return to the Executing Step state) or the whole execution is failed.

** Capability request failures

Capability requests cannot fail, although they can be abandoned. If an execution fails, a new execution can be created; if the failure was due to some transient cause (a software misconfiguration or missing resources or something) then it will be remedied by an additional execution. If the failure was due to bad parameters, a new request version can be created with fixed parameters, possibly with input from the help desk via the ticket mechanism.

** Monitoring and alerting

As SSA systems grow larger and more complex, detecting faults and responding in a timely way is becoming a larger and larger issue. The capability system is only going to add to this, and the distributed nature of the processing is going to create new opportunities for information necessary for debugging to be mislaid.

For this reason, we anticipate bringing online a new monitoring system. This system will not be specific to the workspace, but is expected to be shared with all of the SSA software. The monitoring system will provide a simple API for publishing statistics and logs from any component in the workspace system (or any other system). The information so collected can be visualized with an open-source client such as Grafana. Pro-active monitoring and alerting can also be done. We will be conducting the research on which system to use shortly, and this document will be updated when concrete decisions are made.

* Testing Plan

The high-level approach here is to follow the plan outlined in the book /Growing Object-Oriented Software/. We will build a "walking skeleton" consisting of all the necessary interfaces as stubs. First light will be a simple capability that does nothing or nearly nothing, to exercise all the pathways. Integration tests and unit tests will follow, with the meat of the implementation of a module following the unit tests for that module.

Our general approach will be Test-Driven Development, in which the system is modeled and unit tests are designed and implemented for each object in the system, along with integration and regression testing. 

Integration testing will involve establishing and exercising expectations for interactions within and among the components. For example, the various services interface with the capability service as well as the messaging subsystem and the workflow service, which itself interacts with the messaging subsystem. We plan to use mock objects to represent the services so that the behavior of each service can be exercised without the need to instantiate and call methods on the actual objects, which could be time-consuming, difficult, and in some cases not possible. In similar fashion, every foreseeable scenario in the workspace's interaction with such entities as the archive, the science helpdesk, and others will be modeled and tested using mocks.
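
To give a flavor of this, here is a small self-contained example in the style we intend: a toy client for the estimation service tested with the service mocked out via unittest.mock, so the test runs without any network access:

#+BEGIN_SRC python
import unittest
from unittest import mock

import requests


def fetch_estimate(capability, parameter):
    """Toy client for the estimation service, just to have something to test."""
    response = requests.get(
        "http://estimation-service/estimate",
        params={"capability": capability, "parameter": parameter},
    )
    response.raise_for_status()
    return response.json()["expected_seconds"]


class EstimateClientTest(unittest.TestCase):
    @mock.patch("requests.get")
    def test_returns_expected_seconds(self, mock_get):
        # The estimation service is mocked out, so no network access happens.
        mock_get.return_value.status_code = 200
        mock_get.return_value.json.return_value = {"expected_seconds": 3600}
        self.assertEqual(fetch_estimate("std-imaging", "spw=2~8"), 3600)


if __name__ == "__main__":
    unittest.main()
#+END_SRC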

Regression testing is necessary to ensure that defects that have been addressed don't turn up again later. This can happen as a result of code changes not being committed to the source code repository, or of merging Tuesday's branch into the repo without pulling Monday's work down first.

The eventual goal is automated building and testing of the codebase, such that at regular intervals the system is automatically rebuilt to incorporate any code changes, then every single test is exercised, with immediate reporting of any failures.

* Technology Decisions

There are few technology choice surprises in this architecture. Most of the technologies we'll be using have been proven already in the archive project over many years and with VLASS.

The database backend will be PostgreSQL. The database abstraction layers will be either SQLAlchemy for Python or MyBatis for Java.

The services will be written in Java or Python, depending on which is convenient or furnishes a necessary library. If Java, JAX-RS will be used. If Python, Pyramid will be used.

The user interfaces will be written with AngularJS and a Python/Pyramid backend.

Templates will, as much as possible, use Mustache, a language-independent templating system.

One open question is how to handle metrics and proactive monitoring. There is some interest in using InfluxDB in the electronics division; another suggestion is Prometheus. As this would be a new system altogether, there is not much precedent for it in the observatory to follow. Details will be added to this document as they are found.

Tests will be written using JUnit and Mockito for Java and the built-in unittest library for Python, which also furnishes mock objects.

* Requirement Satisfaction

This section is intended to assist the CDR panel members with understanding how each requirement is satisfied. The details of how certain decisions were made as they pertain to a particular requirement can be found by consulting [[file:./Design-Iterations.org::*Requirement%20Satisfaction][the requirement satisfaction section of the design iterations document]]. The following table includes each requirement, its text, and the corresponding items from the design that address the requirement.

| Requirement    | Text                                                                                                                                                                                                                                                                                                                  | Components                                               |
|----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------|
| SRDP-L1-5.3    | If the user is not satisfied with the product (for whatever reason), they shall have the ability to return to their request or helpdesk ticket through a provided link, modify as necessary and resubmit. A simple mechanism shall be provided to request more assistance through a linked helpdesk ticket mechanism. | Request Version, Capability Execution                    |
| SRDP-L1-6.1    | When manual intervention for recalibration is required, the process shall be executed by the operations staff. The staff member shall work with the user to identify and resolve the issue and then resubmits the job for the user. At this point the process will re-enter the standard workflow.                    | Help Desk Service, Request Version, Capability Execution |
| SRDP-L1-6.2    | The archive interface shall provide status information for the user on each job, links to completed jobs, as well as the weblog for the job.                                                                                                                                                                          | Workflow Service, Workspace UI                           |
| SRDP-L1-6.3    | Batch submission of jobs shall be throttled to prevent overwhelming processing resources.                                                                                                                                                                                                                             | Capability Queue                                         |
| SRDP-L1-6.4    | The standard imaging process shall automatically be triggered for observations supported by SRDP once the standard calibration has passed quality assurance.                                                                                                                                                          | Future Product, Capability Queue, Await QA               |
| SRDP-L1-6.5    | When the single epoch calibration and imaging for all configurations are complete, the data from all configurations shall be imaged jointly.                                                                                                                                                                          | Future Product, Capability Queue                         |
| SRDP-L1-6.6    | The Time Critical flag shall persist throughout the lifecycle of the project and be made available to the data processing subsystems.                                                                                                                                                                                 | Capability Queue                                         |
| SRDP-L1-6.6.1  | Processing of time critical proposals shall begin as soon as data is available.                                                                                                                                                                                                                                       | Capability Queue, Future Product                         |
| SRDP-L1-6.6.2  | The workflow manager shall notify the PI immediately when calibration or imaging products are available, with specific notice that the products have not been quality assured.                                                                                                                                        | Notification Service                                     |
| SRDP-L1-6.6.3  | In cases of reduction failure, a high priority notification to operations shall be made so that appropriate manual mitigation can be done. Note that this may occur outside of normal business hours.                                                                                                                 | Notification Service                                     |
| SRDP-L1-6.7    | Large Project processing shall allow use of custom or modified pipelines to process the data and the project team shall be directly involved in the quality assurance process.                                                                                                                                        | Project Settings                                         |
| SRDP-L1-6.7.1  | The SRDP system shall allow use of NRAO computing resources for the processing of the large project data provided that required computing resources does not exceed the available resources (including prior commitments).                                                                                            | Project Settings                                         |
| SRDP-L1-6.8    | Once a job is created on archived data, the archive interface shall provide the user an option to modify the input parameters and review the job prior to submission to the processing queue.                                                                                                                         | Capability Request, Workspace UI                         |
| SRDP-L1-6.9    | Results from reprocessing archive data are temporary and the automated system shall have the ability to automatically enforce the data retention policy.                                                                                                                                                              | Scheduling Service, Cleanup                              |
| SRDP-L1-6.9.1  | Warnings shall be issued to the user 10 and three days prior to data removal.                                                                                                                                                                                                                                         | Scheduling Service, Cleanup Warning                      |
| SRDP-L1-6.10   | The workflow system shall automatically start the execution of standard calibration jobs.                                                                                                                                                                                                                             | Capability Request, Future Product, Standard Calibration |
| SRDP-L1-6.10.1 | It shall be possible for a user to inhibit the automatic creation of calibration jobs.  For instance after a move, prior to new antenna positions being available.                                                                                                                                                    | Capability Queue                                         |
| SRDP-L1-6.11   | The user shall be able to cancel jobs and remove all associated helpdesk tickets.                                                                                                                                                                                                                                     | Helpdesk Service, Capability Service, Workspace UI       |
| SRDP-L1-6.12   | The user shall be provided an estimate of the total latency in product creation.                                                                                                                                                                                                                                      | Estimation Service                                       |
| SRDP-L1-6.13   | The workspace system shall provide interfaces to allow review and control of the activities in the workspace.                                                                                                                                                                                                         | Workspace UI                                             |
| SRDP-L1-6.13.1 | An interface that allows users to interact with their active and historical processing requests shall be provided.                                                                                                                                                                                                    | Workspace UI                                             |
| SRDP-L1-6.13.2 | An interface providing internal overview and control of all existing workspace activities and their state for use by internal operational staff.                                                                                                                                                                      | Analyst UI                                               |
| SRDP-L1-6.14   | The system shall authenticate the user and verify authorization prior to creation of a workspace request.                                                                                                                                                                                                             | A3 Service, Capability Service                           |
| SRDP-L1-6.15   | The workspace system shall support the optional submission of jobs to open science grid through the high throughput condor system.                                                                                                                                                                                    | Workflow Service                                         |
| SRDP-L1-8      | Every product shall be assessed for quality, and those products for which the initial calibration are not judged to be of science quality should be identified for further intervention.                                                                                                                              | Analyst UI, Await QA                                     |
| SRDP-L1-8.6    | Workspaces shall permit some categories of processing to be designated as requiring QA.                                                                                                                                                                                                                               | Capability Sequence, Await QA                            |
| SRDP-L1-8.7    | Processing requests that require QA shall have to undergo a human inspection prior to being delivered to the requester or ingested into the archive.                                                                                                                                                                  | Workspace UI, Capability Sequence, Await QA              |
| SRDP-L1-8.8    | There will be a QA interface that will show requests requiring QA and allow designated users to pass/fail requests.                                                                                                                                                                                                   | Analyst UI                                               |
| SRDP-L1-8.8.1  | The QA interface will allow permitted users to revise the parameters of a request and submit new processing.                                                                                                                                                                                                          | Analyst UI, Workspace UI                                 |
| SRDP-L1-8.8.2  | Only the final QA-passed results will be delivered to the requesting user or ingested into the system.                                                                                                                                                                                                                | Capability Sequence, Await QA                            |
| SRDP-L1-8.9    | The QA interface will facilitate communication between the user performing QA and the user who submitted the processing request.                                                                                                                                                                                      | Workspace UI, Helpdesk Service                           |
| SRDP-L1-8.10   | Ops staff will be designated for performing QA on standard calibration and imaging processes, and will be able to reassign to other ops staff.                                                                                                                                                                        | Project Settings, Assignee, Analyst UI                   |
| SRDP-L1-8.10.1 | Large projects shall be able to designate their own users to perform QA on their processes.                                                                                                                                                                                                                           | Project Settings, Analyst UI                             |
| SRDP-L0-11     | The system shall support a robust and reliable process for the testing, validation, and delivery of capabilities.                                                                                                                                                                                                     | Testing Plan                                             |
| SRDP-L0-11.2   | SRDP workflows shall be executable with candidate versions of the software. The products generated by this software shall not be exposed as SRDP products in the standard data discovery interfaces.                                                                                                                  | Capability Matrix                                        |
| SRDP-L0-11.3   | It shall be possible to execute portions of the SRDP workflows to optimize testing.                                                                                                                                                                                                                                   | Testing Plan                                             |
| SRDP-L0-11.4   | It shall be possible to modify the system without losing the current execution state, or in such a way that the state information can be recaptured.                                                                                                                                                                  | Workflow Service                                         |
| SRDP-L0-11.5   | The execution environment may need to be modified, for example using a non-standard destination directory to accumulate outputs from a regression testing run.                                                                                                                                                        | Workflow Service                                         |
| SRDP-L1-11     | Metrics                                                                                                                                                                                                                                                                                                               | Metrics Service, Estimation Service                      |
| SRDP-L1-11.1   | The latency between the completion of the observation and the delivery of products shall be measured.                                                                                                                                                                                                                 | Metrics Service                                          |
| SRDP-L1-11.2   | Categories for failure shall be identified and metrics derived in order to allow the Observatory to address common failure modes.                                                                                                                                                                                     | Metrics Service, Monitoring Plan                         |
| SRDP-L1-12.6   | If the requested product is large (either in number of data sets to be processed, or implied processing time), the request shall be flagged for manual review by the SRDP operations staff.                                                                                                                           | Estimation Service                                       |
| SRDP-L1-13     | The restore use case can be used to prepare data for further processing (such as the PI driven imaging use case).                                                                                                                                                                                                     | Future Products                                          |
| SRDP-L1-6.16   | A request is not complete until the user is satisfied with the result of the processing.                                                                                                                                                                                                                              | Capability Request                                       |
| SRDP-L1-6.16.1 | Multiple revisions of the parameters are permitted and must be kept with the request.                                                                                                                                                                                                                                 | Request Version                                          |
| SRDP-L1-6.16.2 | If a job fails for some transient reason, it should be possible to re-execute it without losing information about the failed execution.                                                                                                                                                                               | Capability Execution                                     |

* Glossary

** Workspace Terms

A *product* is a set of data files of a particular type, with provenance, which could be archived.

A *capability* is a particular workflow setup, intended to accept a certain kind of product and some parameters and produce another product.

A *workflow* is a non-local process composed of steps, whose currently executing step or steps are known.

A *capability request* marries a capability to a product, representing the expectation of a new product, parameterized in a certain concrete way.

A *capability step* is a step in the process of producing a certain product.

The *capability matrix* maps CASA versions to version-specific templates, allowing us to support a variety of CASA versions.

A *capability queue* organizes requests in priority order and makes it possible to control the number of concurrent executions, or pause execution altogether.

The *project settings* holds project-specific information: custom capabilities, capability template overrides, and a list of users who may perform QA for the custom capabilities of this project.

The capability *step sequence* is the sequence of steps for running a capability. There are only a few now, like /await QA/, /prepare and run workflow/, /await workflow/ and /await product/.

A *capability engine* knows how to walk the step sequence and execute it. There are a number of these for each queue, corresponding to the concurrency limit.

The *capability info* holds the information about capabilities and capability requests.

** NRAO Jargon

- VLASS :: [[https://public.nrao.edu/vlass/][Very Large Array Sky Survey]], which is a large project here at the NRAO to map the radio sky with the modern instrument's capabilities
- CASA :: [[http://casa.nrao.edu][Common Astronomy Software Applications]] is the larger and more modern of the two in-house data reduction packages, for making images from radio data
- HTCondor :: [[https://research.cs.wisc.edu/htcondor/][Condor]] is software for "high throughput computing," which is to say, a kind of grid computing focused on processing smallish jobs on normal-ish computers, in bulk.
- SCG :: The Scientific Computing Group here at the NRAO is the group that maintains our clusters and worries about grid computing and high-performance and high-throughput computing.
- OSG :: The Open Science Grid is a publicly-funded high-throughput cluster for scientific computing
- CHTC :: The Center for High-Throughput Computing is the research organization at University of Wisconsin that is responsible for maintaining HTCondor and associated software
- NGAS :: Next-Generation Archive System is the previous generation of petabyte-scale data storage which we currently use as the principal storage backend for the archive
- CIPL :: "CASA Integrated PipeLine" is the old name for the standard calibration process

** Technical Jargon

- JWT :: JSON Web Token, a standard for transmitting authentication data between web services.

* COMMENT TODOs
** TODO Make sure requirements satisfaction matrix is propagated to Cameo, create PDF dump from it
** TODO Parameter validation?
** TODO Phase-requirement mapping (plus gap analysis)
** TODO Time criticality - how do we mark things as having high priority? (Manually, until there are requirements otherwise)
** TODO Add something about Cancellation from DI 2.6 and 9.
** TODO Discuss scalability
** TODO VLASS: Show mapping from Product Type -> Product -> Product Version -> Request/Job
** TODO Metrics, which, how stored and queried, informing staff about stalled jobs
** TODO Test plan: full environment in CV? automated regression testing? what is in scope for regression testing?
** TODO What happens to follow-ons whose predecessor is cancelled?
Probably they get cancelled as well.
** TODO Are queues dynamic? (YES)
** TODO Is there a way to pause all queues at once? (CH) should be
* COMMENT Buckets
** More discussion/documentation needed
*** ALMA (RR)
Is there a plan to scale the workspace design to include ALMA data? If so, some general considerations are:
- Does this mean that there are two Archives, one in NM and one in CV? Will the one in NM only have VLASS and the one in CV only have ALMA?
- Will processing in CV happen on HTCondor or on the NAASC lustre? And if processing is happening on the NAASC lustre, how is the messaging and state system incorporated? Can NAASC users choose where jobs are submitted and how will this work?

*** Section 4 (RR)
- Capability Request: It states that it can hold multiple versions. For a given request, does there need to be a 1:1 ratio of versions to executions? And can multiple executions on a single data set run concurrently?
- Request version: How many attempts to produce a result and what predicates a successful result? Is there a way to stop trying to produce a result based on the failure?
- Capability queue: “all the executions” each having its own version, correct?
*** Section 4.0.1:
- Prepare and run workflow: you only have to make sure two workflows on the same execution are not occurring simultaneously
*** Section 7.3:
- Helpdesks will be moving away from Kayako. How will workspaces handle this? Does it matter?
- Workspace UI and workflow should also be hooked up to the helpdesk?
*** Section 8:
- "Architecturally, how capabilities and workflows and whatnot are persisted is not especially significant”. This is perhaps untrue because it is important when trying to trace a failure through the system


** Addressed in Documentation
** Need Further Elaboration
*** Scope failure modes (RR)
One of my concerns is a general underestimation of scope failure modes and how to handle them. For example:

- What happens if CASA silently fails?
- What happens if Pipeline fails and how do we know if it is really Pipeline or a propagated error from CASA?
  (Knowing the answers to a given failure should then dictate the next step in the workflow. Is it submitted again? Does it go to manual?)
- What happens if a message is sent but not received?
- What happens if some, but not all, the products are generated?

There needs to be a way to detect some of these other failure modes, sufficient logging and traceable to find the root cause, and then a course of action to deal with it.

There seemed to be little discussion on manual processing in the event of standard mode failure. It is understood that users may submit jobs, but what happens in the instance that a standard mode fails?
- Does the job get resubmitted or sent to a DA? (It probably depends on the type of failure, see above)
- If it is sent to a DA, does it still get ingested?
- What happens if a user (PI or DA) generates better results than the standard mode? Are they re-ingested?