Commit b3969d54 authored by Daniel Lyons, committed by Daniel Lyons

Add ARCHITECTURE.md for future products

Separate architecture from readme in delivery
# Delivery Architecture
What is delivery? Delivery is what happens after the active processing portion of the workflow concludes. It is the
step that moves the retrieved or generated products from the processing area to a place where they can be accessed by
the requesting user.
Most workflows proceed by retrieving some files from NGAS and running CASA on those files to produce new products. The
files are large and CASA is quite heavy, so we retrieve the files into a spool area on the Lustre filesystem and then
launch the CASA jobs on the cluster. Once CASA is finished, the files the user wants are still sitting in that spool
area on Lustre. Delivery is what gets the files from there to where the user can retrieve them.
The simplest kind of delivery is just copying files from the spool area to another location—a mere `cp`. However, we
have several complications:
- CASA mandates a certain filesystem layout for the spool area
- The filesystem layout of the delivery destination varies based on the _type_ of the product
- Users can optionally request `tar` archives
- Users can request delivery to their own areas in Lustre
- Not specifying a delivery location implies creating a unique location under a web root
We also want to be somewhat flexible in case new streaming kinds of deliveries are mandated in the future, such as
Globus (formerly GridFTP).
The result is that the behavior of the delivery process, which is fundamentally `cp`, varies according to both the
options given by the user and facts about the data we happen to be delivering.
## Handling files
At the bottom of every delivery is the process of being handed files and told to deliver them. The
_Destination_ system is the core of this portion of the process. The goal here is to decouple the idea of "here is a
file to deliver" from the details of how that delivery happens. We have one concrete class here, `LocalDestination`,
which represents the common `cp` case of copying a file into the destination. If the simplest delivery
is `cp source dest`, you can think of `LocalDestination` as embodying the idea of `cp ... dest`.
The _Destination_ classes make no sense on their own; their purpose is to be passed around to other objects in the
system that know about files that need to be delivered. The _Destination_ classes just hide the details about where
those files are actually going and how they're getting there.
If we were going to support something like Globus, I expect it would appear as a peer of `LocalDestination`, as another
concrete implementation of `Destination`.
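To make this concrete, here is a minimal sketch of what the `Destination` interface and `LocalDestination` might look
like. Only the class names and the `close()` method come from the design above; the `add_file` method, its signature,
and the constructor arguments are illustrative assumptions.

```python
# Minimal sketch of the Destination idea. Only Destination, LocalDestination,
# and close() are named in the design; add_file and its signature are assumed.
import pathlib
import shutil
from abc import ABC, abstractmethod


class Destination(ABC):
    """Something that accepts files and puts them wherever 'there' happens to be."""

    @abstractmethod
    def add_file(self, file: pathlib.Path, relative_path: str) -> None:
        """Deliver a single file under the given relative path."""

    @abstractmethod
    def close(self) -> None:
        """Signal that no more files are coming, so finalization can happen."""


class LocalDestination(Destination):
    """The plain `cp ... dest` case: copy each file into a local directory."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def add_file(self, file: pathlib.Path, relative_path: str) -> None:
        target = self.path / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(file, target)

    def close(self) -> None:
        pass  # nothing to finalize for a plain copy
```

A Globus destination would slot in as one more subclass of `Destination`, leaving everything that calls `add_file`
untouched.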
### Checksums and compression
Along these lines, one can think of a checksum file as just another file to be added to the destination. Because the
`Destination` is 1) handed every file to be delivered and 2) knows where the files are ultimately going to be placed, we
can handle creating a checksum file as a kind of "pass-through" step that happens automatically. The algorithm would
look something like this:
1. Make a checksum wrapper for the local destination
2. For every file we get asked to deliver, calculate its checksum before handing it off to the wrapped destination for
delivery
3. After we are done delivering files, pass a fake file containing the checksums to the wrapped destination
This kind of "wrapper" or "pass-through" thing happens often enough in object-oriented programming that it is called
the "Decorator pattern." We can handle compression the same way:
1. Make a tar archive in a scratch area somewhere
2. For every file we get asked to deliver, instead place it in the archive in the scratch area
3. After we are done delivering files, finalize the archive and pass it to the wrapped destination
The key idea here is that the next part of the system which finds files to deliver has _no idea_ about whether we are
using compression or calculating checksums or not—in fact, these wrappers are stackable. The part of the system that
finds files to deliver just passes them to the destination, and as long as the stack of wrappers and destinations has
been constructed by someone in the right order, everything will happen as it should.
The purpose of the `DestinationBuilder` is to ensure that the stack is constructed in the right way. The reason
`Destination` has a `close()` method is for these wrappers to know when we are done delivering files so they can take
their finalization steps.
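As a sketch of how the wrappers compose, here is what a checksum decorator might look like, building on the
hypothetical `add_file` method from the sketch above; the SHA-256 choice and the manifest filename are also assumptions.

```python
# Sketch of the Decorator pattern applied to Destination. The wrapped
# destination never knows that checksums are being collected.
import hashlib
import pathlib
import tempfile


class ChecksumDecorator(Destination):
    """Wraps another Destination, recording a checksum for every file that
    passes through, then delivers a checksum manifest at close()."""

    def __init__(self, underneath: Destination):
        self.underneath = underneath
        self.sums = []

    def add_file(self, file: pathlib.Path, relative_path: str) -> None:
        digest = hashlib.sha256(file.read_bytes()).hexdigest()
        self.sums.append(f"{digest}  {relative_path}")
        self.underneath.add_file(file, relative_path)

    def close(self) -> None:
        # Pass the accumulated checksums along as one more file to deliver.
        with tempfile.NamedTemporaryFile("w", delete=False) as manifest:
            manifest.write("\n".join(self.sums) + "\n")
        self.underneath.add_file(pathlib.Path(manifest.name), "SHA256SUMS")
        self.underneath.close()
```

A tar wrapper follows the same shape: divert incoming files into an archive in a scratch area, then hand the finished
archive to the wrapped destination at `close()`. Because every wrapper is itself a `Destination`, the
`DestinationBuilder` can stack them in whatever order the user's options call for.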
## Handling products
If you look at
the [delivery directory requirements](https://open-confluence.nrao.edu/display/SPR/Delivery+Directory+Improvements),
you'll see that there are a number of requirements to group things together based on their project or their telescope,
and the directory names are based on the type of product. Knowing what you have in hand affects the layout in the
delivery directory. This means that we are not always going to have a straightforward `cp` command, because the way
files rest in the spool area doesn't necessarily match the way that they need to be laid out in the delivery directory.
The key idea here is that something, eventually, knows what each product _is_, and the knowledge about how that _type_ is
delivered should live with that _type_, rather than being spread around the system. Execution blocks should know what
execution blocks are supposed to look like when they get delivered; images should know what images should look like when
they are delivered, and so forth. If a new type of product is invented, supporting a wacky delivery format for that
product should be a matter of defining that product type and adding the logic just to that product. This is why we have
a `SpooledProduct` with a single method: `deliver_to(Destination)`. We expect to have a driver that at some level is
passing a destination to each of these products and saying, "write yourself to this destination."
This suggests that when we say "deliver from here to there," we are not saying the same thing as `cp`, which says "copy
these files from here to there"; we are really saying "copy all the products from here to there, according to how each
of these products _should_ be copied." In the beginning, a simple product like an execution block _will_ simply deliver
the files in its directory as-is, but as we support more complex products like OUS requests with images, more
interesting things will happen.
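A sketch of what this might look like in code, reusing the hypothetical `Destination` from the earlier sketches; only
`SpooledProduct` and `deliver_to(Destination)` are named in the design, the rest is illustrative.

```python
# Sketch of products that know how to deliver themselves. Only SpooledProduct
# and deliver_to come from the design above; ExecutionBlock's layout is assumed.
import pathlib
from abc import ABC, abstractmethod


class SpooledProduct(ABC):
    """A product sitting in the spool area that knows how it should be delivered."""

    @abstractmethod
    def deliver_to(self, destination: Destination) -> None:
        ...


class ExecutionBlock(SpooledProduct):
    """The simple case: deliver every file under the execution block directory as-is."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def deliver_to(self, destination: Destination) -> None:
        for file in self.path.rglob("*"):
            if file.is_file():
                # Keep the execution block's directory name in the delivered layout.
                destination.add_file(file, str(file.relative_to(self.path.parent)))
```

A future image or OUS product would override `deliver_to` to impose whatever layout its type requires, without touching
the finder or the destinations.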
## Finding products
How will we know what the products are that need to be delivered? We can assume we are given a source directory with
products in it, but how do we enumerate them in order to deliver them? The most straightforward answer is that we can simply
iterate the entire directory and match filename patterns with product types; if it ends with `.ms` it's a measurement
set, if it looks like `PPR.xml` it's a pipeline request, etc. Doing this amounts to having a dispatch table of common
filename patterns, which is tedious, but exhaustive and gives our code a fair amount of control.
There is a second way to figure out the products, which is by examining CASA's `piperesults` output file. This file
isn't necessarily present (after all, CASA is not _required_ for every workflow) so this method cannot ever be the
_only_ means of determining the products. But it may eventually be a requirement that we support using the
`piperesults` file. So rather than having a single class here called `ProductFinder`, we instead have an interface
called `ProductFinder` and a `HeuristicProductFinder` that does the filename dispatch approach and a
`PiperesultsProductFinder` that uses the `piperesults` file to figure it out.
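A sketch of the finder side, building on the product sketch above; the dispatch table, the pattern syntax, and the
reuse of `ExecutionBlock` as a stand-in product type are all assumptions.

```python
# Sketch of the ProductFinder interface and the filename-dispatch heuristic.
# The pattern table is illustrative; the real one maps each pattern to its
# own SpooledProduct subclass.
import fnmatch
import pathlib
from abc import ABC, abstractmethod
from typing import Iterator


class ProductFinder(ABC):
    """Knows how to enumerate the products sitting in a spool directory."""

    @abstractmethod
    def find_products(self) -> Iterator[SpooledProduct]:
        ...


class HeuristicProductFinder(ProductFinder):
    """Matches filenames against a dispatch table of patterns and product types."""

    PATTERNS = [
        ("*.ms", ExecutionBlock),     # measurement set; stand-in product type
        ("PPR.xml", ExecutionBlock),  # pipeline request; stand-in product type
    ]

    def __init__(self, spool_dir: pathlib.Path):
        self.spool_dir = spool_dir

    def find_products(self) -> Iterator[SpooledProduct]:
        for entry in self.spool_dir.iterdir():
            for pattern, product_type in self.PATTERNS:
                if fnmatch.fnmatch(entry.name, pattern):
                    yield product_type(entry)
                    break
```

A `PiperesultsProductFinder` would implement the same `find_products` method but read the `piperesults` file instead of
walking the directory.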
## Bringing it all together
So we have a system that finds products, products that know how to write themselves to a destination, and
destinations that know how to handle local filesystem writes, compression and checksumming. This is most of what is
needed. We can see now that we want to have a main loop that looks like this:

    for product in finder.find_products():
        product.deliver_to(destination)

What is still missing is a small amount of plumbing to get us from here to there. We need a device for processing
the command line arguments. Some aspects of delivery are based on user-supplied options: whether we are producing tar
archives or not, whether we are delivering the raw data retrieved by the data fetcher or the products generated by
CASA. Eventually we will have to support a local delivery command line option. Basically, anything the user chooses
in the archive UI that affects delivery is going to arrive to us through the command line options. So we have to add
a command line parser, which we have in `Context`.
A few lessons learned from the legacy delivery system are also reflected in the `Context`. We assume that a few
"services" are available in the `Context` to the `Destination` and `ProductFinder` implementations. For web delivery, we
will eventually need to generate random codes for the URL, but we want those codes to be stable throughout the delivery
process, so there is a way to do that in the `Context`. Temporary file creation is also provided via the `Context`,
which is something the tar and checksum wrappers will eventually need. So the `Context` is available to these classes
at construction time so they can call these services as needed, or peek at command line arguments they may care about.
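A sketch of what the `Context` might provide, with the caveat that every flag name and method name here is an
assumption:

```python
# Sketch of the Context: command-line parsing plus the small "services" the
# wrappers need. Flag names and method names are illustrative assumptions.
import argparse
import os
import pathlib
import secrets
import tempfile


class Context:
    def __init__(self, argv):
        parser = argparse.ArgumentParser(description="deliver products from the spool area")
        parser.add_argument("--tar", action="store_true",
                            help="deliver tar archives rather than loose files")
        parser.add_argument("--raw", action="store_true",
                            help="deliver the fetched raw data rather than CASA products")
        parser.add_argument("--local-destination", type=pathlib.Path,
                            help="deliver to this path instead of the web root")
        self.args = parser.parse_args(argv)
        self._token = None

    def token(self) -> str:
        """Random code for web-delivery URLs; generated once, then stable for the run."""
        if self._token is None:
            self._token = secrets.token_hex(16)
        return self._token

    def create_tempfile(self, suffix: str = "") -> pathlib.Path:
        """Scratch files for the tar and checksum wrappers."""
        handle, name = tempfile.mkstemp(suffix=suffix)
        os.close(handle)
        return pathlib.Path(name)
```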
And that's the theory behind delivery in a nutshell.
# Future Product Architecture
When I request data from the archive, the archive goes and retrieves it. What about if I want to retrieve the data, then
calibrate it, then split the data into several pieces and image each piece? Each of those requests is an independent
_capability request_, but how do we arrange for them to occur without making the user wait for the first request to
complete before requesting the second, and the second before the third?
How can the DAs set up a calibration pipeline, so that whenever a new piece of data is archived, it triggers a
calibration capability? How can a user arrange for their custom processing steps to be executed as soon as their data is
ingested into the archive?
The answer to these questions is the future product system.
## What is a product?
To the archive, a product is something that has a _science product locator_. Science product locators are
URIs and look like this: `uid://alma/execblock/39113b0e-2f2e-4a73-8c95-c75c18b98ee9`.
Every product locator string contains the instrument that generated the product, the "type" of the product that was
generated (`execblock` and `calibration` are common possibilities) and then a globally unique identifier (GUID). If you
have this locator, you have the master key for the archive for this product and can access it and cause it to be
downloaded.
It is **extremely important** that you not try to parse product locators. They are to be treated as opaque identifiers
by everything except the product locator service in the archive.
## What is a future product?
The future product system is a way of referring to products that already exist as well as products that do not yet
exist. Future products exist in one of two states: resolved or not-yet-resolved. The first "base case" kind of future
product wraps a science product locator and begins life resolved. The second "base case" is an external execution block
identifier, which becomes resolved once that execution block is ingested and its science product locator is obtained.
Finally, there is an inductive kind: a capability request, which becomes resolved once that request completes.
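One way to picture the three kinds and their resolution states is the following sketch; the class and field names are
hypothetical, not the actual data model.

```python
# Hypothetical sketch of the three kinds of future products. Field names are
# illustrative; only the three kinds themselves come from the design.
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ScienceProductLocatorFuture:
    """Base case: wraps an existing locator, so it is born resolved."""
    locator: str  # e.g. uid://alma/execblock/...


@dataclass
class ExternalExecutionBlockFuture:
    """Base case: resolves when the execution block is ingested."""
    execution_block_id: str
    locator: Optional[str] = None  # filled in at ingestion time


@dataclass
class CapabilityRequestFuture:
    """Inductive case: resolves when the wrapped capability request completes."""
    capability_request_id: str
    result_locator: Optional[str] = None  # filled in at completion time


FutureProduct = Union[ScienceProductLocatorFuture,
                      ExternalExecutionBlockFuture,
                      CapabilityRequestFuture]


def is_resolved(product: FutureProduct) -> bool:
    """A future product is resolved once it ultimately points at a real product."""
    if isinstance(product, ScienceProductLocatorFuture):
        return True
    if isinstance(product, ExternalExecutionBlockFuture):
        return product.locator is not None
    return product.result_locator is not None
```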
The capability-request future product is what allows the creation of follow-on requests. Each time a follow-on request
is generated, it is generated against a future product of the original request. If this is done a second time, you get a
follow-on for a follow-on request. In this way you can construct a calibration which leads to some kind of data split,
which leads in a tree-like fashion to a number of imaging requests.
If the original request is for an external execution block identifier, then you have the beginning of an integrated
calibration pipeline system, where the system is expecting an execution block to be ingested, and expecting to calibrate
it once that happens. The user ought then to be able to submit (say) a custom imaging request as a follow-on to the
standard calibration, and not have to do anything reactively as the execblock or calibration become available in the
archive.
## How does it work?
When a future product is defined, it begins life in either the resolved or unresolved state, depending on whether it is
wrapping a product locator (resolved) or one of the other types (unresolved).
A new system exists in the capability service, the future product system, which keeps track of future products. Each
product is tracked by its original definition and its current state of resolution. As new requests enter the system, if
they reuse a definition, they become linked because they share the original definition.
The future product system receives events from the archive as products are ingested, and events from the capability
system as requests reach completion. It matches events from these sources to future products that are awaiting events.
At the moment a future product becomes resolved, an event is generated by the future product system which is broadcast
to anyone that might be interested. The capability system is generally going to be interested in these events, so when
they come in, the capability service will see if it has any capability requests sitting at the AWAIT PRODUCT stage,
waiting on this future product. If there is an AWAIT PRODUCT for this product, that step is now completed and the
capability step sequence advances to the next step.
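A rough sketch of the matching logic, reusing the hypothetical types from the sketch above; the event shapes and method
names are assumptions.

```python
# Hypothetical sketch of how the future product system matches incoming events
# to unresolved future products and broadcasts resolutions.
class FutureProductSystem:
    def __init__(self):
        self.pending = []    # unresolved future products being tracked
        self.listeners = []  # e.g. the capability system's AWAIT PRODUCT handling

    def on_ingestion(self, execution_block_id: str, locator: str) -> None:
        """Archive event: an execution block was ingested."""
        for product in self.pending:
            if (isinstance(product, ExternalExecutionBlockFuture)
                    and product.execution_block_id == execution_block_id):
                product.locator = locator
                self._announce(product)

    def on_request_complete(self, capability_request_id: str, result_locator: str) -> None:
        """Capability event: a capability request reached completion."""
        for product in self.pending:
            if (isinstance(product, CapabilityRequestFuture)
                    and product.capability_request_id == capability_request_id):
                product.result_locator = result_locator
                self._announce(product)

    def _announce(self, product) -> None:
        # Broadcast the resolution; the capability system checks for requests
        # sitting at AWAIT PRODUCT on this product and advances them.
        for listener in self.listeners:
            listener(product)
```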
## Examples
### Existing product
1. A capability request is made using a science product locator. The future product wraps the science product locator and it starts life resolved.
2. The AWAIT PRODUCT step sees that the product is already resolved, and the engine advances to the
next step in the sequence immediately.
### Follow-on request
1. A capability request is made as a follow-on to a previous request. The future product is a capability-request
future product; it is defined by wrapping the capability request identifier and begins life unresolved.
2. The AWAIT PRODUCT step notices that the future product is unresolved and leaves the request in a dormant state.
3. When the previous request completes, the future product system receives the event from the capability system that
a capability request completed.
4. The future product system matches the request-complete event to the future product that was awaiting it. The
current-state-of-resolution is updated to a reference to that capability request's result. There are no more resolution
steps required, so the future product is now resolved.
5. The future product system emits an event that this future product is resolved and here is the resolution.
6. The capability system receives that event and matches it to the request in the AWAIT PRODUCT state. The step is
now complete and the engine advances the request to the next step.
### External execution block identifier
1. A capability request is made against an external execution block identifier. The future product is an
external-execution-block-id future product; it begins life unresolved.
2. The AWAIT PRODUCT step notices that it is unresolved and leaves the request in a dormant state.
3. When the execution block is ingested, an event enters the future product system with the science product locator.
The future product is located and the current-state-of-resolution is advanced to this science product locator,
and the product becomes resolved.
4. The future product system emits an event with this resolution.
5. The capability system receives the event and matches it to the request in the AWAIT PRODUCT state. The step is now
complete and the engine advances the request to the next step.