diff --git a/apps/cli/executables/delivery/ARCHITECTURE.md b/apps/cli/executables/delivery/ARCHITECTURE.md
new file mode 100644
index 0000000000000000000000000000000000000000..8de4ecbae4f480420925b6d719c0f343d6789ae4
--- /dev/null
+++ b/apps/cli/executables/delivery/ARCHITECTURE.md
@@ -0,0 +1,133 @@

# Delivery Architecture

What is delivery? Delivery is what happens after the active processing portion of the workflow concludes. It is the
step that moves the retrieved or generated products from the processing area to a place where they can be accessed by
the requesting user.

Most workflows proceed by retrieving some files from NGAS and running CASA on those files to produce new products. The
files are large and CASA is quite heavy, so we retrieve the files into a spool area on the Lustre filesystem and then
launch the CASA jobs on the cluster. Once CASA is finished, the files the user wants are still sitting in that spool
area on Lustre. Delivery is what gets the files from there to where the user can retrieve them.

The simplest kind of delivery is just copying files from the spool area to another location—a mere `cp`. However, we
have several complications:

- CASA mandates a certain filesystem layout for the spool area
- The filesystem layout of the delivery destination varies based on the _type_ of the product
- Users can optionally request `tar` archives
- Users can request delivery to their own areas in Lustre
- Not specifying a delivery location implies creating a unique location under a web root

We also want to be somewhat flexible in case new streaming kinds of deliveries are mandated in the future, such as
Globus (formerly GridFTP).

The result is that the behavior of the delivery process, which is fundamentally `cp`, varies according to both the
options given by the user and various facts about the data we happen to be delivering.

## Handling files

At the bottom of every delivery is a process of being handed files and told to deliver them. The _Destination_ system
is the core of this portion of the process. The goal here is to decouple the idea of "here is a file to deliver" from
the details of how that delivery happens. We have one concrete class here, `LocalDestination`, which represents the
common `cp` case of copying a file into the destination. If the simplest delivery is `cp source dest`, you can think
of `LocalDestination` as embodying the idea of `cp ... dest`.

The _Destination_ classes make no sense on their own; their purpose is to be passed around to other objects in the
system that know about files that need to be delivered. The _Destination_ classes just hide the details about where
those files are actually going and how they're getting there.

If we were going to support something like Globus, I expect it would appear as a peer of `LocalDestination`, as another
concrete implementation of `Destination`.

### Checksums and compression

Thinking along these lines, one can treat a checksum file as just another file to be added to the destination. In
fact, because `Destination` is 1) handed every file to be delivered, and 2) knows where the files are ultimately going
to be placed, we can see a way to create the checksum file as a kind of "pass-through" step that happens
automatically. The algorithm would look something like this:

1. Make a checksum wrapper for the local destination
2. For every file we get asked to deliver, calculate its checksum before handing it off to the wrapped destination for
   delivery
3. After we are done delivering files, pass a fake file containing the checksums to the wrapped destination

This kind of "wrapper" or "pass-through" thing happens often enough in object-oriented programming that it is called
the "Decorator pattern." We can handle compression the same way:

1. Make a tar archive in a scratch area somewhere
2. For every file we get asked to deliver, place it in the archive in the scratch area instead
3. After we are done delivering files, finalize the archive and pass it to the wrapped destination

The key idea here is that the next part of the system, which finds files to deliver, has _no idea_ whether we are
using compression or calculating checksums—in fact, these wrappers are stackable. The part of the system that finds
files to deliver just passes them to the destination, and as long as the stack of wrappers and destinations has been
constructed by someone in the right order, everything will happen as it should.

The purpose of the `DestinationBuilder` is to ensure that the stack is constructed in the right way. The reason
`Destination` has a `close()` method is for these wrappers to know when we are done delivering files so they can take
their finalization steps.
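To make the `Destination` idea and the checksum decorator concrete, here is a minimal sketch in Python. The method
names (`add_file`), the constructor signatures, and the `ChecksumDestination` class are assumptions for illustration,
not necessarily the interfaces as implemented in this package:

    # A sketch only: names and signatures here are illustrative assumptions.
    import hashlib
    import pathlib
    import shutil


    class Destination:
        """Something files can be delivered to: the `cp ... dest` half of `cp`."""

        def add_file(self, file: pathlib.Path, relative_path: str):
            raise NotImplementedError

        def close(self):
            """Called once delivery is finished, so wrappers can finalize."""


    class LocalDestination(Destination):
        """The plain-`cp` case: copy each file under a destination directory."""

        def __init__(self, path: pathlib.Path):
            self.path = path

        def add_file(self, file: pathlib.Path, relative_path: str):
            target = self.path / relative_path
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(file, target)


    class ChecksumDestination(Destination):
        """Decorator: checksum every file on its way to the wrapped destination."""

        def __init__(self, underlying: Destination, scratch: pathlib.Path):
            self.underlying = underlying
            self.scratch = scratch  # temp-file area; in practice this comes from Context
            self.sums = []

        def add_file(self, file: pathlib.Path, relative_path: str):
            digest = hashlib.md5(file.read_bytes()).hexdigest()
            self.sums.append(f"{digest}  {relative_path}")
            self.underlying.add_file(file, relative_path)

        def close(self):
            # The "fake file" of checksums passes through like any other file.
            manifest = self.scratch / "md5sums.txt"
            manifest.write_text("\n".join(self.sums) + "\n")
            self.underlying.add_file(manifest, "md5sums.txt")
            self.underlying.close()

Stacking is just nesting: `ChecksumDestination(LocalDestination(dest), scratch)` checksums everything on its way to a
plain local copy, and a tar wrapper would slot into the same stack in the same way.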
## Handling products

If you look at
the [delivery directory requirements](https://open-confluence.nrao.edu/display/SPR/Delivery+Directory+Improvements),
you'll see that there are a number of requirements to group things together based on their project or their telescope,
and the directory names are based on the type of product. Knowing what you have in hand affects the layout in the
delivery directory. This means that we are not always going to have a straightforward `cp` command, because the way
files rest in the spool area doesn't necessarily match the way that they need to be laid out in the delivery directory.

The key idea here is that somebody, eventually, knows what the products _are_, and the knowledge about how each _type_
is delivered should live with that _type_, rather than being spread around the system. Execution blocks should know
what execution blocks are supposed to look like when they get delivered; images should know what images should look
like when they are delivered, and so forth. If a new type of product is invented, supporting a wacky delivery format
for that product should be a matter of defining that product type and adding the logic just to that product. This is
why we have a `SpooledProduct` with a single method: `deliver_to(Destination)`. We expect to have a driver that at
some level is passing a destination to each of these products and saying, "write yourself to this destination."

This suggests that "deliver from here to there" does not mean the same thing as `cp`, which says "copy these files
from here to there"; it means "copy all the products from here to there, according to how each of those products
_should_ be copied." In the beginning, a simple product like an execution block _will_ simply deliver the files in its
directory directly, but as we support more complex products like OUS requests with images, more interesting things
will happen.

## Finding products

How will we know what the products are that need to be delivered? We can assume we are given a source directory with
products in it, but how do we enumerate them in order to deliver them? The most straightforward answer is that we can
simply iterate over the entire directory and match filename patterns with product types: if it ends with `.ms` it's a
measurement set, if it looks like `PPR.xml` it's a pipeline request, etc. Doing this amounts to having a dispatch
table of common filename patterns, which is tedious, but exhaustive, and gives our code a fair amount of control.

There is a second way to figure out the products, which is by examining CASA's `piperesults` output file. This file
isn't necessarily present (after all, CASA is not _required_ for every workflow), so this method cannot ever be the
_only_ means of determining the products. But it may eventually be a requirement that we support using the
`piperesults` file. So rather than having a single class here called `ProductFinder`, we instead have an interface
called `ProductFinder`, a `HeuristicProductFinder` that takes the filename-dispatch approach, and a
`PiperesultsProductFinder` that uses the `piperesults` file to figure it out.
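Here is a sketch of how these two ideas might fit together in Python. The `MeasurementSet` class, the pattern table,
and the exact signatures are assumptions for illustration, not the package's actual definitions:

    # A sketch only: product types and the dispatch table are assumed names.
    import pathlib
    import re


    class SpooledProduct:
        """A product in the spool area that knows how to deliver itself."""

        def __init__(self, path: pathlib.Path):
            self.path = path

        def deliver_to(self, destination):
            raise NotImplementedError


    class MeasurementSet(SpooledProduct):
        def deliver_to(self, destination):
            # The simple case: ship the files exactly as they lie in the spool.
            for file in self.path.rglob("*"):
                if file.is_file():
                    destination.add_file(file, str(file.relative_to(self.path.parent)))


    class HeuristicProductFinder:
        """Finds products by filename pattern, per the dispatch table idea."""

        # Hypothetical dispatch table; the real one would be much longer.
        PATTERNS = [
            (re.compile(r".*\.ms$"), MeasurementSet),
            # (re.compile(r"PPR\.xml$"), PipelineRequest), and so on
        ]

        def __init__(self, source: pathlib.Path):
            self.source = source

        def find_products(self):
            for entry in self.source.iterdir():
                for pattern, product_type in self.PATTERNS:
                    if pattern.match(entry.name):
                        yield product_type(entry)
                        break

A `PiperesultsProductFinder` would present the same `find_products()` generator but read the product list out of the
`piperesults` file instead of guessing from filenames.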
## Bringing it all together

So we have a system that finds products, products that know how to write themselves to a destination, and
destinations that know how to handle local filesystem writes, compression, and checksumming. This is most of what is
needed. We can see now that we want to have a main loop that looks like this:

    for product in finder.find_products():
        product.deliver_to(destination)

What is still missing is a small amount of plumbing to get us from here to there. We need a device for processing
the command line arguments. Some aspects of delivery are based on user-supplied options: whether or not we are doing
tar archives, and whether we are delivering the raw data retrieved by the data fetcher or the products generated by
CASA. Eventually we will have to support a local delivery command line option. Basically, anything the user chooses
in the archive UI that affects delivery is going to reach us through the command line options. So we have to add
a command line parser, which we have in `Context`.

A few lessons learned from the legacy delivery system also live in the `Context`. We assume that a few "services" are
available in the `Context` to the `Destination` and `ProductFinder` schemes. For web delivery, we will eventually need
to be able to generate random codes for the URL, but we want those random codes to be stable throughout the delivery
process, so there is a way to do that in the `Context`. Creating temporary files is also provided via the `Context`,
which is something the tar and checksum wrappers will eventually need. So the `Context` is available to these classes
at construction time so they can call these services as needed, or peek at command line arguments they may care about.
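As a sketch of that plumbing, a top-level driver might wire the earlier pieces together like this. The argument
handling stands in for the real `Context`, and the hand-built stack stands in for `DestinationBuilder`; both are
assumptions, not the actual implementation:

    # A sketch only: stand-ins for Context and DestinationBuilder.
    import pathlib
    import sys


    def main(argv=None):
        args = argv if argv is not None else sys.argv[1:]
        source, dest, scratch = (pathlib.Path(a) for a in args[:3])

        # DestinationBuilder's job: stack the wrappers in the right order, so the
        # checksum decorator sees every file on its way into the local destination.
        destination = ChecksumDestination(LocalDestination(dest), scratch)

        finder = HeuristicProductFinder(source)
        try:
            for product in finder.find_products():
                product.deliver_to(destination)
        finally:
            destination.close()  # lets the wrappers take their finalization steps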
And that's the theory behind delivery in a nutshell.

diff --git a/apps/cli/executables/delivery/README.md b/apps/cli/executables/delivery/README.md
index 338f6f7d934bd464a33a5b140d18d1cd8518a869..0e3bfa6bc6b769c20d104f14f739adffe30eda44
--- a/apps/cli/executables/delivery/README.md
+++ b/apps/cli/executables/delivery/README.md
@@ -8,138 +8,4 @@ This is the delivery thing.
 https://open-confluence.nrao.edu/display/AAT/Proposed+Delivery+Redesign
 https://open-confluence.nrao.edu/display/SPR/Delivery+Directory+Improvements
--->
+-->
\ No newline at end of file

diff --git a/shared/workspaces/workspaces/ARCHITECTURE.md b/shared/workspaces/workspaces/ARCHITECTURE.md
new file mode 100644
index 0000000000000000000000000000000000000000..f6ca0e5f6438467a7c762f53e31585a1a5228e87
--- /dev/null
+++ b/shared/workspaces/workspaces/ARCHITECTURE.md
@@ -0,0 +1,106 @@

# Future Product Architecture

When I request data from the archive, the archive goes and retrieves it. What if I want to retrieve the data, then
calibrate it, then split the data into several pieces and image each piece? Each of those requests is an independent
_capability request_, but how do we arrange for them to occur without making the user wait for the first request to
complete before making the second, and the second before the third?

How can the DAs set up a calibration pipeline, so that whenever a new piece of data is archived, it triggers a
calibration capability? How can a user arrange for their custom processing steps to be executed as soon as their data
is ingested into the archive?
The answer to these questions is the future product system.

## What is a product?

To the archive, a product is something with a _science product locator_ in the archive. Science product locators are
URIs and look like this: `uid://alma/execblock/39113b0e-2f2e-4a73-8c95-c75c18b98ee9`.

Every product locator string contains the instrument that generated the product, the "type" of the product that was
generated (`execblock` and `calibration` are common possibilities), and then a globally unique identifier (GUID). If
you have this locator, you have the master key for the archive for this product and can access it and cause it to be
downloaded.

It is **extremely important** that you not try to parse the product locator. Locators are to be treated as opaque
identifiers by everything except the product locator service in the archive.

## What is a future product?

The future product system is a way of referring to products that already exist as well as products that do not yet
exist. Future products exist in one of two states: resolved or not-yet-resolved. The first "base case" of the future
product type wraps a science product locator and begins life resolved. There is a second "base case": an external
execution block identifier, which is resolved once that execution block is ingested and the science product locator
is obtained. Finally, there is an inductive type: a capability request, which is resolved once the request is
complete.

The capability-request future product is what allows the creation of follow-on requests. Each time a follow-on request
is generated, it is generated against a future product of the original request. If this is done a second time, you get
a follow-on for a follow-on request. In this way you can construct a calibration which leads to some kind of data
split, which leads in a tree-like fashion to a number of imaging requests.

If the original request is for an external execution block identifier, then you have the beginning of an integrated
calibration pipeline system, where the system is expecting an execution block to be ingested, and expecting to
calibrate it once that happens. The user ought then to be able to make (say) a custom imaging request a follow-on to
the standard calibration, and not have to do anything reactively as the execblock or calibration become available in
the archive.
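The three cases can be pictured as a small union of types. This is only a sketch with assumed class and field names;
the real representation lives in the future product system:

    # A sketch only: class and field names here are assumptions.
    from dataclasses import dataclass
    from typing import Optional, Union


    @dataclass
    class ProductLocatorFuture:
        """Base case 1: wraps a science product locator; born resolved."""
        locator: str

        @property
        def resolved(self):
            return True


    @dataclass
    class ExternalExecutionBlockFuture:
        """Base case 2: resolves when the execution block is ingested."""
        execution_block_id: str
        locator: Optional[str] = None  # filled in by the ingestion event

        @property
        def resolved(self):
            return self.locator is not None


    @dataclass
    class CapabilityRequestFuture:
        """Inductive case: resolves when the wrapped request completes."""
        request_id: str
        result: Optional[str] = None  # reference to the request's result

        @property
        def resolved(self):
            return self.result is not None


    FutureProduct = Union[ProductLocatorFuture, ExternalExecutionBlockFuture, CapabilityRequestFuture]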
## How does it work?

When a future product is defined, it begins life in either the resolved or unresolved state, depending on whether it
is wrapping a product locator (resolved) or one of the other types (unresolved).

A new system exists in the capability service, the future product system, which keeps track of future products. Each
product is tracked by its original definition and its current state of resolution. As new requests enter the system,
if they reuse a definition, they become linked because they share the original definition.

The future product system receives events from the archive as products are ingested, and events from the capability
system as requests reach completion. It matches events from these sources to future products that are awaiting events.

At the moment a future product becomes resolved, the future product system generates an event which is broadcast to
anyone that might be interested. The capability system is generally going to be interested in these events, so when
they come in, the capability service will see if it has any capability requests sitting at the AWAIT PRODUCT stage,
waiting on this future product. If there is an AWAIT PRODUCT step for this product, that step is now completed and the
capability step sequence advances to the next step.

## Examples

### Existing product

1. A capability request is made using a science product locator. The future product wraps the science product locator
   and it starts life resolved.

2. The AWAIT PRODUCT step sees that the product is already resolved, and the engine advances to the next step in the
   sequence immediately.

### Follow-on request

1. A capability request is made as a follow-on against the previous request. The future product is a
   capability-request future product; it is defined by wrapping the capability request identifier and begins life
   unresolved.

2. The AWAIT PRODUCT step notices that the future product is unresolved and leaves the request in a dormant state.

3. When the previous request completes, the future product system receives the event from the capability system that
   a capability request completed.

4. The future product system matches the request-complete event to the future product that was awaiting it. The
   current state of resolution is updated to a reference to that capability request's result. There are no more
   resolution steps required, so the future product is now resolved.

5. The future product system emits an event announcing that this future product is resolved and carrying the
   resolution.

6. The capability system receives that event and matches it to the request in the AWAIT PRODUCT state. The step is
   now complete and the engine advances the request to the next step.

### External execution block identifier

1. A capability request is made against an external execution block identifier. The future product is an
   external-execution-block-id future product; it begins life unresolved.

2. The AWAIT PRODUCT step notices that it is unresolved and leaves the request in a dormant state.

3. When the execution block is ingested, an event enters the future product system with the science product locator.
   The future product is located, its current state of resolution is advanced to this science product locator, and
   the product becomes resolved.

4. The future product system emits an event with this resolution.

5. The capability system receives the event and matches it to the request in the AWAIT PRODUCT state. The step is now
   complete and the engine advances the request to the next step.
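All three examples follow the same shape: an event arrives, it is matched to a waiting future product, the product's
state of resolution advances, and a resolution event is broadcast. A minimal sketch of that matching step, with
assumed registries and event shapes rather than the real API:

    # A sketch only: the registries and the broadcast mechanism are assumptions.
    class FutureProductSystem:
        def __init__(self):
            self.awaiting_ingestion = {}   # execution block id -> future product
            self.awaiting_completion = {}  # capability request id -> future product

        def on_product_ingested(self, execution_block_id, locator):
            future = self.awaiting_ingestion.pop(execution_block_id, None)
            if future is not None:
                future.locator = locator  # the future is now resolved
                self.emit_resolved(future)

        def on_request_complete(self, request_id, result):
            future = self.awaiting_completion.pop(request_id, None)
            if future is not None:
                future.result = result  # the future is now resolved
                self.emit_resolved(future)

        def emit_resolved(self, future):
            """Broadcast the resolution so the capability system can advance any
            request sitting at its AWAIT PRODUCT step waiting on this product."""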