Add ARCHITECTURE.md for future products

Merged Daniel Lyons requested to merge documenting-future-products into main
# Delivery Architecture
What is delivery? Delivery is what happens after the active processing portion of the workflow concludes. It is the
step that moves the retrieved or generated products from the processing area to a place where they can be accessed by
the requesting user.
Most workflows proceed by retrieving some files from NGAS and running CASA on those files to produce new products. The
files are large and CASA is quite heavy, so we retrieve the files into a spool area on the Lustre filesystem and then
launch the CASA jobs on the cluster. Once CASA is finished, the files the user wants are still sitting in that spool
area on Lustre. Delivery is what gets the files from there to where the user can retrieve them.
The simplest kind of delivery is just copying files from the spool area to another location—a mere `cp`. However, we
have several complications:
- CASA mandates a certain filesystem layout for the spool area
- The filesystem layout of the delivery destination varies based on the _type_ of the product
- Users can optionally request `tar` archives
- Users can request delivery to their own areas in Lustre
- Not specifying a delivery location implies creating a unique location under a web root
We also want to be somewhat flexible in case new streaming kinds of deliveries are mandated in the future, such as
Globus (formerly GridFTP).
The result is that the behavior of the delivery process, which is fundamentally `cp`, varies according to both the
options given by the user and various facts about the data we happen to be delivering.
## Handling files
At the bottom of every delivery is a simple process: being handed files and told to deliver them. The
_Destination_ system is the core of this portion of the process. The goal here is to decouple the idea of "here is a
file to deliver" from the details of how that delivery happens. We have one concrete class here, `LocalDestination`,
which represents the common `cp` case of copying a file into the destination. If the simplest delivery
is `cp source dest`, you can think of `LocalDestination` as embodying the idea of `cp ... dest`.
The _Destination_ classes make no sense on their own; their purpose is to be passed around to other objects in the
system that know about files that need to be delivered. The _Destination_ classes just hide the details about where
those files are actually going and how they're getting there.
If we were going to support something like Globus, I expect it would appear as a peer of `LocalDestination`, as another
concrete implementation of `Destination`.
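
To make this concrete, here is a minimal sketch of what `Destination` and `LocalDestination` might look like. The
`add_file` method name and signature are assumptions for illustration; only the `close()` method is mentioned later in
this document.

```python
import pathlib
import shutil
from abc import ABC, abstractmethod


class Destination(ABC):
    """Hides where delivered files actually go and how they get there."""

    @abstractmethod
    def add_file(self, file: pathlib.Path, relative_path: str):
        """Deliver one file, placing it at relative_path under the destination."""

    @abstractmethod
    def close(self):
        """Signal that delivery is finished, so wrappers can finalize."""


class LocalDestination(Destination):
    """The common case: the `cp ... dest` half of `cp source dest`."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def add_file(self, file: pathlib.Path, relative_path: str):
        target = self.path / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(file, target)

    def close(self):
        pass  # a plain copy has nothing to finalize
```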
### Checksums and compression
Thinking along these lines, we can treat checksums as the construction of another file to be added to the
destination. Because `Destination` is 1) handed every file to be delivered and 2) knows where the files are ultimately
going to be placed, we can handle creating a checksum file as a kind of "pass-through" step that happens
automatically. The algorithm would look something like this (a sketch in code follows the list):
1. Make a checksum wrapper for the local destination
2. For every file we get asked to deliver, calculate its checksum before handing it off to the wrapped destination for
delivery
3. After we are done delivering files, pass a fake file containing the checksums to the wrapped destination
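
Assuming the hypothetical `add_file` interface sketched above, the checksum wrapper might look like this; MD5 and the
`MD5SUMS` filename are illustrative choices, not requirements:

```python
import hashlib
import pathlib
import tempfile


class ChecksumDecorator(Destination):
    """Wraps another Destination, accumulating checksums as files pass through."""

    def __init__(self, underneath: Destination):
        self.underneath = underneath
        self.sums = []

    def add_file(self, file: pathlib.Path, relative_path: str):
        # a real implementation would read in chunks rather than all at once
        digest = hashlib.md5(file.read_bytes()).hexdigest()
        self.sums.append(f"{digest}  {relative_path}")
        self.underneath.add_file(file, relative_path)

    def close(self):
        # the "fake file" of step 3: write the accumulated checksums to a
        # temporary file and pass it through like any other delivered file
        with tempfile.NamedTemporaryFile("w", suffix=".md5", delete=False) as listing:
            listing.write("\n".join(self.sums) + "\n")
        self.underneath.add_file(pathlib.Path(listing.name), "MD5SUMS")
        self.underneath.close()
```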
This kind of "wrapper" or "pass-through" thing happens often enough in object-oriented programming that it is called
the "Decorator pattern." We can handle compression the same way:
1. Make a tar archive in a scratch area somewhere
2. For every file we get asked to deliver, instead place it in the archive in the scratch area
3. After we are done delivering files, finalize the archive and pass it to the wrapped destination
The key idea here is that the next part of the system which finds files to deliver has _no idea_ about whether we are
using compression or calculating checksums or not—in fact, these wrappers are stackable. The part of the system that
finds files to deliver just passes them to the destination, and as long as the stack of wrappers and destinations has
been constructed by someone in the right order, everything will happen as it should.
The purpose of the `DestinationBuilder` is to ensure that the stack is constructed in the right way. The reason
`Destination` has a `close()` method is for these wrappers to know when we are done delivering files so they can take
their finalization steps.
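
Building on the classes sketched above, here is one way the tar wrapper and the `DestinationBuilder` might fit
together. The `Context` attributes, the temporary-file service, and the nesting order are all assumptions for
illustration:

```python
import pathlib
import tarfile


class TarDecorator(Destination):
    """Wraps another Destination; files go into a tar archive instead, and the
    finished archive is handed to the wrapped destination at close()."""

    def __init__(self, context, underneath: Destination):
        self.underneath = underneath
        # assumes a temporary-file service on the Context, described later
        self.archive_path = context.create_tempfile(suffix=".tar")
        self.archive = tarfile.open(self.archive_path, "w")

    def add_file(self, file: pathlib.Path, relative_path: str):
        self.archive.add(file, arcname=relative_path)

    def close(self):
        self.archive.close()
        # a real implementation would pick a meaningful archive name
        self.underneath.add_file(self.archive_path, self.archive_path.name)
        self.underneath.close()


class DestinationBuilder:
    """Ensures the stack of wrappers and destinations is built in the right order."""

    def __init__(self, context):
        self.context = context

    def build(self) -> Destination:
        # innermost layer: the plain local copy
        destination = LocalDestination(self.context.delivery_path)
        destination = ChecksumDecorator(destination)
        if self.context.tar_requested:
            # with tar outermost, files land in the archive first and the
            # finished archive is then checksummed and copied locally
            destination = TarDecorator(self.context, destination)
        return destination
```

Whether the checksum covers the individual files or the finished archive is purely a question of which wrapper is
stacked outside the other, which is exactly why the construction order is centralized in the builder.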
## Handling products
If you look at
the [delivery directory requirements](https://open-confluence.nrao.edu/display/SPR/Delivery+Directory+Improvements),
you'll see that there are a number of requirements to group things together based on their project or their telescope,
and the directory names are based on the type of product. Knowing what you have in hand affects the layout in the
delivery directory. This means that we are not always going to have a straightforward `cp` command, because the way
files rest in the spool area doesn't necessarily match the way that they need to be laid out in the delivery directory.
The key idea here is that somebody, eventually, knows what each product _is_, and the knowledge about how that _type_
is delivered should live with that _type_, rather than being spread around the system. Execution blocks should know what
execution blocks are supposed to look like when they get delivered; images should know what images should look like when
they are delivered, and so forth. If a new type of product is invented, supporting a wacky delivery format for that
product should be a matter of defining that product type and adding the logic just to that product. This is why we have
a `SpooledProduct` with a single method: `deliver_to(Destination)`. We expect to have a driver that at some level is
passing a destination to each of these products and saying, "write yourself to this destination."
This suggests that when we say "deliver from here to there," we are not saying the same thing as `cp`, which says
"copy these files from here to there." We are actually saying "copy all the products from here to there, according to
how each of those products _should_ be copied." In the beginning, a simple product like an execution block _will_
simply deliver the files in its directory directly, but as we support more complex products like OUS requests with
images, more interesting things will happen.
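
A minimal sketch of this arrangement, with `deliver_to(Destination)` as named above; `ExecutionBlock`'s file-walking
logic is an assumption standing in for the simple case:

```python
import pathlib
from abc import ABC, abstractmethod


class SpooledProduct(ABC):
    """A product sitting in the spool area that knows how to deliver itself."""

    @abstractmethod
    def deliver_to(self, destination: Destination):
        """Write this product to the destination, laid out as this type requires."""


class ExecutionBlock(SpooledProduct):
    """The simple case: deliver every file under the product's directory as-is."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def deliver_to(self, destination: Destination):
        for file in self.path.rglob("*"):
            if file.is_file():
                destination.add_file(file, str(file.relative_to(self.path)))
```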
## Finding products
How will we know what the products are that need to be delivered? We can assume we are given a source directory with
products in it, but how do we enumerate them in order to deliver them? The most straightforward answer is we can simply
iterate the entire directory and match filename patterns with product types; if it ends with `.ms` it's a measurement
set, if it looks like `PPR.xml` it's a pipeline request, etc. Doing this amounts to having a dispatch table of common
filename patterns, which is tedious, but exhaustive and gives our code a fair amount of control.
There is a second way to figure out the products, which is by examining CASA's `piperesults` output file. This file
isn't necessarily present (after all, CASA is not _required_ for every workflow) so this method cannot ever be the
_only_ means of determining the products. But it may eventually be a requirement that we support using the
`piperesults` file. So rather than having a single class here called `ProductFinder`, we instead have an interface
called `ProductFinder` and a `HeuristicProductFinder` that does the filename dispatch approach and a
`PiperesultsProductFinder` that uses the `piperesults` file to figure it out.
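
Sketched out, building on the classes above; `MeasurementSet` and `PipelineRequest` are hypothetical product types
invented for illustration:

```python
import pathlib
from abc import ABC, abstractmethod
from typing import Iterator


class ProductFinder(ABC):
    """Knows how to enumerate the products waiting in a spool directory."""

    @abstractmethod
    def find_products(self) -> Iterator[SpooledProduct]:
        """Yield each product found in the spool area."""


class MeasurementSet(SpooledProduct):
    """Hypothetical product type for a CASA measurement set (.ms)."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def deliver_to(self, destination: Destination):
        ...  # lay out the measurement set as its type requires


class PipelineRequest(SpooledProduct):
    """Hypothetical product type for a pipeline request (PPR.xml)."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def deliver_to(self, destination: Destination):
        ...  # deliver the PPR.xml file


class HeuristicProductFinder(ProductFinder):
    """The filename-dispatch approach: match patterns against product types."""

    def __init__(self, spool_dir: pathlib.Path):
        self.spool_dir = spool_dir

    def find_products(self) -> Iterator[SpooledProduct]:
        # the dispatch table of common filename patterns: tedious, but
        # exhaustive, and it keeps our code in control
        for path in self.spool_dir.iterdir():
            if path.name.endswith(".ms"):
                yield MeasurementSet(path)
            elif path.name == "PPR.xml":
                yield PipelineRequest(path)
            # ... and so on for the other known patterns
```

A `PiperesultsProductFinder` would implement the same interface but derive the product list by parsing the
`piperesults` file instead of guessing from filenames.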
## Bringing it all together
So we have a system that finds products, products that know how to write themselves to a destination, and
destinations that know how to handle local filesystem writes, compression and checksumming. This is most of what is
needed. We can see now that we want to have a main loop that looks like this:
```python
for product in finder.find_products():
    product.deliver_to(destination)
```
What is still missing is a small amount of plumbing to get us from here to there. We need a device for processing
the command line arguments. Some aspects of delivery are based on user-supplied options: whether we are doing tar
archives or not, whether we are delivering the raw data retrieved by the data fetcher or the products generated by
CASA. Eventually we will have to support a local delivery command line option. Basically, anything the user chooses
in the archive UI that affects delivery is going to reach us through the command line options. So we have to add
a command line parser, which we have in `Context`.
A few lessons learned from the legacy delivery system are also captured in the `Context`. We assume that a few
"services" are available through the `Context` to the `Destination` and `ProductFinder` implementations. For web
delivery, we will eventually need to be able to generate random codes for the URL, but we want those random codes to
be stable throughout the delivery process, so there is a way to do that in the `Context`. Creating temporary files is
also provided via the `Context`, which is something the tar and checksum wrappers will eventually need. So the
`Context` is available to these classes at construction time so they can call these services as needed, or peek at
command line arguments they may care about.
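
As a sketch, a `Context` along these lines might look like the following; every option, attribute, and method name
here is an assumption for illustration:

```python
import argparse
import pathlib
import secrets
import tempfile


class Context:
    """Parses the command line and offers shared services during delivery."""

    def __init__(self, argv):
        parser = argparse.ArgumentParser(description="deliver spooled products")
        parser.add_argument("-t", "--tar", action="store_true",
                            help="deliver products as a tar archive")
        parser.add_argument("-r", "--raw", action="store_true",
                            help="deliver the fetched data, not the CASA products")
        parser.add_argument("-l", "--local-destination", type=pathlib.Path,
                            help="deliver to this directory instead of the web root")
        self.args = parser.parse_args(argv)
        self._token = None

    @property
    def tar_requested(self) -> bool:
        return self.args.tar

    @property
    def delivery_path(self) -> pathlib.Path:
        # the user's own area in Lustre, or a unique location under a web root
        if self.args.local_destination is not None:
            return self.args.local_destination
        return pathlib.Path("/web/root") / self.token()  # hypothetical web root

    def token(self) -> str:
        # the random code for the web delivery URL; generated once so it
        # stays stable throughout the delivery process
        if self._token is None:
            self._token = secrets.token_hex(16)
        return self._token

    def create_tempfile(self, suffix: str = "") -> pathlib.Path:
        # temporary files for the tar and checksum wrappers
        handle = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
        handle.close()
        return pathlib.Path(handle.name)
```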
And that's the theory behind delivery in a nutshell.