Add ARCHITECTURE.md for future products

Merged Daniel Lyons requested to merge documenting-future-products into main
# Delivery Architecture
What is delivery? Delivery is what happens after the active processing portion of the workflow concludes. It is the
step that moves the retrieved or generated products from the processing area to a place where they can be accessed by
the requesting user.
Most workflows proceed by retrieving some files from NGAS and running CASA on those files to produce new products. The
files are large and CASA is quite heavy, so we retrieve the files into a spool area on the Lustre filesystem and then
launch the CASA jobs on the cluster. Once CASA is finished, the files the user wants are still sitting in that spool
area on Lustre. Delivery is what gets the files from there to where the user can retrieve them.
The simplest kind of delivery is just copying files from the spool area to another location—a mere `cp`. However, we
have several complications:
- CASA mandates a certain filesystem layout for the spool area
- The filesystem layout of the delivery destination varies based on the _type_ of the product
- Users can optionally request `tar` archives
- Users can request delivery to their own areas in Lustre
- Not specifying a delivery location implies creating a unique location under a web root
We also want to be somewhat flexible in case new streaming kinds of deliveries are mandated in the future, such as
Globus (formerly GridFTP).
The result is that the behavior of the delivery process, which is fundamentally `cp`, varies according to both the
options given by the user and various facts about the data we happen to be delivering.
## Handling files
At the bottom of every delivery is a simple process: being handed files and told to deliver them. The
_Destination_ system is the core of this portion of the process. The goal here is to decouple the idea of "here is a
file to deliver" from the details of how that delivery happens. We have one concrete class here, `LocalDestination`,
which represents the common `cp` case of copying a file into the destination. If the simplest delivery
is `cp source dest`, you can think of `LocalDestination` as embodying the idea of `cp ... dest`.
The _Destination_ classes make no sense on their own; their purpose is to be passed around to other objects in the
system that know about files that need to be delivered. The _Destination_ classes just hide the details about where
those files are actually going and how they're getting there.
If we were going to support something like Globus, I expect it would appear as a peer of `LocalDestination`, as another
concrete implementation of `Destination`.
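
To make this concrete, here is a minimal sketch of what `Destination` and `LocalDestination` might look like. The
`add_file` method name and signature are assumptions for illustration; only the `close()` method is mentioned later in
this document.

```python
import pathlib
import shutil
from abc import ABC, abstractmethod


class Destination(ABC):
    """Hides where delivered files actually go and how they get there."""

    @abstractmethod
    def add_file(self, file: pathlib.Path, relative_path: str):
        """Deliver one file, placing it at relative_path under the destination."""

    @abstractmethod
    def close(self):
        """Signal that delivery is finished, so wrappers can finalize."""


class LocalDestination(Destination):
    """The common case: the `cp ... dest` half of `cp source dest`."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def add_file(self, file: pathlib.Path, relative_path: str):
        target = self.path / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(file, target)

    def close(self):
        pass  # a plain copy has nothing to finalize
```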
### Checksums and compression
Thinking along these lines, we can treat checksums as the construction of another file to be added to the
destination. Because `Destination` is 1) handed every file to be delivered and 2) knows where the files are ultimately
going to be placed, we can handle creating a checksum file as a kind of "pass-through" step that happens
automatically. The algorithm would look something like this (a sketch in code follows the list):
1. Make a checksum wrapper for the local destination
2. For every file we get asked to deliver, calculate its checksum before handing it off to the wrapped destination for
delivery
3. After we are done delivering files, pass a fake file containing the checksums to the wrapped destination
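
Assuming the hypothetical `add_file` interface sketched above, the checksum wrapper might look like this; MD5 and the
`MD5SUMS` filename are illustrative choices, not requirements:

```python
import hashlib
import pathlib
import tempfile


class ChecksumDecorator(Destination):
    """Wraps another Destination, accumulating checksums as files pass through."""

    def __init__(self, underneath: Destination):
        self.underneath = underneath
        self.sums = []

    def add_file(self, file: pathlib.Path, relative_path: str):
        # a real implementation would read in chunks rather than all at once
        digest = hashlib.md5(file.read_bytes()).hexdigest()
        self.sums.append(f"{digest}  {relative_path}")
        self.underneath.add_file(file, relative_path)

    def close(self):
        # the "fake file" of step 3: write the accumulated checksums to a
        # temporary file and pass it through like any other delivered file
        with tempfile.NamedTemporaryFile("w", suffix=".md5", delete=False) as listing:
            listing.write("\n".join(self.sums) + "\n")
        self.underneath.add_file(pathlib.Path(listing.name), "MD5SUMS")
        self.underneath.close()
```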
This kind of "wrapper" or "pass-through" thing happens often enough in object-oriented programming that it is called
the "Decorator pattern." We can handle compression the same way:
1. Make a tar archive in a scratch area somewhere
2. For every file we get asked to deliver, instead place it in the archive in the scratch area
3. After we are done delivering files, finalize the archive and pass it to the wrapped destination
The key idea here is that the next part of the system which finds files to deliver has _no idea_ about whether we are
using compression or calculating checksums or not—in fact, these wrappers are stackable. The part of the system that
finds files to deliver just passes them to the destination, and as long as the stack of wrappers and destinations has
been constructed by someone in the right order, everything will happen as it should.
The purpose of the `DestinationBuilder` is to ensure that the stack is constructed in the right way. The reason
`Destination` has a `close()` method is for these wrappers to know when we are done delivering files so they can take
their finalization steps.
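
Building on the classes sketched above, here is one way the tar wrapper and the `DestinationBuilder` might fit
together. The `Context` attributes, the temporary-file service, and the nesting order are all assumptions for
illustration:

```python
import pathlib
import tarfile


class TarDecorator(Destination):
    """Wraps another Destination; files go into a tar archive instead, and the
    finished archive is handed to the wrapped destination at close()."""

    def __init__(self, context, underneath: Destination):
        self.underneath = underneath
        # assumes a temporary-file service on the Context, described later
        self.archive_path = context.create_tempfile(suffix=".tar")
        self.archive = tarfile.open(self.archive_path, "w")

    def add_file(self, file: pathlib.Path, relative_path: str):
        self.archive.add(file, arcname=relative_path)

    def close(self):
        self.archive.close()
        # a real implementation would pick a meaningful archive name
        self.underneath.add_file(self.archive_path, self.archive_path.name)
        self.underneath.close()


class DestinationBuilder:
    """Ensures the stack of wrappers and destinations is built in the right order."""

    def __init__(self, context):
        self.context = context

    def build(self) -> Destination:
        # innermost layer: the plain local copy
        destination = LocalDestination(self.context.delivery_path)
        destination = ChecksumDecorator(destination)
        if self.context.tar_requested:
            # with tar outermost, files land in the archive first and the
            # finished archive is then checksummed and copied locally
            destination = TarDecorator(self.context, destination)
        return destination
```

Whether the checksum covers the individual files or the finished archive is purely a question of which wrapper is
stacked outside the other, which is exactly why the construction order is centralized in the builder.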
## Handling products
If you look at
the [delivery directory requirements](https://open-confluence.nrao.edu/display/SPR/Delivery+Directory+Improvements),
you'll see that there are a number of requirements to group things together based on their project or their telescope,
and the directory names are based on the type of product. Knowing what you have in hand affects the layout in the
delivery directory. This means that we are not always going to have a straightforward `cp` command, because the way
files rest in the spool area doesn't necessarily match the way that they need to be laid out in the delivery directory.
The key idea here is that somebody, eventually, knows what each product _is_, and the knowledge about how that _type_
is delivered should live with that _type_, rather than being spread around the system. Execution blocks should know what
execution blocks are supposed to look like when they get delivered; images should know what images should look like when
they are delivered, and so forth. If a new type of product is invented, supporting a wacky delivery format for that
product should be a matter of defining that product type and adding the logic just to that product. This is why we have
a `SpooledProduct` with a single method: `deliver_to(Destination)`. We expect to have a driver that at some level is
passing a destination to each of these products and saying, "write yourself to this destination."
This suggests that when we say "deliver from here to there," we are not saying the same thing as `cp`, which says
"copy these files from here to there." We are actually saying "copy all the products from here to there, according to
how each of those products _should_ be copied." In the beginning, a simple product like an execution block _will_
simply deliver the files in its directory directly, but as we support more complex products like OUS requests with
images, more interesting things will happen.
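
A minimal sketch of this arrangement, with `deliver_to(Destination)` as named above; `ExecutionBlock`'s file-walking
logic is an assumption standing in for the simple case:

```python
import pathlib
from abc import ABC, abstractmethod


class SpooledProduct(ABC):
    """A product sitting in the spool area that knows how to deliver itself."""

    @abstractmethod
    def deliver_to(self, destination: Destination):
        """Write this product to the destination, laid out as this type requires."""


class ExecutionBlock(SpooledProduct):
    """The simple case: deliver every file under the product's directory as-is."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def deliver_to(self, destination: Destination):
        for file in self.path.rglob("*"):
            if file.is_file():
                destination.add_file(file, str(file.relative_to(self.path)))
```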
## Finding products
How will we know what the products are that need to be delivered? We can assume we are given a source directory with
products in it, but how do we enumerate them in order to deliver them? The most straightforward answer is we can simply
iterate the entire directory and match filename patterns with product types; if it ends with `.ms` it's a measurement
set, if it looks like `PPR.xml` it's a pipeline request, etc. Doing this amounts to having a dispatch table of common
filename patterns, which is tedious, but exhaustive and gives our code a fair amount of control.
There is a second way to figure out the products, which is by examining CASA's `piperesults` output file. This file
isn't necessarily present (after all, CASA is not _required_ for every workflow) so this method cannot ever be the
_only_ means of determining the products. But it may eventually be a requirement that we support using the
`piperesults` file. So rather than having a single class here called `ProductFinder`, we instead have an interface
called `ProductFinder` and a `HeuristicProductFinder` that does the filename dispatch approach and a
`PiperesultsProductFinder` that uses the `piperesults` file to figure it out.
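
Sketched out, building on the classes above; `MeasurementSet` and `PipelineRequest` are hypothetical product types
invented for illustration:

```python
import pathlib
from abc import ABC, abstractmethod
from typing import Iterator


class ProductFinder(ABC):
    """Knows how to enumerate the products waiting in a spool directory."""

    @abstractmethod
    def find_products(self) -> Iterator[SpooledProduct]:
        """Yield each product found in the spool area."""


class MeasurementSet(SpooledProduct):
    """Hypothetical product type for a CASA measurement set (.ms)."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def deliver_to(self, destination: Destination):
        ...  # lay out the measurement set as its type requires


class PipelineRequest(SpooledProduct):
    """Hypothetical product type for a pipeline request (PPR.xml)."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def deliver_to(self, destination: Destination):
        ...  # deliver the PPR.xml file


class HeuristicProductFinder(ProductFinder):
    """The filename-dispatch approach: match patterns against product types."""

    def __init__(self, spool_dir: pathlib.Path):
        self.spool_dir = spool_dir

    def find_products(self) -> Iterator[SpooledProduct]:
        # the dispatch table of common filename patterns: tedious, but
        # exhaustive, and it keeps our code in control
        for path in self.spool_dir.iterdir():
            if path.name.endswith(".ms"):
                yield MeasurementSet(path)
            elif path.name == "PPR.xml":
                yield PipelineRequest(path)
            # ... and so on for the other known patterns
```

A `PiperesultsProductFinder` would implement the same interface but derive the product list by parsing the
`piperesults` file instead of guessing from filenames.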
## Bringing it all together
So we have a system that finds products, products that know how to write themselves to a destination, and
destinations that know how to handle local filesystem writes, compression and checksumming. This is most of what is
needed. We can see now that we want to have a main loop that looks like this:
```python
for product in finder.find_products():
    product.deliver_to(destination)
```
What is still missing is a small amount of plumbing to get us from here to there. We need a device for processing
the command line arguments. Some aspects of delivery are based on user-supplied options: whether we are doing tar
archives or not, whether we are delivering the raw data retrieved by the data fetcher or the products generated by
CASA. Eventually we will have to support a local delivery command line option. Basically, anything the user chooses
in the archive UI that affects delivery is going to reach us through the command line options. So we have to add
a command line parser, which we have in `Context`.
A few lessons learned from the legacy delivery system are also captured in the `Context`. We assume that a few
"services" are available through the `Context` to the `Destination` and `ProductFinder` implementations. For web
delivery, we will eventually need to be able to generate random codes for the URL, but we want those random codes to
be stable throughout the delivery process, so there is a way to do that in the `Context`. Creating temporary files is
also provided via the `Context`, which is something the tar and checksum wrappers will eventually need. So the
`Context` is available to these classes at construction time so they can call these services as needed, or peek at
command line arguments they may care about.
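
As a sketch, a `Context` along these lines might look like the following; every option, attribute, and method name
here is an assumption for illustration:

```python
import argparse
import pathlib
import secrets
import tempfile


class Context:
    """Parses the command line and offers shared services during delivery."""

    def __init__(self, argv):
        parser = argparse.ArgumentParser(description="deliver spooled products")
        parser.add_argument("-t", "--tar", action="store_true",
                            help="deliver products as a tar archive")
        parser.add_argument("-r", "--raw", action="store_true",
                            help="deliver the fetched data, not the CASA products")
        parser.add_argument("-l", "--local-destination", type=pathlib.Path,
                            help="deliver to this directory instead of the web root")
        self.args = parser.parse_args(argv)
        self._token = None

    @property
    def tar_requested(self) -> bool:
        return self.args.tar

    @property
    def delivery_path(self) -> pathlib.Path:
        # the user's own area in Lustre, or a unique location under a web root
        if self.args.local_destination is not None:
            return self.args.local_destination
        return pathlib.Path("/web/root") / self.token()  # hypothetical web root

    def token(self) -> str:
        # the random code for the web delivery URL; generated once so it
        # stays stable throughout the delivery process
        if self._token is None:
            self._token = secrets.token_hex(16)
        return self._token

    def create_tempfile(self, suffix: str = "") -> pathlib.Path:
        # temporary files for the tar and checksum wrappers
        handle = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
        handle.close()
        return pathlib.Path(handle.name)
```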
And that's the theory behind delivery in a nutshell.