Delivery system for the Workspaces project

This is the delivery subsystem: the part of Workspaces that moves finished products out of the processing area and into the hands of the requesting user.

Theory

What is delivery? Delivery is what happens after the active processing portion of the workflow concludes. It is the step that moves the retrieved or generated products from the processing area to a place where they can be accessed by the requesting user.

Most workflows proceed by retrieving some files from NGAS and running CASA on those files to produce new products. The files are large and CASA is quite heavy, so we retrieve the files into a spool area on the Lustre filesystem and then launch the CASA jobs on the cluster. Once CASA is finished, the files the user wants are still sitting in that spool area on Lustre. Delivery is what gets the files from there to where the user can retrieve them.

The simplest kind of delivery is just copying files from the spool area to another location—a mere cp. However, we have several complications:

  • CASA mandates a certain filesystem layout for the spool area
  • The filesystem layout of the delivery destination varies based on the type of the product
  • Users can optionally request tar archives
  • Users can request delivery to their own areas in Lustre
  • Not specifying a delivery location implies creating a unique location under a web root

We also want to be somewhat flexible in case new streaming kinds of deliveries are mandated in the future, such as Globus (formerly GridFTP).

The result is that the behavior of the delivery process, which is fundamentally cp, varies according to both the options given by the user and various facts about the data we happen to be delivering.

Handling files

At the bottom of every delivery is the same basic task: being handed files and told to deliver them. The Destination system is the core of this portion of the process. The goal here is to decouple the idea of "here is a file to deliver" from the details of how that delivery happens. We have one concrete class here, LocalDestination, which represents the common cp case of copying a file into the destination. If the simplest delivery is cp source dest, you can think of LocalDestination as embodying the idea of cp ... dest.
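A minimal sketch of what that split might look like (the add_file signature and the constructor are illustrative assumptions; only the Destination and LocalDestination names and the close() method come from this README):

```python
import os
import shutil
from abc import ABC, abstractmethod


class Destination(ABC):
    """Somewhere files can be delivered; hides where they go and how they get there."""

    @abstractmethod
    def add_file(self, path: str, relative_path: str) -> None:
        """Deliver the file at path, placing it at relative_path under the destination."""

    @abstractmethod
    def close(self) -> None:
        """Signal that no more files are coming, so wrappers can finalize."""


class LocalDestination(Destination):
    """The plain cp case: copy each file into a local directory."""

    def __init__(self, directory: str):
        self.directory = directory

    def add_file(self, path: str, relative_path: str) -> None:
        target = os.path.join(self.directory, relative_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy(path, target)

    def close(self) -> None:
        pass  # a plain copy has nothing to finalize
```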

The Destination classes make no sense on their own; their purpose is to be passed around to other objects in the system that know about files that need to be delivered. The Destination classes just hide the details about where those files are actually going and how they're getting there.

If we were going to support something like Globus, I expect it would appear as a peer of LocalDestination, as another concrete implementation of Destination.

Checksums and compression

Thinking along these lines, one can treat a checksum manifest as just another file to be added to the destination. Since the Destination is 1) handed every file to be delivered, and 2) knows where the files are ultimately going to be placed, we can see a way to handle creating a checksum file as a kind of "pass-through" step that happens automatically. The algorithm would look something like this:

  1. Make a checksum wrapper for the local destination
  2. For every file we get asked to deliver, calculate its checksum before handing it off to the wrapped destination for delivery
  3. After we are done delivering files, pass a fake file containing the checksums to the wrapped destination
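A sketch of such a wrapper, building on the Destination sketch above (ChecksumDecorator is a hypothetical name, and MD5 and the manifest format are illustrative choices, not necessarily what the project uses):

```python
import hashlib
import os
import tempfile


class ChecksumDecorator(Destination):
    """Wraps another Destination and delivers an md5sum-style manifest at the end."""

    def __init__(self, underlying: Destination):
        self.underlying = underlying
        self.sums = []  # (hex digest, relative path) pairs

    def add_file(self, path: str, relative_path: str) -> None:
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        self.sums.append((digest.hexdigest(), relative_path))
        self.underlying.add_file(path, relative_path)  # pass the file through untouched

    def close(self) -> None:
        # the "fake file" of checksums, handed to the wrapped destination last
        with tempfile.NamedTemporaryFile("w", suffix=".md5", delete=False) as manifest:
            for digest, name in self.sums:
                manifest.write(f"{digest}  {name}\n")
        self.underlying.add_file(manifest.name, "MD5SUMS")
        os.unlink(manifest.name)
        self.underlying.close()
```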

This kind of wrapper, or pass-through, arrangement happens often enough in object-oriented programming that it has a name: the Decorator pattern. We can handle compression the same way:

  1. Make a tar archive in a scratch area somewhere
  2. For every file we get asked to deliver, instead place it in the archive in the scratch area
  3. After we are done delivering files, finalize the archive and pass it to the wrapped destination
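The compression wrapper follows the same shape (again a hypothetical name, TarDecorator, and an assumed archive name; only the wrapping idea comes from this README):

```python
import os
import tarfile
import tempfile


class TarDecorator(Destination):
    """Collects delivered files into a tar archive, then delivers the archive itself."""

    def __init__(self, underlying: Destination, archive_name: str = "delivery.tar.gz"):
        self.underlying = underlying
        self.archive_name = archive_name
        # the scratch area where the archive is built up
        self.scratch = tempfile.NamedTemporaryFile(suffix=".tar.gz", delete=False)
        self.archive = tarfile.open(fileobj=self.scratch, mode="w:gz")

    def add_file(self, path: str, relative_path: str) -> None:
        # instead of delivering the file, stash it in the archive
        self.archive.add(path, arcname=relative_path)

    def close(self) -> None:
        # finalize the archive and hand it to the wrapped destination as a single file
        self.archive.close()
        self.scratch.close()
        self.underlying.add_file(self.scratch.name, self.archive_name)
        os.unlink(self.scratch.name)
        self.underlying.close()
```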

The key idea here is that the part of the system that finds files to deliver has no idea whether we are compressing, calculating checksums, both, or neither; in fact, these wrappers are stackable. It just passes files to the destination, and as long as the stack of wrappers and destinations has been constructed by someone in the right order, everything will happen as it should.

The purpose of the DestinationBuilder is to ensure that the stack is constructed in the right way. The reason Destination has a close() method is for these wrappers to know when we are done delivering files so they can take their finalization steps.
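A sketch of how that construction might look, building on the sketches above (the builder methods and the context attribute shown here are assumptions; only the DestinationBuilder name comes from this README):

```python
class DestinationBuilder:
    """Stacks wrappers around a LocalDestination in the right order."""

    def __init__(self, context):
        self.context = context
        # innermost layer: the actual local copy
        self.destination = LocalDestination(context.destination_dir)

    def with_checksums(self):
        self.destination = ChecksumDecorator(self.destination)
        return self

    def with_tar(self):
        self.destination = TarDecorator(self.destination)
        return self

    def build(self) -> Destination:
        return self.destination


# e.g. deliver a tar archive plus a checksum manifest for it:
# destination = DestinationBuilder(context).with_checksums().with_tar().build()
```

Note that the order of the calls matters: with_checksums() followed by with_tar() checksums the finished archive, while the reverse order puts per-file checksums inside the archive.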

Handling products

If you look at the delivery directory requirements, you'll see a number of rules for grouping things together based on their project or their telescope, and directory names that depend on the type of product. Knowing what you have in hand affects the layout in the delivery directory. This means that we are not always going to have a straightforward cp command, because the way files rest in the spool area doesn't necessarily match the way that they need to be laid out in the delivery directory.

The key idea here is that somebody, eventually, knows what these products are, and the knowledge about how each type is delivered should live with that type, rather than being spread around the system. Execution blocks should know what execution blocks are supposed to look like when they get delivered; images should know what images should look like when they are delivered, and so forth. If a new type of product is invented, supporting a wacky delivery format for that product should be a matter of defining that product type and adding the logic just to that product. This is why we have a SpooledProduct with a single method: deliver_to(Destination). We expect to have a driver that at some level is passing a destination to each of these products and saying, "write yourself to this destination."
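A sketch of the shape this takes, under the assumption that an execution block simply delivers every file under its directory (SpooledProduct and deliver_to(Destination) come from this README; ExecutionBlock and the directory walk are illustrative):

```python
import os
from abc import ABC, abstractmethod


class SpooledProduct(ABC):
    """Something in the spool area that knows how to deliver itself."""

    @abstractmethod
    def deliver_to(self, destination: Destination) -> None:
        """Write this product to the destination, laid out however this type requires."""


class ExecutionBlock(SpooledProduct):
    """The simple case: hand over every file under the product's directory as-is."""

    def __init__(self, path: str):
        self.path = path

    def deliver_to(self, destination: Destination) -> None:
        for dirpath, _dirnames, filenames in os.walk(self.path):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # keep the execution block's directory name in the delivered layout
                relative = os.path.relpath(full, os.path.dirname(self.path))
                destination.add_file(full, relative)
```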

This suggests that "deliver from here to there" does not mean quite the same thing as cp. cp says "copy these files from here to there"; we are saying "copy all the products from here to there, each according to how that kind of product should be copied." In the beginning, a simple product like an execution block will simply deliver the files in its directory directly, but as we support more complex products like OUS requests with images, more interesting things will happen.

Finding products

How will we know what the products are that need to be delivered? We can assume we are given a source directory with products in it, but how do we enumerate them in order to deliver them? The most straightforward answer is that we can simply iterate over the entire directory and match filename patterns to product types: if it ends with .ms it's a measurement set, if it looks like PPR.xml it's a pipeline request, and so on. Doing this amounts to having a dispatch table of common filename patterns, which is tedious but exhaustive, and it gives our code a fair amount of control.

There is a second way to figure out the products, which is by examining CASA's piperesults output file. This file isn't necessarily present (after all, CASA is not required for every workflow), so this method can never be the only means of determining the products. But it may eventually be a requirement that we support using the piperesults file. So rather than a single class here called ProductFinder, we have an interface called ProductFinder, a HeuristicProductFinder that takes the filename-dispatch approach, and a PiperesultsProductFinder that uses the piperesults file to figure it out.
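The heuristic side might look roughly like this, building on the sketches above (ProductFinder and HeuristicProductFinder are named in this README; the find_products() signature, the dispatch table contents, and reusing ExecutionBlock as the matched type are assumptions for illustration):

```python
import os
import re
from abc import ABC, abstractmethod
from typing import Iterator


class ProductFinder(ABC):
    """Knows how to enumerate the products sitting in a source directory."""

    @abstractmethod
    def find_products(self) -> Iterator[SpooledProduct]:
        ...


class HeuristicProductFinder(ProductFinder):
    """Walks the source directory and matches filename patterns to product types."""

    # pattern -> product type; extend this table as new product types appear
    DISPATCH = [
        (re.compile(r".*\.ms$"), ExecutionBlock),       # measurement set
        # (re.compile(r"^PPR\.xml$"), PipelineRequest), # pipeline request, and so on
    ]

    def __init__(self, source_dir: str):
        self.source_dir = source_dir

    def find_products(self) -> Iterator[SpooledProduct]:
        for entry in os.listdir(self.source_dir):
            for pattern, product_type in self.DISPATCH:
                if pattern.match(entry):
                    yield product_type(os.path.join(self.source_dir, entry))
                    break
```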

Bringing it all together

So we have a system that finds products, products that know how to write themselves to a destination, and destinations that know how to handle local filesystem writes, compression and checksumming. This is most of what is needed. We can see now that we want to have a main loop that looks like this:

for product in finder.find_products():
  product.deliver_to(destination)

What is still missing is a small amount of plumbing to get us from here to there. We need a device for processing the command line arguments. Some aspects of delivery are based on user-supplied options: whether or not to produce tar archives, whether we are delivering the raw data retrieved by the data fetcher or the products generated by CASA. Eventually we will have to support a local delivery command line option. Basically, anything the user chooses in the archive UI that affects delivery is going to reach us through the command line options. So we have to add a command line parser, which we have in Context.

A few lessons learned from the legacy delivery system are also captured in the Context. We assume that a few "services" are available through the Context to the Destination and ProductFinder classes. For web delivery, we will eventually need to be able to generate random codes for the URL, but we want those random codes to be stable throughout the delivery process, so there is a way to do that in the Context. Creating temporary files is also provided via the Context, which is something the tar and checksum wrappers will eventually need. So the Context is available to these classes at construction time so they can call these services as needed, or peek at command line arguments they may care about.
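A rough sketch of that role (every flag, attribute, and method name here is an assumption; the README only establishes that Context parses the command line, hands out a stable random code, and creates temporary files):

```python
import argparse
import secrets
import tempfile


class Context:
    """Command line options plus the small services the other classes need."""

    def __init__(self, argv=None):
        parser = argparse.ArgumentParser(description="Workspaces delivery")
        parser.add_argument("source", help="spool directory containing the products")
        parser.add_argument("-t", "--tar", action="store_true",
                            help="deliver a tar archive instead of loose files")
        parser.add_argument("-l", "--local-destination",
                            help="deliver to this directory instead of the web root")
        parser.add_argument("-r", "--raw", action="store_true",
                            help="deliver fetched raw data rather than CASA products")
        self.args = parser.parse_args(argv)
        self._token = None

    def delivery_token(self) -> str:
        # random code for the web delivery URL; generated once so it stays
        # stable for the whole delivery process
        if self._token is None:
            self._token = secrets.token_hex(16)
        return self._token

    def create_tempfile(self, suffix: str = ""):
        # temporary-file service the tar and checksum wrappers can use
        return tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
```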

And that's the theory behind delivery in a nutshell.