Commit b3969d54 authored by Daniel Lyons, committed by Daniel Lyons

Add ARCHITECTURE.md for future products

Separate architecture from readme in delivery
# Delivery Architecture
What is delivery? Delivery is what happens after the active processing portion of the workflow concludes. It is the
step that moves the retrieved or generated products from the processing area to a place where they can be accessed by
the requesting user.
Most workflows proceed by retrieving some files from NGAS and running CASA on those files to produce new products. The
files are large and CASA is quite heavy, so we retrieve the files into a spool area on the Lustre filesystem and then
launch the CASA jobs on the cluster. Once CASA is finished, the files the user wants are still sitting in that spool
area on Lustre. Delivery is what gets the files from there to where the user can retrieve them.
The simplest kind of delivery is just copying files from the spool area to another location—a mere `cp`. However, we
have several complications:
- CASA mandates a certain filesystem layout for the spool area
- The filesystem layout of the delivery destination varies based on the _type_ of the product
- Users can optionally request `tar` archives
- Users can request delivery to their own areas in Lustre
- Not specifying a delivery location implies creating a unique location under a web root
We also want to be somewhat flexible in case new streaming kinds of deliveries are mandated in the future, such as
Globus (formerly GridFTP).
The result is that the behavior of the delivery process, which is fundamentally `cp`, varies according to both the
options given by the user and facts about the data we happen to be delivering.
## Handling files
At the bottom of every delivery is the process of being handed files and told to deliver them. The
_Destination_ system is the core of this portion of the process. The goal here is to decouple the idea of "here is a
file to deliver" from the details of how that delivery happens. We have one concrete class here, `LocalDestination`,
which represents the common `cp` case of copying a file into the destination. If the simplest delivery
is `cp source dest`, you can think of `LocalDestination` as embodying the idea of `cp ... dest`.
The _Destination_ classes make no sense on their own; their purpose is to be passed around to other objects in the
system that know about files that need to be delivered. The _Destination_ classes just hide the details about where
those files are actually going and how they're getting there.
If we were going to support something like Globus, I expect it would appear as a peer of `LocalDestination`, as another
concrete implementation of `Destination`.
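To make this concrete, here is a minimal sketch of what the `Destination` interface and `LocalDestination` might look
like. Only the class names and the `close()` method come from the design above; the `add_file` method, its signature,
and the constructor arguments are illustrative assumptions.

```python
# Minimal sketch of the Destination idea. Only Destination, LocalDestination,
# and close() are named in the design; add_file and its signature are assumed.
import pathlib
import shutil
from abc import ABC, abstractmethod


class Destination(ABC):
    """Something that accepts files and puts them wherever 'there' happens to be."""

    @abstractmethod
    def add_file(self, file: pathlib.Path, relative_path: str) -> None:
        """Deliver a single file under the given relative path."""

    @abstractmethod
    def close(self) -> None:
        """Signal that no more files are coming, so finalization can happen."""


class LocalDestination(Destination):
    """The plain `cp ... dest` case: copy each file into a local directory."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def add_file(self, file: pathlib.Path, relative_path: str) -> None:
        target = self.path / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(file, target)

    def close(self) -> None:
        pass  # nothing to finalize for a plain copy
```

A Globus destination would slot in as one more subclass of `Destination`, leaving everything that calls `add_file`
untouched.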
### Checksums and compression
Along these lines, one can think of a checksum file as just another file to be added to the destination. Because the
`Destination` is 1) handed every file to be delivered and 2) knows where the files are ultimately going to be placed, we
can handle creating a checksum file as a kind of "pass-through" step that happens automatically. The algorithm would
look something like this:
1. Make a checksum wrapper for the local destination
2. For every file we get asked to deliver, calculate its checksum before handing it off to the wrapped destination for
delivery
3. After we are done delivering files, pass a fake file containing the checksums to the wrapped destination
This kind of "wrapper" or "pass-through" thing happens often enough in object-oriented programming that it is called
the "Decorator pattern." We can handle compression the same way:
1. Make a tar archive in a scratch area somewhere
2. For every file we get asked to deliver, instead place it in the archive in the scratch area
3. After we are done delivering files, finalize the archive and pass it to the wrapped destination
The key idea here is that the next part of the system which finds files to deliver has _no idea_ about whether we are
using compression or calculating checksums or not—in fact, these wrappers are stackable. The part of the system that
finds files to deliver just passes them to the destination, and as long as the stack of wrappers and destinations has
been constructed by someone in the right order, everything will happen as it should.
The purpose of the `DestinationBuilder` is to ensure that the stack is constructed in the right way. The reason
`Destination` has a `close()` method is for these wrappers to know when we are done delivering files so they can take
their finalization steps.
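As a sketch of how the wrappers compose, here is what a checksum decorator might look like, building on the
hypothetical `add_file` method from the sketch above; the SHA-256 choice and the manifest filename are also assumptions.

```python
# Sketch of the Decorator pattern applied to Destination. The wrapped
# destination never knows that checksums are being collected.
import hashlib
import pathlib
import tempfile


class ChecksumDecorator(Destination):
    """Wraps another Destination, recording a checksum for every file that
    passes through, then delivers a checksum manifest at close()."""

    def __init__(self, underneath: Destination):
        self.underneath = underneath
        self.sums = []

    def add_file(self, file: pathlib.Path, relative_path: str) -> None:
        digest = hashlib.sha256(file.read_bytes()).hexdigest()
        self.sums.append(f"{digest}  {relative_path}")
        self.underneath.add_file(file, relative_path)

    def close(self) -> None:
        # Pass the accumulated checksums along as one more file to deliver.
        with tempfile.NamedTemporaryFile("w", delete=False) as manifest:
            manifest.write("\n".join(self.sums) + "\n")
        self.underneath.add_file(pathlib.Path(manifest.name), "SHA256SUMS")
        self.underneath.close()
```

A tar wrapper follows the same shape: divert incoming files into an archive in a scratch area, then hand the finished
archive to the wrapped destination at `close()`. Because every wrapper is itself a `Destination`, the
`DestinationBuilder` can stack them in whatever order the user's options call for.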
## Handling products
If you look at
the [delivery directory requirements](https://open-confluence.nrao.edu/display/SPR/Delivery+Directory+Improvements),
you'll see that there are a number of requirements to group things together based on their project or their telescope,
and the directory names are based on the type of product. Knowing what you have in hand affects the layout in the
delivery directory. This means that we are not always going to have a straightforward `cp` command, because the way
files rest in the spool area doesn't necessarily match the way that they need to be laid out in the delivery directory.
The key idea here is that something, eventually, knows what each product _is_, and the knowledge about how that _type_ is
delivered should live with that _type_, rather than being spread around the system. Execution blocks should know what
execution blocks are supposed to look like when they get delivered; images should know what images should look like when
they are delivered, and so forth. If a new type of product is invented, supporting a wacky delivery format for that
product should be a matter of defining that product type and adding the logic just to that product. This is why we have
a `SpooledProduct` with a single method: `deliver_to(Destination)`. We expect to have a driver that at some level is
passing a destination to each of these products and saying, "write yourself to this destination."
This suggests that when we say "deliver from here to there," we are not saying the same thing as `cp`, which says "copy
these files from here to there"; we are really saying "copy all the products from here to there, according to how each
of these products _should_ be copied." In the beginning, a simple product like an execution block _will_ simply deliver
the files in its directory as-is, but as we support more complex products like OUS requests with images, more
interesting things will happen.
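A sketch of what this might look like in code, reusing the hypothetical `Destination` from the earlier sketches; only
`SpooledProduct` and `deliver_to(Destination)` are named in the design, the rest is illustrative.

```python
# Sketch of products that know how to deliver themselves. Only SpooledProduct
# and deliver_to come from the design above; ExecutionBlock's layout is assumed.
import pathlib
from abc import ABC, abstractmethod


class SpooledProduct(ABC):
    """A product sitting in the spool area that knows how it should be delivered."""

    @abstractmethod
    def deliver_to(self, destination: Destination) -> None:
        ...


class ExecutionBlock(SpooledProduct):
    """The simple case: deliver every file under the execution block directory as-is."""

    def __init__(self, path: pathlib.Path):
        self.path = path

    def deliver_to(self, destination: Destination) -> None:
        for file in self.path.rglob("*"):
            if file.is_file():
                # Keep the execution block's directory name in the delivered layout.
                destination.add_file(file, str(file.relative_to(self.path.parent)))
```

A future image or OUS product would override `deliver_to` to impose whatever layout its type requires, without touching
the finder or the destinations.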
## Finding products
How will we know what the products are that need to be delivered? We can assume we are given a source directory with
products in it, but how do we enumerate them in order to deliver them? The most straightforward answer is that we can simply
iterate the entire directory and match filename patterns with product types; if it ends with `.ms` it's a measurement
set, if it looks like `PPR.xml` it's a pipeline request, etc. Doing this amounts to having a dispatch table of common
filename patterns, which is tedious, but exhaustive and gives our code a fair amount of control.
There is a second way to figure out the products, which is by examining CASA's `piperesults` output file. This file
isn't necessarily present (after all, CASA is not _required_ for every workflow) so this method cannot ever be the
_only_ means of determining the products. But it may eventually be a requirement that we support using the
`piperesults` file. So rather than having a single class here called `ProductFinder`, we instead have an interface
called `ProductFinder` and a `HeuristicProductFinder` that does the filename dispatch approach and a
`PiperesultsProductFinder` that uses the `piperesults` file to figure it out.
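A sketch of the finder side, building on the product sketch above; the dispatch table, the pattern syntax, and the
reuse of `ExecutionBlock` as a stand-in product type are all assumptions.

```python
# Sketch of the ProductFinder interface and the filename-dispatch heuristic.
# The pattern table is illustrative; the real one maps each pattern to its
# own SpooledProduct subclass.
import fnmatch
import pathlib
from abc import ABC, abstractmethod
from typing import Iterator


class ProductFinder(ABC):
    """Knows how to enumerate the products sitting in a spool directory."""

    @abstractmethod
    def find_products(self) -> Iterator[SpooledProduct]:
        ...


class HeuristicProductFinder(ProductFinder):
    """Matches filenames against a dispatch table of patterns and product types."""

    PATTERNS = [
        ("*.ms", ExecutionBlock),     # measurement set; stand-in product type
        ("PPR.xml", ExecutionBlock),  # pipeline request; stand-in product type
    ]

    def __init__(self, spool_dir: pathlib.Path):
        self.spool_dir = spool_dir

    def find_products(self) -> Iterator[SpooledProduct]:
        for entry in self.spool_dir.iterdir():
            for pattern, product_type in self.PATTERNS:
                if fnmatch.fnmatch(entry.name, pattern):
                    yield product_type(entry)
                    break
```

A `PiperesultsProductFinder` would implement the same `find_products` method but read the `piperesults` file instead of
walking the directory.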
## Bringing it all together
So we have a system that finds products, products that know how to write themselves to a destination, and
destinations that know how to handle local filesystem writes, compression and checksumming. This is most of what is
needed. We can see now that we want to have a main loop that looks like this:

    for product in finder.find_products():
        product.deliver_to(destination)

What is still missing is a small amount of plumbing to get us from here to there. We need a device for processing
the command line arguments. Some aspects of delivery are based on user-supplied options: whether we are producing tar
archives or not, whether we are delivering the raw data retrieved by the data fetcher or the products generated by
CASA. Eventually we will have to support a local delivery command line option. Basically, anything the user chooses
in the archive UI that affects delivery is going to arrive to us through the command line options. So we have to add
a command line parser, which we have in `Context`.
A few lessons learned from the legacy delivery system are also reflected in the `Context`. We assume that a few
"services" are available in the `Context` to the `Destination` and `ProductFinder` implementations. For web delivery, we
will eventually need to generate random codes for the URL, but we want those codes to be stable throughout the delivery
process, so there is a way to do that in the `Context`. Temporary file creation is also provided via the `Context`,
which is something the tar and checksum wrappers will eventually need. So the `Context` is available to these classes
at construction time so they can call these services as needed, or peek at command line arguments they may care about.
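A sketch of what the `Context` might provide, with the caveat that every flag name and method name here is an
assumption:

```python
# Sketch of the Context: command-line parsing plus the small "services" the
# wrappers need. Flag names and method names are illustrative assumptions.
import argparse
import os
import pathlib
import secrets
import tempfile


class Context:
    def __init__(self, argv):
        parser = argparse.ArgumentParser(description="deliver products from the spool area")
        parser.add_argument("--tar", action="store_true",
                            help="deliver tar archives rather than loose files")
        parser.add_argument("--raw", action="store_true",
                            help="deliver the fetched raw data rather than CASA products")
        parser.add_argument("--local-destination", type=pathlib.Path,
                            help="deliver to this path instead of the web root")
        self.args = parser.parse_args(argv)
        self._token = None

    def token(self) -> str:
        """Random code for web-delivery URLs; generated once, then stable for the run."""
        if self._token is None:
            self._token = secrets.token_hex(16)
        return self._token

    def create_tempfile(self, suffix: str = "") -> pathlib.Path:
        """Scratch files for the tar and checksum wrappers."""
        handle, name = tempfile.mkstemp(suffix=suffix)
        os.close(handle)
        return pathlib.Path(name)
```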
And that's the theory behind delivery in a nutshell.
# Future Product Architecture
When I request data from the archive, the archive goes and retrieves it. What about if I want to retrieve the data, then
calibrate it, then split the data into several pieces and image each piece? Each of those requests is an independent
_capability request_, but how do we arrange for them to occur without making the user wait for the first request to
complete before requesting the second, and the second before the third?
How can the DAs set up a calibration pipeline, so that whenever a new piece of data is archived, it triggers a
calibration capability? How can a user arrange for their custom processing steps to be executed as soon as their data is
ingested into the archive?
The answer to these questions is the future product system.
## What is a product?
To the archive, a product is something that has a _science product locator_. Science product locators are
URIs and look like this: `uid://alma/execblock/39113b0e-2f2e-4a73-8c95-c75c18b98ee9`.
Every product locator string contains the instrument that generated the product, the "type" of the product that was
generated (`execblock` and `calibration` are common possibilities) and then a globally unique identifier (GUID). If you
have this locator, you have the master key for the archive for this product and can access it and cause it to be
downloaded.
It is **extremely important** that you not try to parse product locators. They are to be treated as opaque identifiers
by everything except the product locator service in the archive.
## What is a future product?
The future product system is a way of referring to products that already exist as well as products that do not yet
exist. Future products exist in one of two states: resolved or not-yet-resolved. The first "base case" kind of future
product wraps a science product locator and begins life resolved. The second "base case" is an external execution block
identifier, which becomes resolved once that execution block is ingested and its science product locator is obtained.
Finally, there is an inductive kind: a capability request, which becomes resolved once that request completes.
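One way to picture the three kinds and their resolution states is the following sketch; the class and field names are
hypothetical, not the actual data model.

```python
# Hypothetical sketch of the three kinds of future products. Field names are
# illustrative; only the three kinds themselves come from the design.
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ScienceProductLocatorFuture:
    """Base case: wraps an existing locator, so it is born resolved."""
    locator: str  # e.g. uid://alma/execblock/...


@dataclass
class ExternalExecutionBlockFuture:
    """Base case: resolves when the execution block is ingested."""
    execution_block_id: str
    locator: Optional[str] = None  # filled in at ingestion time


@dataclass
class CapabilityRequestFuture:
    """Inductive case: resolves when the wrapped capability request completes."""
    capability_request_id: str
    result_locator: Optional[str] = None  # filled in at completion time


FutureProduct = Union[ScienceProductLocatorFuture,
                      ExternalExecutionBlockFuture,
                      CapabilityRequestFuture]


def is_resolved(product: FutureProduct) -> bool:
    """A future product is resolved once it ultimately points at a real product."""
    if isinstance(product, ScienceProductLocatorFuture):
        return True
    if isinstance(product, ExternalExecutionBlockFuture):
        return product.locator is not None
    return product.result_locator is not None
```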
The capability-request future product is what allows the creation of follow-on requests. Each time a follow-on request
is generated, it is generated against a future product of the original request. If this is done a second time, you get a
follow-on for a follow-on request. In this way you can construct a calibration which leads to some kind of data split,
which leads in a tree-like fashion to a number of imaging requests.
If the original request is for an external execution block identifier, then you have the beginning of an integrated
calibration pipeline system, where the system is expecting an execution block to be ingested, and expecting to calibrate
it once that happens. The user ought then to be able to submit (say) a custom imaging request as a follow-on to the
standard calibration, and not have to do anything reactively as the execblock or calibration become available in the
archive.
## How does it work?
When a future product is defined, it begins life in either the resolved or unresolved state, depending on whether it is
wrapping a product locator (resolved) or one of the other types (unresolved).
A new system exists in the capability service, the future product system, which keeps track of future products. Each
product is tracked by its original definition and its current state of resolution. As new requests enter the system, if
they reuse a definition, they become linked because they share the original definition.
The future product system receives events from the archive as products are ingested, and events from the capability
system as requests reach completion. It matches events from these sources to future products that are awaiting events.
At the moment a future product becomes resolved, an event is generated by the future product system which is broadcast
to anyone that might be interested. The capability system is generally going to be interested in these events, so when
they come in, the capability service will see if it has any capability requests sitting at the AWAIT PRODUCT stage,
waiting on this future product. If there is an AWAIT PRODUCT for this product, that step is now completed and the
capability step sequence advances to the next step.
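A rough sketch of the matching logic, reusing the hypothetical types from the sketch above; the event shapes and method
names are assumptions.

```python
# Hypothetical sketch of how the future product system matches incoming events
# to unresolved future products and broadcasts resolutions.
class FutureProductSystem:
    def __init__(self):
        self.pending = []    # unresolved future products being tracked
        self.listeners = []  # e.g. the capability system's AWAIT PRODUCT handling

    def on_ingestion(self, execution_block_id: str, locator: str) -> None:
        """Archive event: an execution block was ingested."""
        for product in self.pending:
            if (isinstance(product, ExternalExecutionBlockFuture)
                    and product.execution_block_id == execution_block_id):
                product.locator = locator
                self._announce(product)

    def on_request_complete(self, capability_request_id: str, result_locator: str) -> None:
        """Capability event: a capability request reached completion."""
        for product in self.pending:
            if (isinstance(product, CapabilityRequestFuture)
                    and product.capability_request_id == capability_request_id):
                product.result_locator = result_locator
                self._announce(product)

    def _announce(self, product) -> None:
        # Broadcast the resolution; the capability system checks for requests
        # sitting at AWAIT PRODUCT on this product and advances them.
        for listener in self.listeners:
            listener(product)
```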
## Examples
### Existing product
1. A capability request is made using a science product locator. The future product wraps the science product locator and it starts life resolved.
2. The AWAIT PRODUCT step sees that the product is already resolved, and the engine advances to the
next step in the sequence immediately.
### Follow-on request
1. A capability request is made as a follow-on to a previous request. The future product is a capability-request
future product; it is defined by wrapping the capability request identifier and begins life unresolved.
2. The AWAIT PRODUCT step notices that the future product is unresolved and leaves the request in a dormant state.
3. When the previous request completes, the future product system receives the event from the capability system that
a capability request completed.
4. The future product system matches the request-complete event to the future product that was awaiting it. The
current-state-of-resolution is updated to a reference to that capability request's result. There are no more resolution
steps required, so the future product is now resolved.
5. The future product system emits an event that this future product is resolved and here is the resolution.
6. The capability system receives that event and matches it to the request in the AWAIT PRODUCT state. The step is
now complete and the engine advances the request to the next step.
### External execution block identifier
1. A capability request is made against an external execution block identifier. The future product is an
external-execution-block-id future product; it begins life unresolved.
2. The AWAIT PRODUCT step notices that it is unresolved and leaves the request in a dormant state.
3. When the execution block is ingested, an event enters the future product system with the science product locator.
The future product is located and the current-state-of-resolution is advanced to this science product locator,
and the product becomes resolved.
4. The future product system emits an event with this resolution.
5. The capability system receives the event and matches it to the request in the AWAIT PRODUCT state. The step is now
complete and the engine advances the request to the next step.