Product Fetcher Architecture
The product fetcher is designed to retrieve data products from our archives. As the name suggests, the input to the
product fetcher is a science product locator, a string like
uid://evla/execblock/27561b56-4c6a-4614-bc26-67e436b5e92c. The science product locator is decoded by a service called
the locator service, which uses the archive's knowledge of different instruments and their storage locations to produce
a location report. The location report contains a list of the files associated with the science
product and information about where they can be obtained. The job of the product fetcher is to interpret the report and
retrieve the files from wherever they may be.
The goals for the product fetcher are:

Accuracy: retrieving the files correctly, including retrying as necessary and verifying file content
Speed: retrieving the files as quickly as possible without sacrificing accuracy

Because the work is mostly I/O bound and accesses many servers, the product fetcher depends on a high degree of
concurrency to achieve speed.
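
As a rough illustration of why concurrency pays off for I/O-bound work, the sketch below runs several simulated fetches on a thread pool. The `fake_fetch` function, the `fetch_all` helper, and the file names are hypothetical stand-ins rather than part of the product fetcher; a real fetch would be a network or database request, and the actual concurrency mechanism may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(name: str) -> str:
    """Hypothetical stand-in for one file retrieval; sleeps to mimic I/O wait."""
    time.sleep(0.1)
    return name


def fetch_all(names: list[str], max_workers: int = 8) -> list[str]:
    """Run the fetches concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, names))


if __name__ == "__main__":
    start = time.monotonic()
    fetched = fetch_all([f"file-{i}.dat" for i in range(8)])
    elapsed = time.monotonic() - start
    print(f"fetched {len(fetched)} files in {elapsed:.2f}s")
```

Because each simulated fetch spends its time waiting rather than computing, the threads overlap their waits, and the total time should be much closer to one fetch than to the sum of all of them.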

Map
I divide the fetching process into two stages: in the first stage, we generate a plan; in the second, we
execute it. The "meat" of the program, and the bulk of its time and effort, is in the second stage, which
is built out of the following pieces:

FileFetcher
The core of the program is what happens inside a FileFetcher. A FileFetcher retrieves a single file. There are several
different ways files can be stored and there is a FileFetcher for each storage medium and access method. At the moment,
this means there are three implementations of FileFetcher:

NgasStreamingFileFetcher, which does a web request against an NGAS resource and writes the result to disk
NgasDirectCopyFileFetcher, which asks NGAS to write a resource to a certain path on disk
OracleXmlFileFetcher, which queries Oracle for a value in a certain row of a certain table and writes the result to a
certain path on disk

FileFetchers have a simple API: you provide them with a file and some validations to run, and they fetch the file
and then run the validations. Because the API is this simple, it is easy to build utility FileFetchers on top of it. We have a few:

RetryingFileFetcher, which retries another file fetcher a certain number of times, to increase our fault tolerance
DryRunFakeFileFetcher, which is used in "dry run" mode to simply print what would be fetched, and in the unit tests
NgasModeDetectingFileFetcher, which is used when the user has no preference for an NGAS access method and just
wants the program to look at the file and the destination and make a decision
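
The wrapping idea behind RetryingFileFetcher can be sketched as follows. This is a minimal sketch under assumptions, not the actual implementation: the `fetch` method name, its parameters, the `Validation` type, and the `FlakyFetcher` test double are all hypothetical, chosen only to show how a fetcher with a small API can delegate to another one.

```python
from pathlib import Path
from typing import Callable, Protocol, Sequence

# Hypothetical validation type: a callable that raises if the file is bad.
Validation = Callable[[Path], None]


class FileFetcher(Protocol):
    """Assumed interface: fetch one file, then run its validations."""

    def fetch(self, destination: Path, validations: Sequence[Validation] = ()) -> Path: ...


class RetryingFileFetcher:
    """Wraps another fetcher and tries it up to `attempts` times."""

    def __init__(self, inner: FileFetcher, attempts: int = 3) -> None:
        if attempts < 1:
            raise ValueError("attempts must be at least 1")
        self.inner = inner
        self.attempts = attempts

    def fetch(self, destination: Path, validations: Sequence[Validation] = ()) -> Path:
        last_error = None
        for _ in range(self.attempts):
            try:
                return self.inner.fetch(destination, validations)
            except Exception as error:  # a real fetcher would catch narrower errors
                last_error = error
        # Every attempt failed: surface the most recent failure.
        raise last_error


class FlakyFetcher:
    """Hypothetical test double that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def fetch(self, destination: Path, validations: Sequence[Validation] = ()) -> Path:
        self.calls += 1
        if self.calls <= self.failures:
            raise IOError("simulated transient failure")
        for validate in validations:
            validate(destination)
        return destination
```

Since the wrapper exposes the same `fetch` method as the fetcher it wraps, callers do not need to know whether retrying is in effect.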


FileValidator
After a file is fetched, we can do some analysis on the file to make sure that it was correctly retrieved. Files
that are stored in NGAS have some associated information we can use: the size and a CRC32 checksum. Files stored
in an Oracle database are XML and can be checked for well-formedness as well as size. These are the three
validators currently supported:

ChecksumValidator, which checks the CRC32 checksum value from NGAS
SizeValidator, which ensures that the files on disk have the size we anticipated
XmlValidator, which checks that the file is well-formed XML (not valid, which would require a schema)
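
The three checks above can be sketched with the Python standard library. The function names and signatures here are hypothetical, and the sketch assumes the expected checksum is an unsigned 32-bit integer as returned by `zlib.crc32`; how NGAS actually encodes its checksum is not stated here.

```python
import xml.etree.ElementTree as ET
import zlib
from pathlib import Path


def check_size(path: Path, expected_size: int) -> None:
    """Raise if the file on disk does not have the anticipated size."""
    actual = path.stat().st_size
    if actual != expected_size:
        raise ValueError(f"{path}: expected {expected_size} bytes, found {actual}")


def check_crc32(path: Path, expected_crc32: int) -> None:
    """Raise if the CRC32 of the file content differs from the expected value."""
    crc = 0
    with path.open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    if crc != expected_crc32:
        raise ValueError(f"{path}: expected CRC32 {expected_crc32}, found {crc}")


def check_well_formed_xml(path: Path) -> None:
    """Raise if the file does not parse as XML; no schema validation is done."""
    try:
        ET.parse(str(path))
    except ET.ParseError as error:
        raise ValueError(f"{path}: not well-formed XML") from error
```

Each check raises on failure and returns nothing on success, which makes the checks easy to run in sequence after a fetch.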


Planning
The first stage of the program consists of building a fetch plan. A FetchPlan is, in a trivial sense, a list of
FileFetchers that need to be executed. So the first thing we need to do is get a location report and ask it to make
us a list of fetchers.

LocationReport
The location report represents the report we get back from the locator service. It just contains a list of locations,
which may be NGAS files or Oracle XML tables, along with the information needed to validate each retrieval and to
place the files relative to some destination directory. That's it.
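
One way to picture the shape of a location report is as a small set of data classes. This is only a sketch: the class names other than LocationReport, and all field names, are hypothetical and do not reflect the locator service's actual response format.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class NgasLocation:
    """Hypothetical NGAS entry; field names are illustrative only."""

    ngas_file_id: str
    relative_path: str  # where to place the file under the destination
    size: int           # used to validate the retrieval
    checksum: int       # CRC32, used to validate the retrieval


@dataclass(frozen=True)
class OracleXmlLocation:
    """Hypothetical Oracle XML entry; field names are illustrative only."""

    table: str
    relative_path: str


@dataclass(frozen=True)
class LocationReport:
    """A list of locations and nothing more."""

    locations: tuple

    def destinations(self, destination_dir: Path) -> list[Path]:
        """Resolve each location's relative path against a destination directory."""
        return [destination_dir / loc.relative_path for loc in self.locations]
```

Keeping the report as plain data leaves all fetching behavior to the FileFetchers that are built from it.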

Locator