Andrew Kapuscinski authored
Product Fetcher Architecture
The product fetcher is designed to retrieve data products from our archives. As the name suggests, the input to the
product fetcher is a science product locator, a string like uid://evla/execblock/27561b56-4c6a-4614-bc26-67e436b5e92c.
The science product locator is decoded by a service called the locator service, which uses the archive's knowledge of
different instruments and their storage locations to produce something called a location report. The location report
contains a list of files that are associated with the science product, and information about where they can be
obtained. The job of the product fetcher is to interpret the report and retrieve the files from wherever they may be.
The goals for the product fetcher are:
- Accuracy: retrieving the files correctly, including retrying as necessary and verifying file content
- Speed: retrieving the files as quickly as possible without sacrificing accuracy
Because the work is mostly I/O bound and accesses many servers, the product fetcher depends on a high degree of concurrency to achieve speed.
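As a rough illustration of why concurrency pays off for I/O-bound work, here is a minimal sketch using Python's standard concurrent.futures module. The fetch_one function and the file names are hypothetical stand-ins, and the thread pool is only one possible approach; the product fetcher's actual concurrency mechanism may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_one(name: str) -> str:
    """Hypothetical stand-in for one file retrieval; sleeps to mimic network I/O."""
    time.sleep(0.05)
    return f"fetched {name}"


def fetch_all(names: list[str], max_workers: int = 8) -> list[str]:
    """Run the retrievals on a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, names))
```

Because each worker spends most of its time waiting on I/O, the total wall-clock time is closer to that of the slowest batch of retrievals than to the sum of all of them.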
Map
I divide the fetching process into two stages. In the first stage, we're generating a plan; in the second stage, we're executing the plan. The "meat" of the program, along with the bulk of the time and effort, is in the second stage, which is built out of the following pieces:
FileFetcher
The core of the program is what happens inside a FileFetcher. A FileFetcher retrieves a single file. There are several different ways files can be stored and there is a FileFetcher for each storage medium and access method. At the moment, this means there are three implementations of FileFetcher:
- NgasStreamingFileFetcher, which does a web request against an NGAS resource and writes the result to disk
- NgasDirectCopyFileFetcher, which asks NGAS to write a resource to a certain path on disk
- OracleXmlFileFetcher, which queries Oracle for a value in a certain row of a certain table and writes the result to a certain path on disk
FileFetchers have a simple API: you provide them with a file and some validations to run, and they fetch the file and then run the validations. Because the design is fairly simple, we have a few utility FileFetchers:
- RetryingFileFetcher, which retries another file fetcher a certain number of times, to increase our fault tolerance
- DryRunFakeFileFetcher, which is used in "dry run" mode to simply print what would be fetched, and in the unit tests
- NgasModeDetectingFileFetcher, which is used when the user has no preference for an NGAS access method and just wants the program to look at the file and the destination and make a decision
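The FileFetcher API and the retrying wrapper described above might look roughly like the following. This is a sketch under assumptions, not the actual implementation: the method names retrieve and fetch, the Validation type, and the attempts parameter are all invented for illustration, and I assume that a failed validation counts as a failure worth retrying.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable, Sequence

# Assumption: a validation is any callable that raises if the file is bad.
Validation = Callable[[Path], None]


class FileFetcher(ABC):
    """Assumed shape of the API: retrieve one file, then validate it."""

    @abstractmethod
    def retrieve(self, destination: Path) -> None:
        """Write the file to `destination`; each storage medium implements this."""

    def fetch(self, destination: Path, validations: Sequence[Validation] = ()) -> Path:
        self.retrieve(destination)
        for validate in validations:
            validate(destination)
        return destination


class RetryingFileFetcher(FileFetcher):
    """Wraps another fetcher and retries it up to `attempts` times."""

    def __init__(self, inner: FileFetcher, attempts: int = 3) -> None:
        if attempts < 1:
            raise ValueError("attempts must be at least 1")
        self.inner = inner
        self.attempts = attempts

    def retrieve(self, destination: Path) -> None:
        self.inner.retrieve(destination)

    def fetch(self, destination: Path, validations: Sequence[Validation] = ()) -> Path:
        for attempt in range(1, self.attempts + 1):
            try:
                # Retrieval and validation are retried together.
                return self.inner.fetch(destination, validations)
            except Exception:
                # Deliberately broad for the sketch; re-raise on the last attempt.
                if attempt == self.attempts:
                    raise
```

Because the wrapper is itself a FileFetcher, it can wrap any of the concrete fetchers (or another utility fetcher) without the caller knowing the difference.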
FileValidator
After a file is fetched, we can do some analysis on the file to make sure that it was correctly retrieved. Files that are stored in NGAS have some associated information we can use: the size and a CRC32 checksum. Files stored in an Oracle database are XML, so we can check their size and also check that they are well-formed. These are the three validators currently supported:
- ChecksumValidator, which checks the CRC32 checksum value from NGAS
- SizeValidator, which ensures that the files on disk have the size we anticipated
- XmlValidator, which checks that the file is well-formed XML (it does not check validity, which would require a schema)
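The three checks can be sketched with the Python standard library alone. The class names follow the list above, but the constructor arguments, the ValidationError exception, and the crc32_of helper are hypothetical. The sketch also assumes the expected checksum is an unsigned 32-bit integer, which is what zlib.crc32 returns in Python 3; the representation NGAS actually reports may need converting.

```python
import xml.etree.ElementTree as ET
import zlib
from pathlib import Path


class ValidationError(Exception):
    """Hypothetical exception raised when a fetched file fails a check."""


def crc32_of(path: Path, chunk_size: int = 1 << 20) -> int:
    """Compute the CRC32 of a file incrementally, without loading it whole."""
    crc = 0
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


class ChecksumValidator:
    def __init__(self, expected_crc32: int) -> None:
        self.expected_crc32 = expected_crc32

    def __call__(self, path: Path) -> None:
        actual = crc32_of(path)
        if actual != self.expected_crc32:
            raise ValidationError(f"{path}: CRC32 {actual} != {self.expected_crc32}")


class SizeValidator:
    def __init__(self, expected_size: int) -> None:
        self.expected_size = expected_size

    def __call__(self, path: Path) -> None:
        actual = path.stat().st_size
        if actual != self.expected_size:
            raise ValidationError(f"{path}: size {actual} != {self.expected_size}")


class XmlValidator:
    def __call__(self, path: Path) -> None:
        try:
            with path.open("rb") as handle:
                ET.parse(handle)  # checks well-formedness only, not validity
        except ET.ParseError as error:
            raise ValidationError(f"{path}: not well-formed XML") from error
```

Each validator is a plain callable that raises on failure, so a fetcher can run any mix of them in a loop after the file lands on disk.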
Planning
The first stage of the program consists of building a fetch plan. A FetchPlan is, in a trivial sense, a list of FileFetchers that need to be executed. So the first thing we need to do is get a location report and ask it to make us a list of fetchers.
LocationReport
The location report represents the report we get back from the locator service. It just contains a list of locations, which may be NGAS files or Oracle XML tables, along with the information needed to validate each retrieval and to place the files relative to some destination directory. That's it.
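To make the planning step concrete, here is a minimal sketch of a location report turning itself into a list of planned fetches. Every field name, the kind values "ngas" and "oracle_xml", and the PlannedFetch record are assumptions made for illustration; the real report format comes from the locator service and is not shown here. Fetchers and validators are referred to by name only, to keep the sketch self-contained.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class FileLocation:
    """Hypothetical entry in a location report."""
    kind: str                      # assumed values: "ngas" or "oracle_xml"
    relative_path: str             # where the file goes under the destination
    size: Optional[int] = None     # expected size, if known
    crc32: Optional[int] = None    # expected checksum (NGAS files only)


@dataclass(frozen=True)
class PlannedFetch:
    """Hypothetical plan entry: which fetcher to use, where, and what to check."""
    fetcher: str
    destination: Path
    validators: tuple[str, ...]


# Assumed mapping from storage kind to fetcher and validator names.
_PLANS = {
    "ngas": ("NgasModeDetectingFileFetcher", ("SizeValidator", "ChecksumValidator")),
    "oracle_xml": ("OracleXmlFileFetcher", ("SizeValidator", "XmlValidator")),
}


@dataclass
class LocationReport:
    locations: list[FileLocation]

    def plan(self, destination_dir: Path) -> list[PlannedFetch]:
        """Build one planned fetch per location, in report order."""
        planned = []
        for location in self.locations:
            if location.kind not in _PLANS:
                raise ValueError(f"unknown location kind: {location.kind!r}")
            fetcher, validators = _PLANS[location.kind]
            destination = destination_dir / location.relative_path
            planned.append(PlannedFetch(fetcher, destination, validators))
        return planned
```

For example, LocationReport([FileLocation("ngas", "a/b.bin", size=10, crc32=1)]).plan(Path("out")) yields a single entry destined for out/a/b.bin, to be checked for size and checksum.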