
Architecture of the Product Fetcher

The product fetcher is designed to retrieve data products from our archives. As the name suggests, the input to the product fetcher is a science product locator, a string like uid://evla/execblock/27561b56-4c6a-4614-bc26-67e436b5e92c. The science product locator is decoded by a service called the locator service, which uses the archive's knowledge of different instruments and their storage locations to produce something called a location report. The location report contains a list of files that are associated with the science product, and information about where they can be obtained. The job of the product fetcher is to interpret the report and retrieve the files from wherever they may be.

The goals for the product fetcher are:

  • Accuracy: retrieving the files correctly, including retrying as necessary and verifying file content
  • Speed: retrieving the files as quickly as possible without sacrificing accuracy

Because the work is mostly I/O bound and accesses many servers, the product fetcher depends on a high degree of concurrency to achieve speed.

Map

I divide the fetching process into two stages: in the first stage, we generate a plan; in the second stage, we execute the plan. The "meat" of the program, and the bulk of the time and effort, takes place in the second stage, which is built out of the following pieces:

FileFetcher

The core of the program is what happens inside a FileFetcher. A FileFetcher retrieves a single file. There are several different ways files can be stored and there is a FileFetcher for each storage medium and access method. At the moment, this means there are three implementations of FileFetcher:

  • NgasStreamingFileFetcher, which does a web request against an NGAS resource and writes the result to disk
  • NgasDirectCopyFileFetcher, which asks NGAS to write a resource to a certain path on disk
  • OracleXmlFileFetcher, which queries Oracle for a value in a certain row of a certain table and writes the result to a certain path on disk

FileFetchers have a simple API: you provide them with a file and some validations to run, and they fetch the file and then run the validations. Because the design is fairly simple, we have a few utility FileFetchers:

  • RetryingFileFetcher, which retries another file fetcher a certain number of times, to increase our fault tolerance
  • DryRunFakeFileFetcher, which is used in "dry run" mode to simply print what would be fetched, and in the unit tests
  • NgasModeDetectingFileFetcher, which is used when the user has no preference for an NGAS access method and just wants the program to look at the file and the destination and make a decision
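The FileFetcher API and the retrying decorator described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual implementation; the class and method names beyond those mentioned in the text are assumptions.

```python
from abc import ABC, abstractmethod


class FileFetcher(ABC):
    """Fetches a single file, then runs the supplied validations.

    Hypothetical sketch of the interface described in the text; the real
    signature (arguments, return type) may differ.
    """

    @abstractmethod
    def fetch(self) -> None:
        ...


class RetryingFileFetcher(FileFetcher):
    """Wraps another FileFetcher and retries it a fixed number of times,
    increasing fault tolerance against transient failures."""

    def __init__(self, underlying: FileFetcher, retries: int = 3):
        self.underlying = underlying
        self.retries = retries

    def fetch(self) -> None:
        for attempt in range(self.retries):
            try:
                return self.underlying.fetch()
            except IOError:
                # Re-raise only once we've exhausted our attempts
                if attempt == self.retries - 1:
                    raise
```

A design like this lets retrying compose with any storage backend: a RetryingFileFetcher can wrap an NGAS fetcher or an Oracle fetcher without knowing the difference.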

FileValidator

After a file is fetched, we can do some analysis on it to make sure that it was correctly retrieved. Files that are stored in NGAS have some associated information we can use: the size and a CRC32 checksum. Files stored in an Oracle database are XML, so they can be checked for size and for well-formedness. These are the three validators currently supported:

  • ChecksumValidator, which checks the CRC32 checksum value from NGAS
  • SizeValidator, which ensures that the files on disk have the size we anticipated
  • XmlValidator, which checks that the file is well-formed XML (not valid, which would require a schema)
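Minimal sketches of the three validators, assuming they expose a `validate(path)` method (an assumption; the real interface may differ). Python's standard library covers all three checks:

```python
import os
import zlib
import xml.etree.ElementTree as ET


class SizeValidator:
    """Checks that the file on disk has the size we anticipated."""

    def __init__(self, expected_size: int):
        self.expected_size = expected_size

    def validate(self, path: str) -> bool:
        return os.path.getsize(path) == self.expected_size


class ChecksumValidator:
    """Checks the file contents against the CRC32 value NGAS reported."""

    def __init__(self, expected_crc32: int):
        self.expected_crc32 = expected_crc32

    def validate(self, path: str) -> bool:
        with open(path, "rb") as f:
            return zlib.crc32(f.read()) == self.expected_crc32


class XmlValidator:
    """Checks well-formedness only; validity would require a schema."""

    def validate(self, path: str) -> bool:
        try:
            ET.parse(path)
            return True
        except ET.ParseError:
            return False
```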

Planning

The first stage of the program consists of building a fetch plan. A FetchPlan is, in a trivial sense, a list of FileFetchers that need to be executed. So the first thing we need to do is get a location report and ask it to make us a list of fetchers.

LocationReport

The location report represents the report we get back from the locator service. It just contains a list of locations, which may be NGAS files or Oracle XML tables, and the information needed to validate the retrieval and where to place the files relative to some destination directory. That's it.
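The shape of a location report might be modeled like this. The field names here are illustrative assumptions, not the report's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileSpec:
    """One entry in a location report (hypothetical field names)."""
    relative_path: str        # where to place the file under the destination directory
    size: int                 # expected size in bytes, for validation
    checksum: Optional[str]   # CRC32 from NGAS, when the source provides one
    server: Optional[str]     # NGAS host, or None for an Oracle XML entry


@dataclass
class LocationReport:
    """The parsed report returned by the locator service: just a list of
    locations and enough information to validate each retrieval."""
    files: List[FileSpec] = field(default_factory=list)
```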

Locator

LocationReports come to us from two sources: from an archive REST service ("the locator service"), or from a file on disk supplied at the start of execution. The latter is usually for testing, but could occur in the wild if we needed to run a fetch without access to the locator service. Because we have two ways to obtain LocationReports, we have a Locator interface with two implementations:

  • FileLocator: given a file, it parses the file to obtain the LocationReport
  • ServiceLocator: given a science product locator, queries the service to obtain the LocationReport
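A sketch of the Locator interface and the file-backed implementation, assuming the on-disk report is JSON (an assumption; the text does not specify the serialization format):

```python
from abc import ABC, abstractmethod
import json


class Locator(ABC):
    """Obtains a location report (modeled here as a plain dict)."""

    @abstractmethod
    def locate(self) -> dict:
        ...


class FileLocator(Locator):
    """Parses a report previously saved to disk, e.g. for testing or for
    running without access to the locator service."""

    def __init__(self, path: str):
        self.path = path

    def locate(self) -> dict:
        with open(self.path) as f:
            return json.load(f)
```

A ServiceLocator would implement the same `locate()` method by querying the locator service with a science product locator; the rest of the program never needs to know which one it is talking to.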

FetcherFactory

Earlier in the design I mentioned that we need to get FileFetchers from the LocationReport. This is actually a responsibility of the LocationReport itself, because it knows what kinds of entries appear in the report. But, users can specify constraints on the details of how a file can be fetched, such as requiring NGAS streaming or doing a dry run. The LocationReport needs to generate FileFetchers, but its choices need to be parameterized somehow by user input.

The solution is simple: the LocationReport relies on a FetcherFactory. The FetcherFactory provides an API to make the kinds of fetchers we need. The LocationReport asks the FetcherFactory to make fetchers for the different flavors of file it has. FetcherFactory is implemented just once, by ConfiguredFetcherFactory, which is parameterized by various settings from the command-line arguments. ConfiguredFetcherFactory knows, for instance, that if the user asked for a dry run, then when the LocationReport wants a FileFetcher it should return a DryRunFakeFileFetcher instead of (say) an NgasStreamingFileFetcher.
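The factory pattern described above might look like this. The method names and constructor parameters are assumptions for illustration; only the class names come from the text:

```python
class DryRunFakeFileFetcher:
    """Stands in for a real fetcher and merely reports what it would do."""

    def __init__(self, file):
        self.file = file

    def fetch(self):
        print(f"would fetch {self.file}")


class NgasStreamingFileFetcher:
    """Placeholder for the real streaming fetcher (HTTP request omitted)."""

    def __init__(self, file):
        self.file = file

    def fetch(self):
        raise NotImplementedError("real implementation does a web request")


class ConfiguredFetcherFactory:
    """Builds fetchers according to settings from the command line."""

    def __init__(self, dry_run: bool = False):
        self.dry_run = dry_run

    def fetch_ngas(self, file):
        # The LocationReport calls this without knowing whether the user
        # asked for a dry run; the factory makes that decision.
        if self.dry_run:
            return DryRunFakeFileFetcher(file)
        return NgasStreamingFileFetcher(file)
```

The key point is that the LocationReport stays ignorant of command-line concerns: it asks for "a fetcher for this NGAS file," and the factory decides which concrete class that means.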

FetchPlan

A FetchPlan is really a way of aggregating FileFetchers together. It turns out that the API for a FetchPlan is the same as the API for a FileFetcher, except we don't have a return value from fetch. But, knowing that we have a list of FileFetchers to execute, we can execute them in parallel. This gives us a performance benefit.

There's one implementation of FetchPlan in the system, which is ParallelFetchPlan. The ParallelFetchPlan is parameterized by a concurrency level. Internally, it works by using a concurrent.futures ThreadPoolExecutor to run tasks in parallel. Each FileFetcher is a task, so if you have (say) 300 FileFetchers and a concurrency level of 16, you will see 16 of those fetches running at a time until the list is exhausted.
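A minimal sketch of ParallelFetchPlan built on `concurrent.futures.ThreadPoolExecutor`, as the text describes. The constructor signature is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor


class ParallelFetchPlan:
    """Runs each FileFetcher as a task on a bounded thread pool, so at most
    `concurrency` fetches are in flight at a time."""

    def __init__(self, fetchers, concurrency: int = 16):
        self.fetchers = fetchers
        self.concurrency = concurrency

    def fetch(self) -> None:
        with ThreadPoolExecutor(max_workers=self.concurrency) as pool:
            futures = [pool.submit(fetcher.fetch) for fetcher in self.fetchers]
            for future in futures:
                future.result()  # propagate any exception from a fetch
```

Because the work is I/O bound, threads are a good fit here: the GIL is released while each task waits on the network or on disk writes.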

FetchContext

What's left is the handling of optional flags and the generation of the FetchPlan. It turns out to be unwise to exceed a certain level of parallelism against a single server. In the past, files were grouped by server; a simpler mechanism that achieves the same end is used here: each file can be asked whether it shares resources with another file. This is true of NGAS fetches that talk to the same NGAS host, and it's true of all OracleXml fetches, because they all use the same database connection. The FetchContext is responsible for generating the fetch plan: it takes the file groups established this way, makes a FetchPlan for each group, and then wraps those FetchPlans in one overall FetchPlan:

FetchPlan (overall)
├── FetchPlan (server 1)
│   ├── FetchNgasFile #1
│   ├── FetchNgasFile #2
│   ├── FetchNgasFile #3
│   ├── ...
├── FetchPlan (server 2)
│   ├── FetchNgasFile #1
│   ├── FetchNgasFile #2
│   ├── FetchNgasFile #3
│   ├── ...
├── FetchPlan (ALMA SDMs)
│   ├── OracleXmlFetcher #1
│   ├── OracleXmlFetcher #2
│   ├── ...

If parallel fetching is enabled, the overall FetchPlan will launch (in this example) three threads, one per sub-plan, and each sub-plan will in turn launch (say) 4 worker threads to conduct its own fetching, for a total of 12 concurrent fetches. The fetch should be pretty fast.
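The grouping step can be sketched as a greedy pass over the files, assuming each file object exposes the shares-resources predicate described above (the function and method names here are hypothetical):

```python
def group_by_shared_resources(files):
    """Partition files into groups whose members share a resource
    (the same NGAS host, or the single Oracle connection)."""
    groups = []
    for f in files:
        for group in groups:
            # Comparing against the group's first member is enough,
            # because sharing a resource is transitive within a group.
            if f.shares_resources_with(group[0]):
                group.append(f)
                break
        else:
            groups.append([f])
    return groups
```

Each resulting group becomes its own FetchPlan with its own concurrency cap, which is what keeps any single server from being overwhelmed.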

FetchContext also parses the command line arguments, which it feeds into the ConfiguredFetcherFactory.

The main body of the program is then simply this:

context = FetchContext.parse_commandline(args)
plan = context.generate_plan()
plan.fetch()