How Workflows Run in the DSOC or NAASC
Introduction
Our workflows are executed by HTCondor, which is a kind of job scheduler.
HTCondor does a lot of magic behind the scenes, but the interface is simple: you make a directory containing a job description file and submit it. The machine you submit from is the submit host. HTCondor queues the job and eventually matches it with an execution host. The directory is copied to the execution host and, according to the job description, an executable is run. When the job completes, HTCondor copies the results back to the submit host. Throughout, a log file on the submit host is updated with everything that has occurred.
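To make that concrete, here is a minimal sketch of a submission using the HTCondor Python bindings (the API of recent versions); the executable and file names are illustrative only, not part of our actual workflows.

```python
# Minimal sketch: describe a job, submit it from the submit host, and let
# HTCondor transfer files and write the log. Names here are illustrative.
import htcondor

job = htcondor.Submit({
    "executable": "run_step.sh",           # hypothetical wrapper script in the submit directory
    "transfer_input_files": "metadata.json",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",  # results come back when the job finishes
    "log": "condor.log",                   # HTCondor appends job events here as they happen
    "output": "condor.out",
    "error": "condor.err",
})

schedd = htcondor.Schedd()                 # the schedd on this machine: the submit host
result = schedd.submit(job)
print(f"submitted as cluster {result.cluster()}")
```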
Data Locality
We have data in NGAS both at the DSOC and at the NAASC. The data at the DSOC are VLA and VLBA; at the NAASC they are ALMA and GBT. There are other datacenters besides these two, like NMT, and hypothetical ones like UW Madison, AWS, or XSEDE. These datacenters may or may not have locality to various data: for example, NMT nodes are not physically far from the DSOC nodes (I don't know whether the link to NMT is faster or not), but the UW Madison and AWS clusters are not effectively local to anything. It's important for jobs to run local to the data they need, because it's cheaper to move the computation to the data than the data to the computation.
An important attribute of HTCondor is that the submit host and the execution host do not have to belong to the same datacenter or be on the same network. The Workspaces system relies on this fact: a single workflow service running on a single submit host is enough to arrange for jobs to execute with locality to ALMA or VLA data, or, more plainly, to submit jobs that run within the DSOC or NAASC datacenters, or even in others we don't yet possess.
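One way a single submit host could steer jobs to a particular datacenter is with a requirements expression. This is only a sketch: "Site" is a hypothetical custom ClassAd attribute that the execute nodes would have to advertise; HTCondor does not define it out of the box.

```python
# Sketch: steering a job to a particular datacenter from one submit host.
# "Site" is a hypothetical custom machine ClassAd attribute (e.g. Site = "DSOC"
# or Site = "NAASC") that we would configure on the execute nodes ourselves.
import htcondor

def submit_near_data(base: dict, site: str) -> htcondor.Submit:
    """Return a submit description constrained to machines advertising `site`."""
    description = dict(base)
    description["requirements"] = f'(Site == "{site}")'
    return htcondor.Submit(description)

# An ALMA reprocessing job should land on NAASC-local machines, a VLA job on DSOC-local ones.
alma_job = submit_near_data({"executable": "run_step.sh"}, "NAASC")
vla_job = submit_near_data({"executable": "run_step.sh"}, "DSOC")
```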
Data Staging
HTCondor
HTCondor creates a heterogeneous cluster, meaning the execution hosts in the cluster don't have to be similar to each other. HTCondor just makes the job come to life in a directory containing the files mentioned in the job description. So HTCondor is happy to move some files around for you, and there is a plugin architecture that makes this process extensible. This means there are essentially two ways of staging data for your job:
- Put your data in the directory you submit from and inform HTCondor; HTCondor will copy it to the job for you
- Your job can retrieve its own data when it starts
By default, there is no way to dictate which network interfaces are used. This prevents us from using HTCondor's data staging for bulk data, unless we write a plugin.
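The two staging styles look like this in a submit description; this is a sketch only, and the scripts and file names are illustrative.

```python
# Sketch of the two staging styles as submit descriptions.
import htcondor

# 1. Let HTCondor move the files: list them in the job description and they
#    are copied from the submit directory to the execution host.
condor_staged = htcondor.Submit({
    "executable": "process.sh",
    "transfer_input_files": "inputs.tar, parameters.json",
    "should_transfer_files": "YES",
})

# 2. Let the job fetch its own data: transfer nothing up front and have the
#    wrapper script's first step retrieve the data itself (e.g. the datafetcher).
self_staged = htcondor.Submit({
    "executable": "fetch_then_process.sh",  # hypothetical wrapper: fetch, then run
    "should_transfer_files": "YES",          # still used to bring results back
})
```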
NGAS
Most jobs use archive data as their starting point. We have a program called the datafetcher whose job is to copy files out of NGAS. NGAS has a direct-copy plugin we authored to speed retrieval from NGAS to Lustre; the direct-copy path uses InfiniBand and is thus 10x faster than streaming the data out over 1 Gbit Ethernet. This may introduce a dependency: the job must be able to see the same Lustre filesystem as the NGAS cluster.
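A job's early fetch step might look something like the sketch below. The datafetcher's real command-line options are not spelled out here; the flags shown are placeholders for whatever arguments it actually takes.

```python
# Sketch of a job's early fetch step; the datafetcher flags are placeholders.
import subprocess

def fetch_inputs(product_locator: str, workdir: str) -> None:
    """Run the datafetcher so the job's inputs land in its working directory."""
    subprocess.run(
        ["datafetcher", "--product-locator", product_locator, "--output-dir", workdir],
        check=True,
    )
```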
Follow-on requests
An important piece of functionality in the capability system is that it permits "follow-on requests." A follow-on request uses the result of another request. The exact stage of the earlier request's lifecycle doesn't matter, as long as its results have not been cleaned from disk. This introduces a second way for jobs to get their initial data, besides the datafetcher/NGAS: the results of a request are kept in some accessible location for later requests to use. Workflows that start from an earlier request shouldn't need to be structurally different from workflows that start from archived data.
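One way to keep the two cases structurally identical is for the job's staging step to branch on where the inputs come from, as in the sketch below; the results root and directory layout are hypothetical.

```python
# Sketch: one staging step that serves both fresh and follow-on requests.
import shutil
import subprocess
from pathlib import Path
from typing import Optional

RESULTS_ROOT = Path("/lustre/naasc/workspaces/results")   # hypothetical layout

def stage_inputs(previous_request: Optional[str], product_locator: str, workdir: Path) -> None:
    if previous_request is not None:
        # Follow-on request: the earlier request's results are still on disk,
        # so copy them into this job's working directory.
        shutil.copytree(RESULTS_ROOT / previous_request, workdir, dirs_exist_ok=True)
    else:
        # Fresh request: pull the inputs out of NGAS with the datafetcher
        # (the flags here are placeholders for its real options).
        subprocess.run(["datafetcher", "--product-locator", product_locator,
                        "--output-dir", str(workdir)], check=True)
```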
This functionality is an important aspect of the workspaces system, but I do not think it is part of the minimum-viable product of workspaces (that decision is actually up to Mark Lacy and Jeff Kern) nor do I know whether this functionality is expected early or late in SRDP development.
Workflows Plan
Cluster setup
The path forward here seems quite simple:
- SCG sets up a node locally which pretends to be a NAASC/ALMA-local node
- SSA conducts tests using that node
- Meanwhile, SCG locates a node in the NAASC for further testing and sets it up
- SCG migrates hosts shipman and hamilton to become submit hosts for the production HTCondor cluster
Data staging
Part of this was agreed upon in the meeting and part in a post-meeting discussion.
- Our jobs will not use HTCondor for bulk data transfer into the execution directory
- Transferring data in is handled by the datafetcher in an early step of the job, for NGAS (archived) data
- Therefore, jobs must be executed in the datacenter with appropriate locality for the data to be fetched
- Our software will create submit directories on the submit host's Lustre
- Wherever jobs run, HTCondor will copy results back to this location on Lustre. This will be a plain Lustre copy if the execution host happens to be in the same cluster; if it isn't, rsync is used (see the sketch after this list)
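Here is a minimal sketch of that copy-back decision. In practice this is the choice we would want HTCondor (or an output-transfer plugin) to make for us; the paths and hostnames are illustrative.

```python
# Sketch: return results to the submit directory on Lustre, using a plain
# filesystem copy when that Lustre is visible locally and rsync otherwise.
import shutil
import subprocess
from pathlib import Path

def return_results(results_dir: Path, submit_dir: str, submit_host: str) -> None:
    target = Path(submit_dir)
    if target.parent.exists():
        # Same cluster: the submit host's Lustre is mounted here, so a
        # plain filesystem copy is all that is needed.
        shutil.copytree(results_dir, target, dirs_exist_ok=True)
    else:
        # Different datacenter: stream the results back over the network.
        subprocess.run(["rsync", "-a", f"{results_dir}/", f"{submit_host}:{submit_dir}/"],
                       check=True)
```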
That plan leaves two lingering questions:
- How do we populate the job with data from an earlier request?
- How do we create the Lustre directory in the NAASC, which is not visible to the submit host?
The first could be solved either by using HTCondor's transfer mechanism (which may amount to a local copy on Lustre for same-datacenter jobs) or by adapting the datafetcher. In the latter case, we need a safe way of providing an rsync out of the datacenter that holds the files on Lustre.
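For the HTCondor-transfer option, the follow-on job's submit description might look like the sketch below, assuming the earlier request's results sit at a path visible to the submit host (the path shown is hypothetical).

```python
# Sketch: let HTCondor carry an earlier request's results to the new job.
import htcondor

previous_results = "/lustre/aoc/workspaces/results/req-1234"   # hypothetical path

followon_job = htcondor.Submit({
    "executable": "reprocess.sh",
    # Listing the directory asks HTCondor to transfer it and its contents
    # to the job's scratch directory on the execution host.
    "transfer_input_files": previous_results,
    "should_transfer_files": "YES",
})
```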
The second is a little more interesting. The path should exist before the job starts, so that the HTCondor copy-out step can be used to retrieve the results. But we want to have one submit host in the DSOC and not have extra software that has to be deployed and running all the time in the NAASC. Perhaps James Robnett has an idea about what to do here.
Science Data Flow

In the simple download case, data is retrieved from NGAS to Lustre, from which it is served to the user over HTTP.
For reprocessing, the most general case is one in which the data is retrieved to a foreign datacenter for processing and then copied back.

There is a lingering question here about how to fetch data from NGAS at a remote location. The streaming system that NGAS offers uses HTTP and cannot benefit from InfiniBand. This option sucks, basically. A better alternative would be implementing Globus/GridFTP as an NGAS plugin. An intermediate solution might be to have a local job stage the data from NGAS to Lustre prior to executing the cluster job.
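That intermediate solution could be expressed as a two-step DAG: a staging job pinned next to NGAS and Lustre, followed by the processing job wherever it is matched. This is only a sketch; the submit-file and script names are hypothetical, and the ordering uses DAGMan's PARENT/CHILD syntax.

```python
# Sketch: stage locally first, then process, chained with DAGMan.
from pathlib import Path

dag = """\
# stage: runs the datafetcher next to NGAS/Lustre (e.g. constrained to DSOC nodes)
JOB stage stage.sub
# process: the real work; may be matched to a remote datacenter
JOB process process.sub
PARENT stage CHILD process
"""

Path("fetch_then_process.dag").write_text(dag)
# Submitted with: condor_submit_dag fetch_then_process.dag
```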
To make reprocessing jobs relocatable, they must always behave as they would in the most general case above; but when HTCondor jobs are scheduled inside the NRAO datacenter, their local storage will actually be on Lustre, and the data flow reduces to something much closer to the first diagram:

This is because the datafetcher will preferentially use the local copy option when it is available, and the rsync step can be replaced with a local cp, where possible, instead of the streaming network mechanism.
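The fetch-side version of that "prefer the local path" behaviour might look like this sketch; the paths and the streaming fallback are placeholders, not the datafetcher's actual implementation.

```python
# Sketch: copy directly when the file is visible on local Lustre, otherwise
# fall back to streaming it over HTTP.
import shutil
import urllib.request
from pathlib import Path

def fetch_file(lustre_path: Path, stream_url: str, destination: Path) -> None:
    if lustre_path.exists():
        # Same datacenter: the NGAS volume is visible, so a direct copy rides
        # InfiniBand/Lustre instead of the slower network path.
        shutil.copy2(lustre_path, destination)
    else:
        # Remote datacenter: stream over HTTP as a last resort.
        with urllib.request.urlopen(stream_url) as response, open(destination, "wb") as out:
            shutil.copyfileobj(response, out)
```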