
How Workflows Run in the DSOC or NAASC
======================================

Introduction
------------

Our workflows are executed by HTCondor, a job scheduler. Unlike other job schedulers, HTCondor is designed to support a heterogeneous cluster and places more emphasis on the reliability of the processing than on its speed.

HTCondor does a lot of magic behind the scenes, but the interface is simple: you make a directory with a job description file in it and submit it. The machine you submit from is the submit host. HTCondor queues the job and eventually matches it with an execution host. The directory is copied to the execution host and, according to the job description, an executable is run. When the job completes, HTCondor copies the results back to the submit host. Throughout, a log file is updated with everything that has occurred. Basically, you can pretend that it works as in this diagram:

.. image:: images/htcondor-overview.png
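
For illustration, a minimal job description might look something like the
following sketch; the executable and file names here are hypothetical, not
taken from any real workflow::

   # job.condor -- a hypothetical minimal job description
   executable           = process.sh    # run on the execution host
   arguments            = input.ms
   transfer_input_files = input.ms      # staged from the submit directory
   output               = process.out   # stdout, copied back when the job ends
   error                = process.err   # stderr, likewise
   log                  = process.log   # HTCondor appends every event here
   queue

Submitting it with ``condor_submit job.condor`` queues one job; ``condor_q``
shows its progress, and the log file records matching, execution, and
completion.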

Labels
------

HTCondor allows you to specify additional requirements for a job. These requirements act like labels that the execute host must possess. We currently use three:

  • HasLustre: the execute host must have access to the Lustre network file system
  • CVPOST: the execute host must be in Charlottesville
  • VLASS: the execute host must be a VLASS-sponsored node

These labels do not form partitions; for instance, some VLASS nodes have Lustre and some do not.
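
In the job description file these labels appear inside a ``requirements``
expression, which HTCondor evaluates against the attributes each execute host
advertises. A sketch, assuming the attribute names above::

   # Run only on a VLASS-sponsored node that also mounts Lustre
   requirements = (HasLustre == True) && (VLASS == True)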

Data Locality
-------------

The data locality situation is documented in a table maintained by the SCG and CIS; the following diagram illustrates it:

.. image:: images/data-locality-nrao.png

We have data in NGAS both at the DSOC and at the NAASC. The data at the DSOC are VLA and VLBA; at the NAASC they are ALMA and GBT. There are other datacenters besides these two, like NMT, as well as hypothetical ones like UW Madison, AWS, or XSEDE. These datacenters may or may not have some degree of locality to the various data; for example, NMT nodes are not physically far from the DSOC nodes (I don't know whether the link to NMT is faster or not), but the UW Madison and AWS clusters are not effectively local to anything. It is important for jobs to run local to the data they need: it is cheaper to move the computation to the data than the data to the computation.

An important attribute of HTCondor is that the submit host and the execution host do not have to belong to the same datacenter or be on the same network. The Workspaces system relies on this fact: it assumes that a single workflow service running on a single submit host is enough to arrange for jobs to be executed with locality to ALMA or VLA data, or, more concretely, to submit jobs for execution within the DSOC or NAASC datacenter, or even in datacenters we don't yet possess.

Data Staging
------------

HTCondor
~~~~~~~~

A traditional supercomputer is composed of identical nodes. Because HTCondor doesn't require the nodes to resemble one another, it supports what is called a heterogeneous cluster. The main thing HTCondor does is make a job come to life in a directory containing the files named in its job description. A substantial subsystem in HTCondor handles this data staging, that is, moving files around for you, and a plugin architecture makes the process extensible. The result is that there are two approaches to staging data for your job:

  1. Put your data in the directory you submit from and list it in the job description; HTCondor will copy it to the execution host for you
  2. Your job can retrieve its own data when it starts

By default, there is no way to dictate which network devices are used, which prevents us from using HTCondor's built-in data staging for bulk data. However, the Scientific Computing Group has written a plugin, condor-transfer-plugin, which we use to transfer data efficiently over the correct network interfaces.
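
In the job description, the first approach is the ``transfer_input_files``
list; a URL in that list is handed to whichever transfer plugin claims its
scheme. A sketch (the file names are hypothetical, and the commented-out URL
scheme is a placeholder, not condor-transfer-plugin's actual one)::

   # Approach 1: HTCondor's built-in file transfer stages these files
   should_transfer_files   = YES
   when_to_transfer_output = ON_EXIT
   transfer_input_files    = parameters.json, antenna_positions.txt

   # Bulk data handled by a transfer plugin would appear as a URL instead, e.g.
   # transfer_input_files  = somescheme://path/to/bulk/dataset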

NGAS
~~~~

Most jobs use archived data as their starting point. We have a program called :doc:`productfetcher <../tools/productfetcher>` whose job is to copy files out of NGAS. NGAS has a direct-copy plugin we authored to speed retrieval from NGAS to Lustre. The direct-copy plugin uses InfiniBand and is thus about 10× faster than streaming the data out over 1 Gbit Ethernet. This introduces a dependency on the job being able to see the same Lustre file system as the NGAS cluster, which is handled by the aforementioned HasLustre label in the job description file.
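
Putting these pieces together, a job whose first step runs the productfetcher
might include a fragment like this sketch (the wrapper script is hypothetical,
and productfetcher's actual arguments are omitted)::

   # Direct copy requires seeing the same Lustre the NGAS cluster writes to
   requirements = (HasLustre == True)
   executable   = fetch_and_process.sh   # runs productfetcher, then the processing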

Follow-on requests
~~~~~~~~~~~~~~~~~~

A significant piece of functionality in the capability system is that it permits "follow-on requests." A follow-on request uses the result of another request. The exact stage of the earlier request's lifecycle doesn't matter, as long as its results have not been cleaned from disk. This introduces a second way for jobs to get their initial data besides the productfetcher/NGAS route: the earlier request's results are kept in some accessible location for later requests. Workflows that start from earlier requests shouldn't need to be structurally different from workflows that start with archived data.

This functionality is an important aspect of the Workspaces system, but it is currently unimplemented.

Workflows Plan
--------------

Cluster setup
~~~~~~~~~~~~~

The path forward here seems quite simple:

  • SCG sets up a node locally which pretends to be a NAASC/ALMA-local node
  • SSA conducts tests using that node
  • Meanwhile, SCG locates a node in the NAASC for further testing and sets it up
  • SCG migrates hosts shipman and hamilton to become submit hosts for the production HTCondor cluster

Science Data Flow
~~~~~~~~~~~~~~~~~

.. image:: images/how-workflows-run1.png

In the simple download case, data is retrieved from NGAS to Lustre, from which it is served to the user over HTTP.

For reprocessing, the most general case is one in which the data is retrieved to a foreign datacenter for processing and then copied back.

.. image:: images/how-workflows-run2.png

There is a lingering question here about how to fetch data from NGAS in a remote location. The streaming interface that NGAS offers uses the HTTP protocol and cannot benefit from InfiniBand. A better alternative would be to implement Globus/GridFTP as an NGAS plugin. An intermediate solution might be to have a local job stage the data from NGAS to Lustre before the cluster job executes.
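
The intermediate solution maps naturally onto a small DAG: a staging job pinned
to a Lustre-capable node runs first, and the processing job runs afterward
wherever it is matched. A hypothetical sketch, submitted with
``condor_submit_dag``::

   # stage.dag -- stage from NGAS to Lustre, then process
   JOB stage   stage.condor     # its submit file would carry requirements = (HasLustre == True)
   JOB process process.condor   # may match a remote execution host
   PARENT stage CHILD process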

To make reprocessing jobs relocatable, they must behave the same way no matter which datacenter they run in. When HTCondor jobs are scheduled inside the NRAO datacenter, however, their local storage may actually be on Lustre (depending on the node), and the data flow reduces to something much more like the first diagram:

.. image:: images/how-workflows-run3.png

This is because the data fetcher will prefer the local copy option when it is available, and rsync can be made to use a local cp when possible instead of its streaming network mechanism.