How Workflows Run in the DSOC or NAASC
======================================

Introduction
------------

Our workflows are executed by HTCondor, which is a kind of job
scheduler.

HTCondor does lots of magic behind the scenes, but the interface is
simple: you make a directory containing a job description file and
submit it. The machine you started from is the submit host. HTCondor
queues the job and eventually matches it with an execution host. The
directory is copied to the execution host and, according to the job
description, an executable is run there. When the job completes,
HTCondor copies the results back to the submit host. Throughout, a log
file is updated with everything that has occurred.
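
A minimal sketch of that interaction, using HTCondor's Python bindings
(assuming a reasonably recent version of the ``htcondor`` module). The
wrapper script and file names are placeholders, not anything that
exists today:

.. code-block:: python

   # Build a job description and hand it to the local schedd, which makes
   # this machine the submit host. All names here are illustrative.
   import htcondor

   description = htcondor.Submit({
       "executable": "run_casa.sh",        # hypothetical wrapper script
       "output": "condor.out",             # stdout of the job
       "error": "condor.err",              # stderr of the job
       "log": "condor.log",                # updated as the job progresses
   })

   schedd = htcondor.Schedd()              # the local submit-side daemon
   result = schedd.submit(description)     # queue one instance of the job
   print("submitted cluster", result.cluster())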

Data Locality
~~~~~~~~~~~~~

We have data in NGAS both at the DSOC and at the NAASC. The data at the
DSOC are VLA and VLBA; at the NAASC they are ALMA and GBT. There are
other datacenters besides these two, like NMT, and hypothetical
datacenters like UW Madison, AWS or XSEDE. These datacenters may or may
not have something like locality to various data: NMT nodes, for
example, are not physically far from the DSOC nodes (I don't know
whether the link to NMT is faster or not), but the UW Madison and AWS
clusters are not effectively local to anything. It's important for jobs
to run local to the data they need, because it's cheaper to move the
computation to the data than the data to the computation.

An important attribute of HTCondor is that the submit host and the
execution host do not have to belong to the same datacenter or be on
the same network. The Workspaces system relies on this fact: it assumes
that a single workflow service running on a single submit host is
enough to arrange for jobs to be executed with locality to ALMA or VLA
data, or, more concretely, to submit jobs for execution within the DSOC
or NAASC datacenters, or even datacenters we don't yet possess.
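
As a sketch of how one submit host could direct work to different
datacenters, the choice of site can be expressed as a requirements
expression on the job. The machine attribute ``site`` below is an
assumption for illustration, not a ClassAd our pools advertise today:

.. code-block:: python

   import htcondor

   def submit_to(site: str, executable: str) -> int:
       """Submit a job that may only match execute hosts at the given site."""
       job = htcondor.Submit({
           "executable": executable,
           "requirements": f'TARGET.site == "{site}"',  # hypothetical attribute
           "log": "condor.log",
       })
       return htcondor.Schedd().submit(job).cluster()

   # The same workflow service, on the same submit host, could do either:
   # submit_to("DSOC", "process_vla.sh")
   # submit_to("NAASC", "process_alma.sh")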

Data Staging
~~~~~~~~~~~~

HTCondor
^^^^^^^^

HTCondor creates a heterogeneous cluster, meaning the execute hosts in
the cluster don't have to resemble each other. HTCondor just makes the
job come to life in a directory containing the files mentioned in the
job description. So HTCondor is happy to move some files around for
you, and there is a plugin architecture to make this process
extensible. This means there are essentially two ways of staging data
for your job:

1. Put your data in the directory you submit from and inform HTCondor;
   HTCondor will copy it to the job for you

2. Your job can retrieve its own data when it starts

By default, there is no way of dictating which network interfaces are
used for these transfers. This prevents us from using HTCondor's data
staging for bulk data unless we write a plugin.
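
A sketch of option 1, using the Python bindings; the file names are
placeholders. Option 2 needs no submit-side incantation at all: the
executable itself retrieves the data as its first action (see the
datafetcher sketch in the next subsection), so only the small wrapper
crosses HTCondor's transfer channel.

.. code-block:: python

   import htcondor

   # Option 1: the data sits in the submit directory and HTCondor stages it
   # into the job's scratch directory on the execute host.
   option_one = htcondor.Submit({
       "executable": "process.sh",
       "transfer_input_files": "raw_data.tar, parameters.json",
       "should_transfer_files": "YES",
       "when_to_transfer_output": "ON_EXIT",
       "log": "condor.log",
   })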

NGAS
^^^^

Most jobs use archive data as their starting point. We have a program
called the datafetcher whose job is to copy files out of NGAS. NGAS has
a direct-copy plugin we authored to speed retrieval from NGAS to
Lustre. The direct-copy plugin uses InfiniBand and is thus about 10x
faster than streaming the data out over 1 Gbit Ethernet. It may,
however, introduce a dependency on the job being able to see the same
Lustre as the NGAS cluster.
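
A sketch of what that early staging step might look like inside a job.
The datafetcher's real command-line interface may differ; the flags and
the product locator are hypothetical placeholders.

.. code-block:: python

   import subprocess

   def stage_from_ngas(product_locator: str, destination: str) -> None:
       """Copy archive data out of NGAS into the job's working area."""
       subprocess.run(
           [
               "datafetcher",                          # copies files out of NGAS
               "--product-locator", product_locator,   # hypothetical flag
               "--output-dir", destination,            # hypothetical flag
           ],
           check=True,
       )

   # When the job can see the same Lustre as the NGAS cluster, the
   # direct-copy plugin can land the files over InfiniBand; otherwise the
   # data streams out over HTTP on the slower Ethernet path.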

Follow-on requests
~~~~~~~~~~~~~~~~~~

An important piece of functionality in the capability system is that it
permits "follow-on requests." A follow-on request uses the result of
another request. The exact stage of the earlier request's lifecycle
doesn't matter, as long as its results have not been cleaned from disk.
This introduces a second way for jobs to get their initial data besides
the datafetcher/NGAS: the earlier request's results are kept around in
some accessible location for later requests. Workflows that start from
earlier requests shouldn't need to be structurally different from
workflows that start with archived data.

This functionality is an important aspect of the Workspaces system, but
I do not think it is part of the minimum viable product of Workspaces
(that decision is actually up to `Mark Lacy <file:////display/~mlacy>`__
and `Jeff Kern <file:////display/~jkern>`__), nor do I know whether this
functionality is expected early or late in SRDP development.

Workflows Plan
--------------

Cluster setup
~~~~~~~~~~~~~

The path forward here seems quite simple:

-  SCG sets up a node locally that pretends to be a NAASC/ALMA-local
   node

-  SSA conducts tests using that node

-  Meanwhile, SCG locates a node in the NAASC for further testing and
   sets it up

-  SCG migrates hosts shipman and hamilton to become submit hosts for
   the production HTCondor cluster

.. _data-staging-1:

Data staging
~~~~~~~~~~~~

Part of this was agreed upon in the meeting, and part through a
post-meeting discussion.

1. Our jobs will **not** use HTCondor for bulk data transfer into the
   execution directory

2. Data transfer in is performed by the datafetcher in an early step of
   the job, for NGAS (archived) data

3. Therefore, jobs **must** be executed in the datacenter with
   appropriate locality for the data to be fetched

4. Our software will create submit directories on the submit host's
   Lustre

5. Jobs run wherever they run; HTCondor will copy results back to this
   location on Lustre. This will be a Lustre primitive if the execution
   host happens to be in the same cluster; if it isn't, then rsync is
   used (sketched just after this list)
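
Conceptually, the copy-back in item 5 amounts to something like the
following; in practice HTCondor (or a transfer plugin) performs it, and
the visibility test here is a deliberate simplification:

.. code-block:: python

   import os
   import shutil
   import subprocess

   def copy_back(results_dir: str, submit_dir: str, submit_host: str) -> None:
       """Return job results to the submit directory on Lustre."""
       if os.path.isdir(submit_dir):
           # The submit directory is visible from here, i.e. the execute and
           # submit hosts share the same Lustre: a plain local copy suffices.
           shutil.copytree(results_dir, submit_dir, dirs_exist_ok=True)
       else:
           # No shared filesystem (different datacenter): stream back with rsync.
           subprocess.run(
               ["rsync", "-a", results_dir + "/", f"{submit_host}:{submit_dir}/"],
               check=True,
           )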

This leaves two lingering questions:

1. How do we populate the job with data from an earlier request?

2. How do we create the Lustre directory in the NAASC, which is not
   visible to the submit host?

The solution to the first could be either to use HTCondor's transfer
mechanism (which may amount to a local copy on Lustre for
same-datacenter jobs) or to adapt the datafetcher. In the latter case,
we need a safe way of providing an rsync out of the datacenter for the
files on Lustre.
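
One possible shape for the first option is to point HTCondor's own
transfer mechanism at the earlier request's results on the submit
host's Lustre; the path below is a made-up placeholder:

.. code-block:: python

   import htcondor

   earlier_results = "/lustre/naasc/workspaces/requests/1234/results"  # hypothetical path

   followon = htcondor.Submit({
       "executable": "reprocess.sh",
       # HTCondor stages the directory in; for a same-datacenter execute
       # host this may reduce to a local copy on Lustre.
       "transfer_input_files": earlier_results,
       "should_transfer_files": "YES",
       "when_to_transfer_output": "ON_EXIT",
       "log": "condor.log",
   })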

The second is a little more interesting. The path should exist prior to
the job starting, so that the HTCondor copy-out step can be used to
return the results. But we want to have one submit host in the DSOC and
not have extra software that has to be deployed and running all the
time in the NAASC. Perhaps `James Robnett
<file:////display/~jrobnett>`__ has an idea about what to do here.

Science Data Flow
-----------------

.. image:: media/image1.png

In the simple download case, data is retrieved from NGAS to Lustre, from
which it is served to the user over HTTP.

For reprocessing, the most general case is one in which the data is
retrieved to a foreign datacenter for processing and then copied back.

.. image:: media/image2.png

There is a lingering question here about how to fetch data from NGAS at
a remote location. The streaming system that NGAS offers uses HTTP and
cannot benefit from InfiniBand. This option sucks, basically. A better
alternative would be implementing Globus/GridFTP as an NGAS plugin. An
intermediate solution might be having a local job stage the data from
NGAS to Lustre prior to executing the cluster job, as sketched below.
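
A sketch of that intermediate option: a small staging job pinned to the
NGAS-local site lands the data on that site's Lustre, and the
processing job then picks it up from there. The ``site`` attribute is
the same assumption as in the earlier sketch, the paths are
placeholders, and in practice the two jobs would be chained (for
example with DAGMan) rather than submitted by hand:

.. code-block:: python

   import htcondor

   staging = htcondor.Submit({
       "executable": "stage_from_ngas.sh",       # e.g. runs the datafetcher locally
       "requirements": 'TARGET.site == "DSOC"',  # hypothetical machine attribute
       "log": "staging.log",
   })

   processing = htcondor.Submit({
       "executable": "reprocess.sh",
       # The job reads (or rsyncs) its input from the staged location itself,
       # in keeping with the decision not to use HTCondor for bulk transfer.
       "arguments": "/lustre/aoc/staged/request-1234",  # placeholder path
       "log": "processing.log",
   })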

To make reprocessing jobs relocatable, they must always behave the same
way regardless of which datacenter they run in; but when HTCondor jobs
are scheduled inside the NRAO datacenter, their local storage will
actually be on Lustre, and the data flow reduces to something much more
similar to the first diagram:

.. image:: media/image3.png

This is because the datafetcher will preferentially use the local-copy
option when it is available, and rsync can be made to use a local cp
where possible, instead of its streaming network mechanism.