Commit bd5183de authored by Nathan Hertz
Planned out design. Started laying out class/method structure.

parent cd7af338
# Workflow Monitor

## Design
- Reads in HTCondor logs
  - HTCondor executes a job
    - `condor_submit` is executed
    - A log file is created and written to a known location
  - Meanwhile, wf_monitor checks that location for the appearance of a log, sleeping between attempts
  - Once the log file is found, its contents are read in
- Parses the log and converts it into an AMQP event (format to be determined)
  - The log file contents are parsed and relevant details are extracted (which details are relevant is TBD)
  - Those details are organized into a data structure
  - An AMQPEvent is then constructed from that data structure
- The AMQPEvent is handed off to events/pymygdala/some currently nonexistent service (a sketch of this flow follows)
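
A minimal sketch of the intended polling step, assuming a simple sleep-and-retry loop (the helper name `watch_for_log` is illustrative, not part of the module yet):

```python
import time
from pathlib import Path

def watch_for_log(logfile_path: str, timeout: int = 60, interval: int = 5) -> str:
    """Poll for the HTCondor log, sleeping between attempts; give up after timeout."""
    deadline = time.time() + timeout
    log = Path(logfile_path)
    while time.time() < deadline:
        if log.exists():
            return log.read_text()
        time.sleep(interval)
    raise TimeoutError(f'No log appeared at {logfile_path} within {timeout}s')
```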
000 (3983.000.000) 08/26 11:06:06 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&noUDP&sock=4050180_7e37_3>
    DAG Node: grep
...
001 (3983.000.000) 08/26 11:06:25 Job executing on host: <10.64.1.171:9618?addrs=10.64.1.171-9618&alias=testpost001.aoc.nrao.edu&noUDP&sock=startd_15099_3c91>
...
006 (3983.000.000) 08/26 11:06:25 Image size of job updated: 72
    1  -  MemoryUsage of job (MB)
    72  -  ResidentSetSize of job (KB)
...
005 (3983.000.000) 08/26 11:06:25 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    374    -  Run Bytes Sent By Job
    26855  -  Run Bytes Received By Job
    374    -  Total Bytes Sent By Job
    26855  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :        0        1         1
       Disk (KB)            :       41       30 204736378
       Memory (MB)          :        1        1       128
...
000 (3984.000.000) 08/26 11:06:31 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&noUDP&sock=4050180_7e37_3>
    DAG Node: uniq
...
001 (3984.000.000) 08/26 11:06:31 Job executing on host: <10.64.1.171:9618?addrs=10.64.1.171-9618&alias=testpost001.aoc.nrao.edu&noUDP&sock=startd_15099_3c91>
...
006 (3984.000.000) 08/26 11:06:31 Image size of job updated: 432
    1  -  MemoryUsage of job (MB)
    432  -  ResidentSetSize of job (KB)
...
005 (3984.000.000) 08/26 11:06:32 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :        2        2 204736378
       Memory (MB)          :        1        1       128
...
#!/usr/bin/python
# -*- coding: utf-8 -*-
from pathlib import Path
from setuptools import setup

VERSION = open('src/wf_monitor/_version.py').readlines()[-1].split()[-1].strip("\"'")
README = Path('README.md').read_text()

# requires = [
# ]
tests_require = [
    'pytest>=5.4,<6.0'
]
setup(
    name=Path().absolute().name,
    version=VERSION,
    description='Workflow monitor that reads in HTCondor logs and translates them into AMQP events',
    long_description=README,
    author='NRAO SSA Team',
    author_email='dms-ssa@nrao.edu',
    url='TBD',
    license='GPL',
    # install_requires=requires,
    tests_require=tests_require,
    keywords=[],
    packages=['wf_monitor'],
    package_dir={'': 'src'},
    classifiers=[
        'Programming Language :: Python :: 3.8'
    ],
    # Console-script entry points are TBD; an empty-string entry here would break installation
    entry_points={
        'console_scripts': []
    },
)
""" Version information for this package, don't put anything else here. """
__version__ = '4.0.0a1.dev1'
import json
import time
from pathlib import Path


class WorkflowEvent:
    def __init__(self, logfile_path: str):
        self.log = self.read_htcondor_log(Path(logfile_path))
        self.job_metadata = self.parse_log()
        self.job_json = {}
        for job_metadatum in self.job_metadata:
            self.job_json[job_metadatum['name']] = self.job_to_amqp_event(job_metadatum)

    def read_htcondor_log(self, logfile_path: Path, timeout: int = 60, interval: int = 5) -> str:
        """
        Wait for the HTCondor log file to appear, sleeping between attempts,
        then read in its contents. Gives up once the timeout elapses.
        :param logfile_path: Path where the HTCondor log is expected to appear
        :param timeout: Seconds to wait before giving up
        :param interval: Seconds to sleep between attempts
        :return: Contents of the log file
        """
        start_time = time.time()
        while time.time() - start_time < timeout:
            if logfile_path.exists():
                return logfile_path.read_text()
            time.sleep(interval)
        raise TimeoutError(f'Log file {logfile_path} did not appear within {timeout} seconds')

    def parse_log(self) -> list:
        """
        Parse log for relevant details:
        - For each job:
            - Job name
            - If/when the job executed
            - If/when/why the job terminated
        :return: List of important-metadata dictionaries, one per job
        """
        pass

    def job_to_amqp_event(self, job_metadatum: dict):
        """
        Converts job metadata dictionary into a JSON string and ...
        :param job_metadatum: Job metadata
        :return: JSON string of job metadata
        """
        return json.dumps(job_metadatum)  # ?
import signal


# Register with: signal.signal(signal.SIGALRM, timeout_handler)
def timeout_handler(signum, frame):
    """Handler for SIGALRM: raise so a pending wait aborts with a clear error."""
    raise TimeoutError(f'Timed out (received signal {signum})')
.idea
__pycache__
*.egg-info
*.iml
.eggs
.DS_Store
architecture/*.html
*~
# Workspace Prototype 0
This package is a prototype of the workspace system. It is intended to:
- Demonstrate unit, architectural and integration tests
- Demonstrate an end-to-end flow of information from a capability request through
execution on the HTCondor cluster
- Answer questions about the design
It is not intended to be:
- Feature complete
- Easy to use
- Beautiful
- Useful
In fact, steps have been taken to ensure that this will not be useful in the long term.
## Building
Make sure you have Conda installed.
1. `conda create -n wksp0 python=3`
2. `conda activate wksp0`
3. `python setup.py develop`
## Running Workflows
Make sure you have the software built, and be on the machine `testpost-master`.
1. `run_workflow grep-uniq '{"search":"username"}' /home/casa/capo/nmtest.properties`
The workflow will execute, but it is not currently smart enough to know when the workflow is complete.
### HTCondor notes
#### Running jobs
##### Transferring files
1. `transfer_input_files = ...` does not cause files to be transferred unless `should_transfer_files = YES` or `should_transfer_files = IF_NEEDED` is set.
2. If `should_transfer_files = [YES|IF_NEEDED]`, your `executable = ...` will also get transferred. If the OS does not match, this will lead to interesting problems.
As a result of this discovery, it seems wise to *always* supply a shell script as your executable (to avoid platform issues); see the sketch below.
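
A minimal submit description illustrating these notes (all file names here are hypothetical):

```
# grep.sub -- hypothetical submit description
universe                = vanilla
executable              = run_grep.sh   # a shell wrapper, to sidestep platform issues
should_transfer_files   = YES           # without this, transfer_input_files is ignored
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.txt
output                  = grep.out
error                   = grep.err
log                     = condor.log    # on a network filesystem, not /tmp
queue
```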
##### Logging
HTCondor will not write a log file to /tmp. I'm not entirely sure why this is, but for now it seems prudent to put your log files on a network filesystem.
#### Running DAGs
One significant good thing here: mentioning a file from `output = ...` or `transfer_output_files = ...` in the
`transfer_input_files = ...` of a subsequent job (a `CHILD` in the DAG) appears to work correctly.
##### Logging
`condor_submit_dag` always writes a workflow logfile to the supplied dag filename + `.dagman.log`. This is
in the same format as the regular HTCondor log files and can be parsed the same way. The entire workflow should
create the following events: `SUBMIT`, `EXECUTE`, `JOB_TERMINATED`. Between `EXECUTE` and `JOB_TERMINATED`,
events will appear in the job log files.
It's totally OK to have all the jobs write events to the same `condor.log`; this is the way I have it set
up currently. Each job produces a sequence that looks similar to the workflow's, but appears to include
an extra `IMAGE_SIZE` (`IS`) event for some reason. Either way, we get a flow of events that looks something like this:
┌────────────────┬─────────────────────────────────────────┐
│ time │ -1-2-3-4-5-6-7-8-9-0-1-2-3-4-5-6-7-8--> │
├────────────────┼─────────────────────────────────────────┤
│ foo.dagman.log │ S E JT │
│ condor.log │ S E IS JT ... S E IS JT │
└────────────────┴─────────────────────────────────────────┘
I suspect handling this in Python is going to be a little stupid, probably involving two threads
sending events to something which is acting as a generator.
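
A rough sketch of that shape, assuming line-oriented tailing and hypothetical `parse_event`/`handle` helpers:

```python
import threading
import queue
import time

def tail(path, events):
    """Follow a log file, pushing new lines onto the shared queue."""
    with open(path) as f:
        while True:
            line = f.readline()
            if line:
                events.put((path, line.rstrip()))
            else:
                time.sleep(1)

def event_stream(paths):
    """Merge events from several logs into one generator."""
    events = queue.Queue()
    for path in paths:
        threading.Thread(target=tail, args=(path, events), daemon=True).start()
    while True:
        yield events.get()

# for source, line in event_stream(['foo.dagman.log', 'condor.log']):
#     handle(parse_event(line))  # parse_event/handle are hypothetical
```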
## Testing
1. `conda activate wksp0`
2. `python setup.py test`
## Testing TODOs
- ☑ Unit tests
- ☑ Architectural tests
- ☑ Integration tests
### Architectural Tests
The main idea here is to have unit tests that prevent, or at least loudly call
attention to, changes to the design and architecture of the system. They do not
have to be perfect; they just have to plausibly prevent new interface methods
from appearing and violating the separation of concerns.
At the moment, we have one in `tests/architectural/test_ifaces.py` which shows
how to ensure that an interface has exactly two methods on it, with the
requisite arguments.
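
A sketch of that kind of test, using `inspect` against a hypothetical `CapabilityService` interface (the real interface and its methods may differ):

```python
import inspect

class CapabilityService:  # stand-in for the real interface
    def send_request(self, request): ...
    def cancel_request(self, request_id): ...

def test_capability_service_interface_is_frozen():
    # Loudly fail if methods are added, removed, or re-signatured
    methods = {
        name: list(inspect.signature(member).parameters)
        for name, member in inspect.getmembers(CapabilityService, inspect.isfunction)
    }
    assert methods == {
        'send_request': ['self', 'request'],
        'cancel_request': ['self', 'request_id'],
    }
```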
## End-to-end flow
## Design questions
Building the prototype has revealed a few more questions about the design.
1. What is the interface between the capability engine and the rest of the system?
2. How do supplied product locators make it into the capability steps? Or do they?
3. Are living threads needed for capability execution, or can we find an event-driven solution that won't require
them?
It seems as though in the prototype we need a thread to execute the capability and another thread to catch events
and walk the capability step through its own state model. This is the kind of tricky thing that would probably
be helpful for developers to see spelled out in a prototype, but it also seemed likely to become a deep rabbit hole, so
I sort of dodged the question for the prototype.
one sig CapabilityService {}
sig CapabilityEngine {}
sig CapabilityStep {}
sig CapabilityRequest {}
one sig WorkflowService {}
one sig HTCondor {}
sig DagmanWorkflow {}
#+TITLE: Response to the Critical Design Review Charge
#+AUTHOR: Daniel K Lyons <dlyons@nrao.edu>
#+DATE: 2020-01-23
* Introduction
Quoting from [[https://open-confluence.nrao.edu/display/Arch/Workspaces+Critical+Design+Review][Workspaces Critical Design Review]]:
#+BEGIN_QUOTE
The panel is charged with assessing the readiness of the SSA Workspace system to begin implementation, in particular:
- Are the L1 requirements traceable to the conceptual requirements (L0)? Are L2 requirements appropriately derived from the L1s? Are there any significant gaps in the requirements?
- Does the architecture presented satisfy the requirements? Is it appropriate for the task? Are architectural choices clearly identified and motivated?
- Does the implementation team clearly understand the work to be done? Are the detailed tasks clearly defined and estimated?
- Is the implementation and integration plan sufficiently detailed and realistic?
- Is the planned testing, including unit and integration, sufficient? Is the framework for executing those tests already implemented, or planned for implementation on a realistic and suitable time frame?
#+END_QUOTE
#+TITLE: Workspace Architecture: Future Directions
#+AUTHOR: Daniel K Lyons and the SSA Team
#+DATE: 2019-12-09
#+SETUPFILE: https://fniessen.github.io/org-html-themes/setup/theme-readtheorg.setup
#+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="extra.css" />
#+OPTIONS: H:5 ':t
* Introduction
This document is intended to keep track of several things that don't quite fit with the other architecture documentation:
- Things that sound like requirements, but weren't
- Requirements that were missed between architecting and preparing for the CDR
- Hard, wall-like objects that we appear to be rushing towards with this architecture
* Missed Requirements
At this time, there are no requirements that are known to have been missed.
* Areas Needing Additional Design
** Reingestion
In the course of creating the work breakdown, it has become clear that there is a need for a deeper understanding of reingestion. There is a technical problem here, in that the product locator system prevents us from safely implementing reingestion as a simple delete followed by an ingestion. There are several different semantics that need to be understood here; perhaps an earlier ingestion was corrupted or incomplete somehow, or perhaps we are really ingesting an improved version of something but want to retain the old one for some reason. The requirements here need to be sussed out in more depth before reingestion can really be designed and implemented, and for this reason this section is a bit vague.
** Build and deployment
As the SSA group is in the midst of reviewing our core processes in advance of implementing workspaces, we see a need to treat the build system and deployment system for the workspace as component-level work deserving of the same kind of attention as the code itself. Testing has been a major concern for our planning since early in the process, but the emphasis on build and deployment is new. A certain amount of research is expected to be needed to figure out the right process here. The late-breaking development here is simply raising the importance of these issues, with the expectation that more details will be coming soon.
** Parameter Validation
At the moment there is no parameter validation phase in the system. The architectural assumption here is that the UI will do its own validation and prevent users from doing nonsensical things. The UI is likely to generate its own parameter validation service, but since it isn't a first-class architectural entity at the moment, it won't be available for these systems to utilize. It seems likely that we will want to promote this to a first-class entity so that it can be used by the capability service itself to validate requests as they come in, even from other archive and workspace systems.
A future feature that might motivate more design here would be parameters whose values are influenced by the chosen products themselves. Some CASA parameters have sensible values that vary with different data files, for instance. There's nothing in the current design to locate these values or validate them, and that is something we will probably be asked to revisit in the future.
* Non-Requirements
** Capability typing
I find it useful to think of the type signatures of workflows and capabilities as if they were ordinary programming-language artifacts. In this regime, capabilities are clearly functions from products to products, and workflows are clearly procedures. In terms of Java, you could see workflows and capabilities as having types like:
#+BEGIN_SRC java
void workflow();
Product capability(Product input);
#+END_SRC
This leaves some work for the future:
- How do we handle checking the types of capability inputs?
You can't image a calibration table or generate a calibration table from an image, for instance.
- How do we verify the type of the object input and other inputs?
- How do we handle multiple products, such as for downloads? There's only one product slot.
- How do we handle capabilities that need more than one product, each with different semantics?
For instance, calibrate this raw data with this calibration table?
** CASA version-specific UI
It's true that different configurations for different CASA versions are possible within one capability. In principle, however, there is nothing in the system to modify the UI depending on the CASA version.
In practice, the UI parameter components are so completely independent that you could put conditional logic in them based on the chosen CASA version, as long as the receiving side is able to handle it. So if you only put a key/value pair into the parameters when the CASA version is X, you'd better only use that key inside the override template for CASA version X. This may not be super fun to debug, though, so it may be better to pretend you cannot do this.
** Self-healing
This is not currently in scope. Ingestion is a workflow, because it does not begin with a product. Reingestion, however, can be a capability, because it begins with a product and ends with a product (the same product), so self-healing via reingestion could enter the system there.
** Custom triggering
We realized fairly late in the design process that we are assuming a straightforward replacement of some hard-coded rules in the existing archive rules engine (amygdala) with other hard-coded rules in the same location that address the workspace system instead. Handling this properly, with some flexibility, would be a good idea, but there did not seem to be a motivating requirement.
** Auto-follow-ons
What if I pick raw data and want an image? In the current design, I have to set up the calibration or restore of that raw data, then I can send a follow-on request for an image from that calibrated MS. There is no way to do both of these in a single go.
I anticipate that implementing this feature on top of the current design should not be super hard, but it isn't in scope at this time.
** Pre-emption
In this design, time-critical projects are flagged as such, and a time-critical standard calibration will always be chosen to run before a non-time-critical standard calibration. This is because the capability queues are priority queues.
There is no pre-emption in this design. This means that the arrival of a time-critical calibration will not cause a running non-time-critical calibration to be stopped or cancelled. The time-critical calibration still has to wait for whatever processing is currently running to finish. It should be the next thing executed though—unless there are more than N time-critical calibrations ahead of it in the queue, where N is the concurrency limit for this capability.
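
A toy illustration of that queue behavior (names and priority values are made up; nothing here is the actual implementation):

#+BEGIN_SRC python
import heapq

# The capability queue: entries are (priority, arrival order, request);
# 0 = time-critical, 1 = normal, and ties fall back to arrival order.
queue = []
for order, (priority, request) in enumerate([
        (1, 'std calibration A'),
        (0, 'time-critical calibration B'),
        (1, 'std calibration C')]):
    heapq.heappush(queue, (priority, order, request))

# A running job is never pre-empted; the queue only decides what launches next.
while queue:
    _, _, request = heapq.heappop(queue)
    print('launching', request)  # B, then A, then C
#+END_SRC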
** Cross-queue priorities
This design does not address priorities across different capabilities. There is no way, for instance, to specify that standard calibrations should be run preferentially over AUDI requests. There simply isn't anything above the capability queues to make decisions like this; each queue will happily launch up to its concurrency limit of workflows.
However, because our wrapping for HTCondor is very flexible, we can probably fake this effect with HTCondor even though it isn't surfaced in the architecture. In the definition of the workflow templates for HTCondor, we can add labels and conditions which HTCondor can be configured to use to create the effect of cross-queue priorities.
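
For instance (a sketch only; the attribute name is made up, and nothing like this is surfaced in the architecture), the submit templates could label jobs and lean on HTCondor's own priority and concurrency-limit machinery:

#+BEGIN_SRC conf
# In the workflow template for a standard calibration job:
+CapabilityName    = "std_calibration"    # custom label HTCondor policy can match on
priority           = 10                   # rank above, e.g., AUDI jobs
concurrency_limits = std_calibration:1    # each job draws one unit from a named pool
#+END_SRC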
#+TITLE: Workspaces: Introduction for the CDR Panel
#+AUTHOR: Daniel K Lyons
#+DATE: 2020-01-30
#+SETUPFILE: https://fniessen.github.io/org-html-themes/setup/theme-readtheorg.setup
#+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="extra.css" />
#+OPTIONS: H:5 ':t
* Intro to the Panel
Welcome to the panel for the critical design review of the SRDP Workspace System. The following documentation should be useful to you for fulfilling your charge:
1. L0->L1 requirement traceability is visible in Cameo, exported into [[./L0-L1-mapping.pdf][this PDF document]].
2. L1->L2 requirement traceability is also visible in Cameo, exported into [[./L1-L2-mapping.pdf][this PDF document]].
3. The architecture is documented in the [[./Overview.org][Overview]] document
4. L2 requirements are expected to take the form of JIRA tasks derived from the tasks listed in the [[https://open-confluence.nrao.edu/display/WSCDR/DRAFT%3A+Workspaces+System+Implementation+Planning][Workspaces System Implementation Planning]] document.
5. Known gaps in the requirements are documented in the [[./Futures.org][Future Directions]] document.
6. The relationship between the requirements and the architecture is presented in the [[file:Overview.org::*Requirement%20Satisfaction][Requirement Satisfaction]] section of the Overview.
7. The relationship between the requirements and the implementation plan is presented in the [[https://open-confluence.nrao.edu/display/WSCDR/DRAFT%3A+Workspaces+System+Implementation+Planning][Requirements Gap Analysis]] section of the implementation planning document.
8. The architect asserts that this architecture is the simplest that accounts for the requirements provided and inferred.
9. The architectural decisions and their rationale are documented in the [[./Design-Iterations.org][Design Iterations]] document.
10. The implementation team as a whole understands the work to be done. Each developer understands their role in the system.
11. [[./Overview.org::*Testing Plan][The testing plan]] is part of the architecture overview.
#+TITLE: Workspace Architecture: Overdesigned
#+AUTHOR: Daniel K Lyons
#+DATE: 2019-12-09
#+SETUPFILE: https://fniessen.github.io/org-html-themes/setup/theme-readtheorg.setup
#+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="extra.css" />
* Introduction
This file is a eulogy for an overdesigned alternate system for managing capabilities.
** Capabilities
A capability forms an arrow. Capabilities are well-typed and composable. The type system is based on the input and output product types.
The capability sequence is the implementation of the capability. It is an arrow, composed of constituent arrows like require-parameter, require-product, and run-workflow. Workflow executions have input and output types as well.
The capability arrow is compiled to a state machine.
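
In Python terms, the idea was roughly the following (a sketch of the abandoned design, not of anything implemented):

#+BEGIN_SRC python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Arrow:
    """A typed capability step: consumes one product type, produces another."""
    in_type: type
    out_type: type
    run: Callable

    def then(self, other: 'Arrow') -> 'Arrow':
        # Composition is only legal when the product types line up
        assert self.out_type is other.in_type
        return Arrow(self.in_type, other.out_type,
                     lambda product: other.run(self.run(product)))
#+END_SRC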