************
Known Issues
************

Bugs
====

System as a Whole
-----------------

- When starting the system, you must wait for the workflow container to finish its startup script
  (which creates all the PEXes); otherwise it will brick itself and you'll have to restart it. There
  may be a way to hold off bringing up the server until all startup scripts finish, or some other
  way to solve this.

Messaging System
----------------

- Occasionally, the workflow-complete message does not trigger the delivery message for a request

Capability System
-----------------

- Executions that are queued beyond the concurrency limit seem to be lost and never executed,
  possibly because engines do not look for new executions once they are freed

Gripes
======

Docker
------

- Sidecar for visibility into what containers are running; docker logs (can maybe use the
  Prometheus built into GitLab)

  - `WS-425 <https://open-jira.nrao.edu/browse/WS-425>`__

Condor
------

- Get the data copy plugin from SCG → repo

  - `WS-415 <https://open-jira.nrao.edu/browse/WS-415>`__

- Update the wf_monitor to recognize other Condor status codes

  - `WS-413 <https://open-jira.nrao.edu/browse/WS-413>`__

Docs
----

- Setup for development page → update for docker containers
- Move that info into the installation page
- Update the README.md files to say something about what they're attached to
- Integrate the README.md files into the docs, maybe the API docs themselves

  - `WS-428 <https://open-jira.nrao.edu/browse/WS-428>`__

Testing
-------

- Audit code for missing tests, irrelevant tests
- See if we can make coverage combination less finicky
- Optimize run-test.sh to not run redundant tests
- Fix the end-to-end tests that Nathan disabled because they are hard-coded for the redirect to the
  request page
- Add schema migration to CI

  - `WS-435 <https://open-jira.nrao.edu/browse/WS-435>`__

Database
--------

- Generate an archive "core sample"

  - Copy of the archive database schema
  - Data from ~10 small projects

- Consider moving from the json to the jsonb datatype

  - `WS-441 <https://open-jira.nrao.edu/browse/WS-441>`__

Pipeline
--------

- Update the end-to-end test container to see how detailed we can be
- Prevent the ``cleanup`` stage from deleting tagged images when multiple pipelines are running;
  this issue causes the ``push`` stage to fail

Code Tweaks
-----------

- ``wf_monitor``: Recognize more HTCondor event codes and support them within the system

  .. code-block:: python

     # Enum example by Daniel
     from enum import Enum


     class HTCondorEvent(Enum):
         # Each member's value tuple is unpacked into __init__
         SUBMITTED = (0, 'submitted', False)
         EXECUTING = (1, 'executing', False)
         # ... (remaining event codes elided)
         TERMINATED = (5, 'terminated', True)

         def __init__(self, code: int, meaning: str, terminator: bool):
             self.code, self.meaning, self.terminator = code, meaning, terminator

         @classmethod
         def from_code(cls, code: int) -> 'HTCondorEvent':
             # Look a member up by its numeric HTCondor event code
             return next(event for event in cls if event.code == code)

     # then decoding it looks like HTCondorEvent.from_code(code) and you can ask
     # questions like: if HTCondorEvent.from_code(code).terminator: ...

- Hardcoded 48 GB of RAM in the calibration template; needs to use a Capo profile

  - See if Mustache can access Capo properties without much extra work
  - `WS-447 <https://open-jira.nrao.edu/browse/WS-447>`__

- ``metadata.json``: Rename fields so that they accurately describe the values they hold

Dependencies and Overall Structure
----------------------------------

- Move implementations of services out of shared/workspaces and into the relevant services
- Separate the interfaces for each service into separate packages so that we can be sure that the
  workflow service doesn't even have access to the capability interfaces
- Can REST API implementations of these interfaces be created?

  - If so, can those REST API implementations become dependencies of e.g. the capability service?
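
As a rough illustration of the REST API question above, the sketch below shows one way an interface
and a REST-backed implementation of it could be kept apart. Every name in it
(``CapabilityInfoIF``, ``RestCapabilityInfo``, the ``/capability/<name>`` route, the ``max_jobs``
field, and the example host) is hypothetical and not taken from the Workspaces code base. The HTTP
transport is injected so the example runs without a server.

.. code-block:: python

   import json
   from typing import Callable, Protocol
   from urllib.parse import quote
   from urllib.request import urlopen


   class CapabilityInfoIF(Protocol):
       """Hypothetical interface; a stand-in for one that lives in shared/workspaces."""

       def lookup_capability(self, name: str) -> dict: ...


   # A transport takes a URL and returns the response body as text
   Transport = Callable[[str], str]


   def urllib_transport(url: str) -> str:
       """Default transport: a plain HTTP GET using the standard library."""
       with urlopen(url) as response:
           return response.read().decode("utf-8")


   class RestCapabilityInfo:
       """REST-backed implementation of the hypothetical interface."""

       def __init__(self, base_url: str, transport: Transport = urllib_transport):
           self.base_url = base_url.rstrip("/")
           self.transport = transport

       def lookup_capability(self, name: str) -> dict:
           # The /capability/<name> route is an assumption, not the real API
           url = f"{self.base_url}/capability/{quote(name)}"
           return json.loads(self.transport(url))


   def describe(info: CapabilityInfoIF, name: str) -> str:
       """Client code depends only on the interface, not on an implementation."""
       capability = info.lookup_capability(name)
       return f"{capability['name']} (max jobs: {capability['max_jobs']})"


   def fake_transport(url: str) -> str:
       """Stand-in for a server: echoes the last URL segment back as the name."""
       return json.dumps({"name": url.rsplit("/", 1)[-1], "max_jobs": 2})


   if __name__ == "__main__":
       client = RestCapabilityInfo("http://capability.example/", fake_transport)
       print(describe(client, "std_calibration"))

With this split, a service such as the workflow service would only need the interface package and
the REST client, and would never import the capability service's own implementation.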