Export workflow metrics

Daniel Lyons requested to merge export-workflow-metrics into 2.7-DEVELOPMENT

This MR adds the following three metrics.

First, some general metrics about route performance and call counts, which look like this:

# TYPE http_request_timing summary
http_request_timing_count{route_name="get_healthcheck",status_code="200"} 41.0
http_request_timing_sum{route_name="get_healthcheck",status_code="200"} 0.10483574867248535
http_request_timing_count{route_name="create_workflow_request",status_code="200"} 1.0
http_request_timing_sum{route_name="create_workflow_request",status_code="200"} 0.0783846378326416
http_request_timing_count{route_name="submit_workflow_request",status_code="200"} 1.0
http_request_timing_sum{route_name="submit_workflow_request",status_code="200"} 0.1897449493408203
http_request_timing_count{route_name="create_and_submit_workflow_request",status_code="200"} 5.0
http_request_timing_sum{route_name="create_and_submit_workflow_request",status_code="200"} 0.5307431221008301
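For anyone reading these numbers by hand: a summary exports a `_count` and a `_sum` per label set, so the mean time per request is simply sum divided by count. Here is a minimal sketch of that arithmetic in plain Python, using two rows copied from the sample above. The `mean_latency` helper and the `samples` dict layout are hypothetical, for illustration only; they are not part of this MR.

```python
def mean_latency(samples):
    """Return mean seconds per request for each (route_name, status_code)."""
    means = {}
    for labels, (count, total) in samples.items():
        if count > 0:  # skip routes with no calls yet to avoid dividing by zero
            means[labels] = total / count
    return means


# (count, sum) pairs copied from the sample output above.
samples = {
    ("get_healthcheck", "200"): (41.0, 0.10483574867248535),
    ("create_and_submit_workflow_request", "200"): (5.0, 0.5307431221008301),
}

for (route, status), seconds in mean_latency(samples).items():
    print(f"{route} [{status}]: {seconds * 1000:.1f} ms")
```

With these values the healthcheck averages about 2.6 ms per call and create_and_submit_workflow_request about 106 ms.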

In the Capability service, a new metric is exported about the state of the capability queues, which looks like this:

capability_queue_total{capability="restore_cms",status="executing"} 0.0
capability_queue_total{capability="std_restore_imaging",status="waiting"} 0.0
capability_queue_total{capability="std_restore_imaging",status="executing"} 0.0
capability_queue_total{capability="null_dag",status="waiting"} 0.0
capability_queue_total{capability="null_dag",status="executing"} 0.0
capability_queue_total{capability="std_calibration",status="waiting"} 0.0
capability_queue_total{capability="std_calibration",status="executing"} 2.0
capability_queue_total{capability="std_cms_imaging",status="waiting"} 0.0
capability_queue_total{capability="std_cms_imaging",status="executing"} 0.0
capability_queue_total{capability="test_download",status="waiting"} 0.0
capability_queue_total{capability="test_download",status="executing"} 0.0
capability_queue_total{capability="null",status="waiting"} 0.0
capability_queue_total{capability="null",status="executing"} 2.0

Prometheus represents every sample value as a 64-bit float, which is why these counts show up as 0.0 and 2.0 rather than as integers. Exporting the queue state this way gives us a fresh copy of this report on every reporting interval, which could be useful for debugging.
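As a sketch of the debugging use, the report can be filtered down to the queues that actually have something in them. This assumes the text format shown above with exactly the `capability` and `status` labels in that order; `busy_queues` and the `LINE` pattern are hypothetical helpers, not code from this MR.

```python
import re

# Matches lines such as:
#   capability_queue_total{capability="null",status="executing"} 2.0
LINE = re.compile(
    r'^capability_queue_total\{capability="(?P<capability>[^"]*)",'
    r'status="(?P<status>[^"]*)"\}\s+(?P<value>\S+)$'
)


def busy_queues(text):
    """Return {(capability, status): count} for every non-empty queue."""
    busy = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match and float(match["value"]) > 0:
            # Values arrive as floats (2.0); convert back to whole counts.
            busy[(match["capability"], match["status"])] = int(float(match["value"]))
    return busy


# Three rows copied from the sample report above.
report = '''\
capability_queue_total{capability="std_calibration",status="waiting"} 0.0
capability_queue_total{capability="std_calibration",status="executing"} 2.0
capability_queue_total{capability="null",status="executing"} 2.0
'''
print(busy_queues(report))
```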

Finally, in the Workflow service, we have the number of running wf_monitors:

# TYPE wf_monitors_running gauge
wf_monitors_running 0.0

This was obtained by wrapping the Popen call to wf_monitor and dispatching a thread that waits on each process and decrements the number when it exits. I have watched this work with 5 concurrent workflow requests, so it seems to be legit.
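The pattern described above can be sketched roughly as follows, using only the standard library. This is not the code in this MR: `RunningCount` stands in for the real gauge, `launch_monitored` is a hypothetical wrapper name, and the command in the usage example is a placeholder for the actual wf_monitor invocation.

```python
import subprocess
import sys
import threading


class RunningCount:
    """Thread-safe counter standing in for the wf_monitors_running gauge."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def inc(self):
        with self._lock:
            self._value += 1

    def dec(self):
        with self._lock:
            self._value -= 1

    @property
    def value(self):
        with self._lock:
            return self._value


def launch_monitored(args, running):
    """Start a subprocess, count it, and decrement the count when it exits."""
    proc = subprocess.Popen(args)
    running.inc()  # only reached if Popen succeeded

    def reap():
        proc.wait()  # blocks this helper thread, not the caller
        running.dec()

    waiter = threading.Thread(target=reap, daemon=True)
    waiter.start()
    return proc, waiter


if __name__ == "__main__":
    running = RunningCount()
    # Placeholder command; the real call would launch wf_monitor.
    proc, waiter = launch_monitored([sys.executable, "-c", "pass"], running)
    waiter.join()
    print(running.value)
```

One thread per process is cheap at this scale, and it keeps the caller from blocking on the subprocess.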
