Task Management Service
EWMS’s Task Management Service (TMS): The HTCondor Interface
- Keywords
- WIPAC · IceCube · Observation Management Service · Event Workflow Management System · EWMS · task · Task Management Service
The TMS is the central component responsible for communication between the Workflow Management Service (WMS) and an HTCondor pool. It runs on an HTCondor Access Point (AP). This service:
- Starts HTCondor clusters for new taskforces (1:1); see taskforce.
- Stops HTCondor clusters (condor_rm) when necessary.
- Watches HTCondor clusters, snapshots taskforce-level stats, and relays information to the WMS.
Overview
In short, the TMS receives its instructions from the Workflow Management Service (WMS).
Starting and Stopping Taskforces/Clusters
Internally, the service makes routine calls to the WMS to determine whether to start or stop clusters for specific taskforces.
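The start/stop decision described above can be pictured as a small planning step run on each polling pass. This is a minimal sketch, not the TMS's actual code: the Taskforce fields, the phase strings ("pending-start", "pending-stop", "running"), and the plan_actions helper are all hypothetical names introduced for illustration. Only condor_rm with a cluster id is taken from HTCondor itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Taskforce:
    """Hypothetical, trimmed-down view of a taskforce as returned by the WMS."""

    taskforce_uuid: str
    phase: str  # hypothetical values: "pending-start", "pending-stop", "running"
    cluster_id: Optional[int] = None  # set once an HTCondor cluster exists


def plan_actions(taskforces: List[Taskforce]) -> List[Tuple[str, str, Optional[List[str]]]]:
    """Return (action, taskforce_uuid, argv) tuples for one polling pass."""
    actions = []
    for tf in taskforces:
        if tf.phase == "pending-start" and tf.cluster_id is None:
            # One new HTCondor cluster per taskforce (1:1); submission is not shown.
            actions.append(("start", tf.taskforce_uuid, None))
        elif tf.phase == "pending-stop" and tf.cluster_id is not None:
            # `condor_rm <cluster_id>` removes every job in that cluster.
            actions.append(("stop", tf.taskforce_uuid, ["condor_rm", str(tf.cluster_id)]))
        # Anything else (e.g. already running) needs no start/stop action.
    return actions
```

Keeping the decision separate from its side effects (submitting, or running the condor_rm argument list) makes the polling logic easy to test without an HTCondor pool.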
Watching the Job Event Logs
Concurrently, the service watches each job event log and sends the WMS updates for every taskforce recorded in it. Taskforces that start on the same day share a job event log. A new log file is created as needed, and files are deleted after a period of inactivity.
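The per-day sharing and inactivity cleanup of job event logs could look roughly like the following. This is a sketch under stated assumptions: the file naming scheme ("tms-YYYY-MM-DD.log"), the use of file modification time as the inactivity signal, and both function names are hypothetical and not taken from the TMS source.

```python
import time
from datetime import date
from pathlib import Path
from typing import List, Optional


def job_event_log_path(log_dir: Path, start_day: date) -> Path:
    """Return the log file shared by all taskforces started on `start_day`."""
    # Hypothetical naming scheme: one file per calendar day.
    return log_dir / f"tms-{start_day.isoformat()}.log"


def find_inactive_logs(
    log_dir: Path, max_idle_seconds: float, now: Optional[float] = None
) -> List[Path]:
    """List log files last modified more than `max_idle_seconds` ago."""
    now = time.time() if now is None else now
    return sorted(
        p
        for p in log_dir.glob("tms-*.log")
        if now - p.stat().st_mtime > max_idle_seconds
    )
```

A caller would pass the directory configured in JOB_EVENT_LOG_DIR and delete whatever find_inactive_logs returns. The actual inactivity threshold is not stated in this document.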
To keep the TMS stateless, taskforce updates that were already snapshotted are re-sent to the WMS whenever the TMS restarts. The WMS handles these re-sent updates appropriately.
How to Build
The image-publish.yml GitHub Actions workflow publishes this package as an Apptainer image in CVMFS when a new release is made.
How to Run
In production, a TMS instance runs on an HTCondor Access Point (AP) using systemd. The files for this are in tms-prod/ and tms-dev/, with additional helper scripts in resources/systemd/.
Whichever systemd variant you choose, an envfile is required. The file for tms-prod looks something like this (minus the redactions):
EWMS_ADDRESS="https://ewms-prod.icecube.aq"
EWMS_CLIENT_ID="ewms-tms-prod"
EWMS_CLIENT_SECRET="XXXX"
EWMS_TOKEN_URL="https://keycloak.icecube.wisc.edu/auth/realms/IceCube"
JOB_EVENT_LOG_DIR="/.../tms-prod/jobs"
TMS_ENV_VARS_AND_VALS_ADD_TO_PILOT="_EWMS_PILOT_APPTAINER_BUILD_WORKDIR=/srv/var_tmp/"
TMS_WATCHER_INTERVAL="15"
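For orientation, here is a minimal sketch of how a Python process might read these variables at startup. It is not the TMS's actual loader: the TmsEnv class and load_tms_env function are hypothetical, every variable is assumed to be required, and TMS_ENV_VARS_AND_VALS_ADD_TO_PILOT is assumed to hold a single NAME=VALUE pair as in the example above (how multiple pairs are delimited is not stated here).

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Tuple


@dataclass(frozen=True)
class TmsEnv:
    """Hypothetical typed view of the envfile shown above."""

    ewms_address: str
    ewms_client_id: str
    ewms_client_secret: str
    ewms_token_url: str
    job_event_log_dir: Path
    pilot_env_addition: Tuple[str, str]  # (name, value) forwarded to the pilot
    watcher_interval: int


def load_tms_env(environ: Mapping[str, str] = os.environ) -> TmsEnv:
    """Build a TmsEnv; raises KeyError if a variable is missing."""
    # partition() splits on the first "=" only, so the value may itself contain "=".
    name, _, value = environ["TMS_ENV_VARS_AND_VALS_ADD_TO_PILOT"].partition("=")
    return TmsEnv(
        ewms_address=environ["EWMS_ADDRESS"],
        ewms_client_id=environ["EWMS_CLIENT_ID"],
        ewms_client_secret=environ["EWMS_CLIENT_SECRET"],
        ewms_token_url=environ["EWMS_TOKEN_URL"],
        job_event_log_dir=Path(environ["JOB_EVENT_LOG_DIR"]),
        pilot_env_addition=(name, value),
        watcher_interval=int(environ["TMS_WATCHER_INTERVAL"]),
    )
```

Failing fast on a missing or non-integer value surfaces a misconfigured envfile at service start rather than partway through a watcher pass.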
How to Update in Production
Use the helper script, update_tms_image_symlink.sh, to roll out a new TMS version on an HTCondor Access Point (AP) using systemd:
ewms@sub-2 ~/resources/systemd/tms-dev $ ./update_tms_image_symlink.sh v1.2.3
EWMS Glossary Applied to the TMS
Workflow
Is not relevant to the TMS. Compare to WMS.
Task
A task is not a first-order object in the TMS. However, each taskforce holds references to a container, arguments, environment variables, etc. In other words, the WMS supplies the TMS with only the task’s relevant information. Compare to WMS.
Task Directive
Is not relevant to the TMS. Compare to WMS.
Taskforce
The taskforce is the primary object within the TMS. It is associated with one HTCondor cluster. See Taskforce’s cluster_id.
Compare to WMS.
Cluster
The cluster is the realization of a taskforce within an HTCondor pool. The two are mapped 1:1 and are nearly synonymous at a high level.
However, the term “cluster” is used exclusively within the context of an HTCondor pool, the job event log, and debugging. Unlike the taskforce, the cluster is not relevant in the broader EWMS context.